Enabling the AtomicAdd Option for Matmul

[Priority] Medium

[Description] If the result matrix C (m, n) computed by Matmul needs to be added to a matrix D (m, n) residing in GM, set the enAtomic parameter of the GetTensorC or IterateAll API to 1 to enable the AtomicAdd operation. When matrix C is copied out to GM, its values are accumulated directly at the GM address of matrix D, implementing the Add with matrix D as part of the copy-out.
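
At the call site this is a one-line change. A minimal sketch, assuming mm is a configured Matmul object and gm_d is the GlobalTensor bound to matrix D:

 mm.IterateAll(gm_d, 1); // enAtomic = 1: accumulate the result C onto D at its GM address
 // Per-iteration form: mm.GetTensorC(gm_d, 1);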

[Negative Example]

Without AtomicAdd, the Matmul result matrix C and matrix D are each copied from GM to the UB, added on the Vector unit, and the sum is then copied back to GM. This requires at least one extra UB buffer for matrix D. On a processor with the separated architecture, it also adds three data transfers: matrix C from GM to the UB, matrix D from GM to the UB, and the Add result from the UB to GM.

 template <class A_TYPE, class B_TYPE, class C_TYPE, class BIAS_TYPE>
 __aicore__ inline void MatMulKernel(...)
 {
    ...
    Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_MDL> mm;
    TPipe pipe;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm);

    mm.SetTensorA(gm_a);
    mm.SetTensorB(gm_b);
    mm.SetBias(gm_bias);

    mm.IterateAll(local_c); // Copy the Matmul result C to the UB

    // Per-iteration alternative:
    // while (mm.Iterate()) {
    //     mm.GetTensorC(local_c);
    // }

    DataCopy(local_d, gm_d, d_size); // Copy matrix D from GM to the UB
    // The vector Add must not start before the MTE2 copy-in of D completes.
    event_t eventIdMTE2ToV = static_cast<event_t>(GetTPipePtr()->FetchEventID(HardEvent::MTE2_V));
    SetFlag<HardEvent::MTE2_V>(eventIdMTE2ToV);
    WaitFlag<HardEvent::MTE2_V>(eventIdMTE2ToV);
    Add(local_d, local_d, local_c, d_size); // D = D + C on the Vector unit
    DataCopy(gm_d, local_d, d_size); // Copy the sum back to GM
    ...
 }

 extern "C" __global__ __aicore__ void example_kernel(...)
 {
     ...
     typedef MatmulType<TPosition::GM, CubeFormat::ND, half> aType; 
     typedef MatmulType<TPosition::GM, CubeFormat::ND, half> bType; 
     typedef MatmulType<TPosition::GM, CubeFormat::ND, float> cType; 
     typedef MatmulType<TPosition::GM, CubeFormat::ND, float> biasType;
     MatMulKernel<aType, bType, cType, biasType>(...);
     ...
 }

[Positive Example]

When the Matmul result is computed, call the IterateAll or GetTensorC API to copy the result directly to the GM address of matrix D, with the enAtomic parameter set to 1. During the copy-out to GM, the result matrix C is accumulated onto matrix D, yielding the sum of the two matrices without any extra UB buffer or synchronization.

 template <class A_TYPE, class B_TYPE, class C_TYPE, class BIAS_TYPE>
 __aicore__ inline void MatMulKernel(...)
 {
    ...
    Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_MDL> mm;
    TPipe pipe;
    REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm);

    mm.SetTensorA(gm_a);
    mm.SetTensorB(gm_b);
    mm.SetBias(gm_bias);

    mm.IterateAll(gm_d, 1); // Set enAtomic in the IterateAll API to 1.
    
    // while (mm.Iterate()) {
    //     mm.GetTensorC(gm_d, 1); // Set enAtomic in the GetTensorC API to 1.
    // }
    ...
 }

 extern "C" __global__ __aicore__ void example_kernel(...)
 {
     ...
     typedef MatmulType<TPosition::GM, CubeFormat::ND, half> aType; 
     typedef MatmulType<TPosition::GM, CubeFormat::ND, half> bType; 
     typedef MatmulType<TPosition::GM, CubeFormat::ND, float> cType; 
     typedef MatmulType<TPosition::GM, CubeFormat::ND, float> biasType;
     MatMulKernel<aType, bType, cType, biasType>(...);
     ...
 }
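
Note that matrix D must already hold its values at the target GM address before the accumulation: with enAtomic set to 1, the copy-out adds the Matmul result onto whatever is stored there, so no extra UB buffer, vector Add, or inter-pipe synchronization is needed.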

[Performance Comparison]

Figure 1 Performance comparison before and after the AtomicAdd option is enabled for Matmul

Assume that the matrix dimensions are M = 64, N = 256, K = 256, and matrix D is (64, 256). The preceding figure compares performance before and after the AtomicAdd option is enabled: the average cycle count drops from 154181 to 135054, an improvement of (154181 - 135054) / 154181 ≈ 12.4%. In this scenario, enabling the AtomicAdd option therefore achieves better performance.