Enabling the AtomicAdd Option for Matmul
[Priority] Medium
[Description] If the result matrix C (m, n) obtained by Matmul needs to be added to the matrix D(m, n) on the GM, the enAtomic parameter in the GetTensorC or IterateAll API of the GM can be set to 1 to enable the AtomicAdd operation. After matrix C is transferred out to the GM, the result of matrix C is directly accumulated to the GM address of matrix D, implementing the Add operation with matrix D.
[Negative Example]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | template <class A_TYPE, class B_TYPE, class C_TYPE, class BIAS_TYPE> __aicore__ inline void MatMulKernel(...) { ... AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_MDL> mm; TPipe pipe; REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm); mm.SetTensorA(gm_a); mm.SetTensorB(gm_b); mm.SetBias(gm_bias); mm.IterateAll(gm_c); DataCopy(local_c, gm_c, c_size); DataCopy(local_d, gm_d, d_size); event_t eventIdMTE2ToV = static_cast<event_t>(GetTPipePtr()->FetchEventID(HardEvent::MTE2_V)); SetFlag<HardEvent::MTE2_V>(eventIdMTE2ToV); WaitFlag<HardEvent::MTE2_V>(eventIdMTE2ToV); Add(local_d, local_d, local_c, d_size); DataCopy(gm_d, local_d, d_size); ... } extern "C" __global__ __aicore__ void example_kernel(...) { ... typedef AscendC::MatmulType<TPosition::GM, CubeFormat::ND, half> aType; typedef AscendC::MatmulType<TPosition::GM, CubeFormat::ND, half> bType; typedef AscendC::MatmulType<TPosition::GM, CubeFormat::ND, float> cType; typedef AscendC::MatmulType<TPosition::GM, CubeFormat::ND, float> biasType; MatMulKernel<aType, bType, cType, biasType>(...); ... } |
[Positive Example]
When the Matmul result is computed, the IterateAll or GetTensorC API is called to transfer the result to the GM address of matrix D, and the enAtomic parameter in the API is set to 1. After the result is transferred to the GM, the Matmul result matrix C is accumulated to matrix D, so as to obtain the result after the two matrices are added.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | template <class A_TYPE, class B_TYPE, class C_TYPE, class BIAS_TYPE> __aicore__ inline void MatMulKernel(...) { ... AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_MDL> mm; TPipe pipe; REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm); mm.SetTensorA(gm_a); mm.SetTensorB(gm_b); mm.SetBias(gm_bias); mm.IterateAll(gm_d, 1); // Set enAtomic in the IterateAll API to 1. // while (mm. Iterate()) { // mm.GetTensorC(gm_d, 1); // Set enAtomic in the GetTensorC API to 1. // } ... } extern "C" __global__ __aicore__ void example_kernel(...) { ... typedef AscendC::MatmulType<TPosition::GM, CubeFormat::ND, half> aType; typedef AscendC::MatmulType<TPosition::GM, CubeFormat::ND, half> bType; typedef AscendC::MatmulType<TPosition::GM, CubeFormat::ND, float> cType; typedef AscendC::MatmulType<TPosition::GM, CubeFormat::ND, float> biasType; MatMulKernel<aType, bType, cType, biasType>(...); ... } |
[Performance Comparison]

Assume that the matrix dimensions are M = 64, N = 256, K = 256, and matrix D is (64, 256). The preceding figure shows the performance comparison before and after the AtomicAdd option is enabled. The average number of cycles changes from 154181 to 135054 after the AtomicAdd option is enabled, with performance improved by 12.4%. Therefore, in this scenario, enabling the AtomicAdd option can achieve better performance.