SetAtomicAdd

Function Usage

Sets atomic accumulation for subsequent data transfers from VECOUT/L0C/L1 to GM: while enabled, data written to GM is added to the existing GM content instead of overwriting it. The accumulation data type is selected through the template parameter.

Prototype

template <typename T>
__aicore__ inline void SetAtomicAdd() {}

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type used for the atomic addition.

For the Atlas Training Series Product, the supported data type is float, and the supported data path is VECOUT -> GM.

Returns

None

Availability

Atlas Training Series Product

Precautions

  • After the accumulation is complete, you are advised to disable atomic addition by calling SetAtomicNone; otherwise, subsequent instructions are also executed in accumulation mode.
  • This instruction does not clear the GM data before execution. Because results are accumulated onto whatever the destination already holds, determine whether the destination GM data needs to be cleared based on the actual operator logic and clear it as required.

Example

In this example, atomic accumulation is enabled while DataCopy moves data from VECOUT to the external dstGlobal. To ensure correct accumulation results, clear dstGm before calling the kernel function.

When the kernel function is called, blockDim is set to 3, so three cores each accumulate their result into the same dstGm. The following is an example of calling the kernel function:

...
// x is the input, and z is the output.
set_atomic_add_ops_kernel<<<3, nullptr, stream>>>(x, z);
...

The following is an example of the kernel function:

#include "kernel_operator.h"
class KernelSetAtomicAdd {
public:
    __aicore__ inline KernelSetAtomicAdd() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ float*)src0Gm);
        dstGlobal.SetGlobalBuffer((__gm__ float*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 256 * sizeof(float));
        pipe.InitBuffer(outQueueDst, 1, 256 * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {}
    __aicore__ inline void Compute()
    {}
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>();
        AscendC::SetAtomicNone();
        // Copy the source data from GM to the local tensor (MTE2).
        AscendC::DataCopy(src0Local, src0Global, 256);
        // Ensure the GM -> local copy completes before the local -> GM copy starts.
        AscendC::SetFlag<AscendC::HardEvent::MTE2_MTE3>(0);
        AscendC::WaitFlag<AscendC::HardEvent::MTE2_MTE3>(0);

        // With atomic add enabled, the following copy performs dstGlobal += src0Local.
        AscendC::SetAtomicAdd<float>();
        AscendC::DataCopy(dstGlobal, src0Local, 256);
        // Disable atomic accumulation so that subsequent instructions are unaffected.
        AscendC::SetAtomicNone();
        inQueueSrc0.FreeTensor(src0Local);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<float> src0Global, dstGlobal;
};
extern "C" __global__ __aicore__ void set_atomic_add_ops_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
{
    KernelSetAtomicAdd op;
    op.Init(src0Gm, dstGm);
    op.Process();
}

Result example:

Input data src0 of each core: [1, 1, 1, 1, 1, ..., 1] // 256 elements, all 1
Final output data dstGm: [3, 3, 3, 3, 3, ..., 3] // 256 elements, all 3 (three cores each accumulate 1)