Fill

Applicability

Product

Supported

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product's AI Core

Atlas inference product's Vector Core

x

Atlas training products

x

Function

Initializes data in the global memory to a specified value. This API can be used to clear the workspace address or output data.

Prototype

1
2
template <typename T>
__aicore__ inline void Fill(GlobalTensor<T>& gmWorkspaceAddr, const uint64_t size, const T value)

Parameters

Table 1 Template parameters

Parameter

Description

T

Data type of the operand.

For the Atlas A3 training products/Atlas A3 inference products, the supported data types are uint16_t, int16_t, half, uint32_t, int32_t, and float.

For the Atlas A2 training products/Atlas A2 inference products, the supported data types are uint16_t, int16_t, half, uint32_t, int32_t, and float.

For the Atlas inference product's AI Core, the supported data types are uint16_t, int16_t, half, uint32_t, int32_t, and float.

Table 2 API parameters

Parameter

Input/Output

Description

gmWorkspaceAddr

Input

User-defined global space, which needs to be initialized. The type is GlobalTensor. For details about the definition of the GlobalTensor data structure, see GlobalTensor.

size

Input

Size of the space to be initialized. The unit is the number of elements.

value

Input

Initialized value. The supported data types are the same as those of gmWorkspaceAddr.

Returns

None

Restrictions

  • When a single core calls this API and the subsequent operations involve the usage of the Unified Buffer, the synchronization for the MTE2 pipeline to wait for the MTE3 pipeline ( MTE3_MTE2) needs to be set after the API is called.
  • When multiple cores call this API to initialize data in the global memory, the initialization may not complete simultaneously across all cores. Additionally, data dependency issues such as read-after-write, write-after-read, and write-after-write may occur between cores. In this scenario, you can call the SyncAll API after this API to ensure correct synchronization between multiple cores.
  • This API can be used only before the InitBuffer API is called to allocate the program memory.

Example

In this example, eight cores are used. Each core uses the current blockIdx value to initialize 65536 pieces of data on zGm. The intra-core calculation for each core sums 65536 pieces of half-type data, all 1s, in the x and y groups, and the result is accumulated to zGm. The blockIdx values for the eight cores range from 0 to 7. If the input x and y values are all 1s, the final output data of zGm ranges from 2 to 9.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
#include "kernel_operator.h"

constexpr int32_t INIT_SIZE = 65536;

class KernelFill {
public:
    __aicore__ inline KernelFill() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, TPipe* pipe)
    {
        xGm.SetGlobalBuffer((__gm__ half*)x + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE);
        yGm.SetGlobalBuffer((__gm__ half*)y + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE);
        zGm.SetGlobalBuffer((__gm__ half*)z + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE);
        // init zGm value
        AscendC::Fill(zGm, INIT_SIZE, (half)(AscendC::GetBlockIdx()));

        AscendC::TEventID eventIdMTE3ToMTE2 = GetTPipePtr()->FetchEventID(AscendC::HardEvent::MTE3_MTE2);
        AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(eventIdMTE3ToMTE2);
        AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(eventIdMTE3ToMTE2);

        pipe->InitBuffer(inQueueX, 1, INIT_SIZE * sizeof(half));
        pipe->InitBuffer(inQueueY, 1, INIT_SIZE * sizeof(half));
        pipe->InitBuffer(outQueueZ, 1, INIT_SIZE * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        AscendC::DataCopy(xLocal, xGm, INIT_SIZE);
        AscendC::DataCopy(yLocal, yGm, INIT_SIZE);
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        AscendC::Add(zLocal, xLocal, yLocal, INIT_SIZE);
        outQueueZ.EnQue<half>(zLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        // add result to zGm
        AscendC::SetAtomicAdd<half>();
        AscendC::DataCopy(zGm, zLocal, INIT_SIZE);
        AscendC::SetAtomicNone();
        outQueueZ.FreeTensor(zLocal);
    }
private:
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX, inQueueY;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueZ;
    AscendC::GlobalTensor<half> xGm;
    AscendC::GlobalTensor<half> yGm;
    AscendC::GlobalTensor<half> zGm;
};

extern "C" __global__ __aicore__ void init_global_memory_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    KernelFill op;
    TPipe pipe;
    op.Init(x, y, z, &pipe);
    op.Process();
}

Result example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
Input (x):
[1. 1. 1. 1. 1. ... 1.]
Input (y):
[1. 1. 1. 1. 1. ... 1.]
Output (z):
[2. 2. 2. 2. 2. ... 2.
3. 3. 3. 3. 3. ... 3.
4. 4. 4. 4. 4. ... 4.
5. 5. 5. 5. 5. ... 5.
6. 6. 6. 6. 6. ... 6.
7. 7. 7. 7. 7. ... 7.
8. 8. 8. 8. 8. ... 8.
9. 9. 9. 9. 9. ... 9.]