Fill

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Initializes data in the global memory to a specified value. This API can be used to clear the workspace address or output data.

Prototype

template <typename T>
__aicore__ inline void Fill(GlobalTensor<T>& gmWorkspaceAddr, const uint64_t size, const T value)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the operand. For the Atlas A3 training products/Atlas A3 inference products, the supported data types are uint16_t, int16_t, half, uint32_t, int32_t, and float. For the Atlas A2 training products/Atlas A2 inference products, the supported data types are uint16_t, int16_t, half, uint32_t, int32_t, and float. For the Atlas inference product's AI Core, the supported data types are uint16_t, int16_t, half, uint32_t, int32_t, and float.

**Table 2** API parameters
Parameter	Input/Output	Description
gmWorkspaceAddr	Input	User-defined global space, which needs to be initialized. The type is GlobalTensor. For details about the definition of the GlobalTensor data structure, see GlobalTensor.
size	Input	Size of the space to be initialized. The unit is the number of elements.
value	Input	Initialized value. The supported data types are the same as those of gmWorkspaceAddr.

Returns

None

Restrictions

When a single core calls this API and the subsequent operations involve the usage of the Unified Buffer, the synchronization for the MTE2 pipeline to wait for the MTE3 pipeline ( MTE3_MTE2) needs to be set after the API is called.
When multiple cores call this API to initialize data in the global memory, the initialization may not complete simultaneously across all cores. Additionally, data dependency issues such as read-after-write, write-after-read, and write-after-write may occur between cores. In this scenario, you can call the SyncAll API after this API to ensure correct synchronization between multiple cores.
This API can be used only before the InitBuffer API is called to allocate the program memory.

Example

In this example, eight cores are used. Each core uses the current blockIdx value to initialize 65536 pieces of data on zGm. The intra-core calculation for each core sums 65536 pieces of half-type data, all 1s, in the x and y groups, and the result is accumulated to zGm. The blockIdx values for the eight cores range from 0 to 7. If the input x and y values are all 1s, the final output data of zGm ranges from 2 to 9.

#include "kernel_operator.h"

constexpr int32_t INIT_SIZE = 65536;

class KernelFill {
public:
    __aicore__ inline KernelFill() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, TPipe* pipe)
    {
        xGm.SetGlobalBuffer((__gm__ half*)x + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE);
        yGm.SetGlobalBuffer((__gm__ half*)y + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE);
        zGm.SetGlobalBuffer((__gm__ half*)z + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE);
        // init zGm value
        AscendC::Fill(zGm, INIT_SIZE, (half)(AscendC::GetBlockIdx()));

        AscendC::TEventID eventIdMTE3ToMTE2 = GetTPipePtr()->FetchEventID(AscendC::HardEvent::MTE3_MTE2);
        AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(eventIdMTE3ToMTE2);
        AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(eventIdMTE3ToMTE2);

        pipe->InitBuffer(inQueueX, 1, INIT_SIZE * sizeof(half));
        pipe->InitBuffer(inQueueY, 1, INIT_SIZE * sizeof(half));
        pipe->InitBuffer(outQueueZ, 1, INIT_SIZE * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        AscendC::DataCopy(xLocal, xGm, INIT_SIZE);
        AscendC::DataCopy(yLocal, yGm, INIT_SIZE);
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        AscendC::Add(zLocal, xLocal, yLocal, INIT_SIZE);
        outQueueZ.EnQue<half>(zLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        // add result to zGm
        AscendC::SetAtomicAdd<half>();
        AscendC::DataCopy(zGm, zLocal, INIT_SIZE);
        AscendC::SetAtomicNone();
        outQueueZ.FreeTensor(zLocal);
    }
private:
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX, inQueueY;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueZ;
    AscendC::GlobalTensor<half> xGm;
    AscendC::GlobalTensor<half> yGm;
    AscendC::GlobalTensor<half> zGm;
};

extern "C" __global__ __aicore__ void init_global_memory_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    KernelFill op;
    TPipe pipe;
    op.Init(x, y, z, &pipe);
    op.Process();
}

Result example:

Input (x):
[1. 1. 1. 1. 1. ... 1.]
Input (y):
[1. 1. 1. 1. 1. ... 1.]
Output (z):
[2. 2. 2. 2. 2. ... 2.
3. 3. 3. 3. 3. ... 3.
4. 4. 4. 4. 4. ... 4.
5. 5. 5. 5. 5. ... 5.
6. 6. 6. 6. 6. ... 6.
7. 7. 7. 7. 7. ... 7.
8. 8. 8. 8. 8. ... 8.
9. 9. 9. 9. 9. ... 9.]

Parent topic: Tensor Transformation