InitGlobalMemory

Function Usage

Initializes data in the global memory to a specified value. This API can be used to clear the workspace address or output data.

Prototype

1
2
template <typename T>
__aicore__ inline void InitGlobalMemory(GlobalTensor<T>& gmWorkspaceAddr, const uint64_t size, const T value)

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the operand.

Table 2 API parameters

Parameter

Input/Output

Meaning

gmWorkspaceAddr

Input

User-defined global space, which needs to be initialized. The type is GlobalTensor. For details about the definition of the GlobalTensor data structure, see GlobalTensor.

size

Input

Size of the space to be initialized. The unit is the number of elements.

value

Input

Initialized value. The supported data types are the same as those of gmWorkspaceAddr.

Returns

None

Availability

Constraints

  • When multiple cores call this API to initialize data in the global memory, the initialization may not complete simultaneously across all cores. Additionally, data dependency issues such as read-after-write, write-after-read, and write-after-write may occur between cores. In this scenario, you can call the SyncAll API after this API to ensure correct synchronization between multiple cores.
  • This API can be used only before the InitBuffer API is called to allocate the program memory.

Example

In this example, eight cores are used. Each core uses the current blockIdx value to initialize 65536 pieces of data on zGm. The intra-core calculation for each core sums 65536 pieces of half-type data, all 1s, in the x and y groups, and the result is accumulated to zGm. The blockIdx values for the eight cores range from 0 to 7. If the input x and y values are all 1s, the final output data of zGm ranges from 2 to 9.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
#include "kernel_operator.h"

constexpr int32_t INIT_SIZE = 65536;

class KernelInitGlobalMemory {
public:
    __aicore__ inline KernelInitGlobalMemory() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, TPipe* pipe)
    {
        xGm.SetGlobalBuffer((__gm__ half*)x + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE);
        yGm.SetGlobalBuffer((__gm__ half*)y + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE);
        zGm.SetGlobalBuffer((__gm__ half*)z + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE);
        // init zGm value
        AscendC::InitGlobalMemory(zGm, INIT_SIZE, (half)(AscendC::GetBlockIdx()));
        pipe->InitBuffer(inQueueX, 1, INIT_SIZE * sizeof(half));
        pipe->InitBuffer(inQueueY, 1, INIT_SIZE * sizeof(half));
        pipe->InitBuffer(outQueueZ, 1, INIT_SIZE * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        AscendC::DataCopy(xLocal, xGm, INIT_SIZE);
        AscendC::DataCopy(yLocal, yGm, INIT_SIZE);
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        AscendC::Add(zLocal, xLocal, yLocal, INIT_SIZE);
        outQueueZ.EnQue<half>(zLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        // add result to zGm
        AscendC::SetAtomicAdd<half>();
        AscendC::DataCopy(zGm, zLocal, INIT_SIZE);
        AscendC::SetAtomicNone();
        outQueueZ.FreeTensor(zLocal);
    }
private:
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX, inQueueY;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueZ;
    AscendC::GlobalTensor<half> xGm;
    AscendC::GlobalTensor<half> yGm;
    AscendC::GlobalTensor<half> zGm;
};

extern "C" __global__ __aicore__ void init_global_memory_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    KernelInitGlobalMemory op;
    TPipe pipe;
    op.Init(x, y, z, &pipe);
    op.Process();
}

Result example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
Input (x):
[1. 1. 1. 1. 1. ... 1.]
Input (y):
[1. 1. 1. 1. 1. ... 1.]
Output (z):
[2. 2. 2. 2. 2. ... 2.
3. 3. 3. 3. 3. ... 3.
4. 4. 4. 4. 4. ... 4.
5. 5. 5. 5. 5. ... 5.
6. 6. 6. 6. 6. ... 6.
7. 7. 7. 7. 7. ... 7.
8. 8. 8. 8. 8. ... 8.
9. 9. 9. 9. 9. ... 9.]