InitGlobalMemory
Function Usage
Initializes data in the global memory to a specified value. This API can be used to clear the workspace address or output data.
Prototype
1 2 | template <typename T> __aicore__ inline void InitGlobalMemory(GlobalTensor<T>& gmWorkspaceAddr, const uint64_t size, const T value) |
Parameters
Parameter |
Description |
|---|---|
T |
Data type of the operand. |
Parameter |
Input/Output |
Meaning |
|---|---|---|
gmWorkspaceAddr |
Input |
User-defined global space, which needs to be initialized. The type is GlobalTensor. For details about the definition of the GlobalTensor data structure, see GlobalTensor. |
size |
Input |
Size of the space to be initialized. The unit is the number of elements. |
value |
Input |
Initialized value. The supported data types are the same as those of gmWorkspaceAddr. |
Returns
None
Availability
Constraints
- When multiple cores call this API to initialize data in the global memory, the initialization may not complete simultaneously across all cores. Additionally, data dependency issues such as read-after-write, write-after-read, and write-after-write may occur between cores. In this scenario, you can call the SyncAll API after this API to ensure correct synchronization between multiple cores.
- This API can be used only before the InitBuffer API is called to allocate the program memory.
Example
In this example, eight cores are used. Each core uses the current blockIdx value to initialize 65536 pieces of data on zGm. The intra-core calculation for each core sums 65536 pieces of half-type data, all 1s, in the x and y groups, and the result is accumulated to zGm. The blockIdx values for the eight cores range from 0 to 7. If the input x and y values are all 1s, the final output data of zGm ranges from 2 to 9.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 | #include "kernel_operator.h" constexpr int32_t INIT_SIZE = 65536; class KernelInitGlobalMemory { public: __aicore__ inline KernelInitGlobalMemory() {} __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, TPipe* pipe) { xGm.SetGlobalBuffer((__gm__ half*)x + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE); yGm.SetGlobalBuffer((__gm__ half*)y + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE); zGm.SetGlobalBuffer((__gm__ half*)z + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE); // init zGm value AscendC::InitGlobalMemory(zGm, INIT_SIZE, (half)(AscendC::GetBlockIdx())); pipe->InitBuffer(inQueueX, 1, INIT_SIZE * sizeof(half)); pipe->InitBuffer(inQueueY, 1, INIT_SIZE * sizeof(half)); pipe->InitBuffer(outQueueZ, 1, INIT_SIZE * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>(); AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>(); AscendC::DataCopy(xLocal, xGm, INIT_SIZE); AscendC::DataCopy(yLocal, yGm, INIT_SIZE); inQueueX.EnQue(xLocal); inQueueY.EnQue(yLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>(); AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>(); AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>(); AscendC::Add(zLocal, xLocal, yLocal, INIT_SIZE); outQueueZ.EnQue<half>(zLocal); inQueueX.FreeTensor(xLocal); inQueueY.FreeTensor(yLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>(); // add result to zGm AscendC::SetAtomicAdd<half>(); AscendC::DataCopy(zGm, zLocal, INIT_SIZE); AscendC::SetAtomicNone(); outQueueZ.FreeTensor(zLocal); } private: AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX, inQueueY; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueZ; AscendC::GlobalTensor<half> xGm; AscendC::GlobalTensor<half> yGm; AscendC::GlobalTensor<half> zGm; }; extern "C" __global__ __aicore__ void init_global_memory_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z) { KernelInitGlobalMemory op; TPipe pipe; op.Init(x, y, z, &pipe); op.Process(); } |
Result example:
1 2 3 4 5 6 7 8 9 10 11 12 13 | Input (x): [1. 1. 1. 1. 1. ... 1.] Input (y): [1. 1. 1. 1. 1. ... 1.] Output (z): [2. 2. 2. 2. 2. ... 2. 3. 3. 3. 3. 3. ... 3. 4. 4. 4. 4. 4. ... 4. 5. 5. 5. 5. 5. ... 5. 6. 6. 6. 6. 6. ... 6. 7. 7. 7. 7. 7. ... 7. 8. 8. 8. 8. 8. ... 8. 9. 9. 9. 9. 9. ... 9.] |
Parent topic: Tool