Fill
Applicability
Product |
Supported |
|---|---|
√ |
|
√ |
|
x |
|
√ |
|
x |
|
x |
Function
Initializes data in the global memory to a specified value. This API can be used to clear the workspace address or output data.
Prototype
1 2 | template <typename T> __aicore__ inline void Fill(GlobalTensor<T>& gmWorkspaceAddr, const uint64_t size, const T value) |
Parameters
Parameter |
Description |
|---|---|
T |
Data type of the operand. For the For the For the |
Parameter |
Input/Output |
Description |
|---|---|---|
gmWorkspaceAddr |
Input |
User-defined global space, which needs to be initialized. The type is GlobalTensor. For details about the definition of the GlobalTensor data structure, see GlobalTensor. |
size |
Input |
Size of the space to be initialized. The unit is the number of elements. |
value |
Input |
Initialized value. The supported data types are the same as those of gmWorkspaceAddr. |
Returns
None
Restrictions
- When a single core calls this API and the subsequent operations involve the usage of the Unified Buffer, the synchronization for the MTE2 pipeline to wait for the MTE3 pipeline ( MTE3_MTE2) needs to be set after the API is called.
- When multiple cores call this API to initialize data in the global memory, the initialization may not complete simultaneously across all cores. Additionally, data dependency issues such as read-after-write, write-after-read, and write-after-write may occur between cores. In this scenario, you can call the SyncAll API after this API to ensure correct synchronization between multiple cores.
- This API can be used only before the InitBuffer API is called to allocate the program memory.
Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 | #include "kernel_operator.h" constexpr int32_t INIT_SIZE = 65536; class KernelFill { public: __aicore__ inline KernelFill() {} __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, TPipe* pipe) { xGm.SetGlobalBuffer((__gm__ half*)x + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE); yGm.SetGlobalBuffer((__gm__ half*)y + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE); zGm.SetGlobalBuffer((__gm__ half*)z + INIT_SIZE * AscendC::GetBlockIdx(), INIT_SIZE); // init zGm value AscendC::Fill(zGm, INIT_SIZE, (half)(AscendC::GetBlockIdx())); AscendC::TEventID eventIdMTE3ToMTE2 = GetTPipePtr()->FetchEventID(AscendC::HardEvent::MTE3_MTE2); AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(eventIdMTE3ToMTE2); AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(eventIdMTE3ToMTE2); pipe->InitBuffer(inQueueX, 1, INIT_SIZE * sizeof(half)); pipe->InitBuffer(inQueueY, 1, INIT_SIZE * sizeof(half)); pipe->InitBuffer(outQueueZ, 1, INIT_SIZE * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>(); AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>(); AscendC::DataCopy(xLocal, xGm, INIT_SIZE); AscendC::DataCopy(yLocal, yGm, INIT_SIZE); inQueueX.EnQue(xLocal); inQueueY.EnQue(yLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>(); AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>(); AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>(); AscendC::Add(zLocal, xLocal, yLocal, INIT_SIZE); outQueueZ.EnQue<half>(zLocal); inQueueX.FreeTensor(xLocal); inQueueY.FreeTensor(yLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>(); // add result to zGm AscendC::SetAtomicAdd<half>(); AscendC::DataCopy(zGm, zLocal, INIT_SIZE); AscendC::SetAtomicNone(); outQueueZ.FreeTensor(zLocal); } private: AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX, inQueueY; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueZ; AscendC::GlobalTensor<half> xGm; AscendC::GlobalTensor<half> yGm; AscendC::GlobalTensor<half> zGm; }; extern "C" __global__ __aicore__ void init_global_memory_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z) { KernelFill op; TPipe pipe; op.Init(x, y, z, &pipe); op.Process(); } |
Result example:
1 2 3 4 5 6 7 8 9 10 11 12 13 | Input (x): [1. 1. 1. 1. 1. ... 1.] Input (y): [1. 1. 1. 1. 1. ... 1.] Output (z): [2. 2. 2. 2. 2. ... 2. 3. 3. 3. 3. 3. ... 3. 4. 4. 4. 4. 4. ... 4. 5. 5. 5. 5. 5. ... 5. 6. 6. 6. 6. 6. ... 6. 7. 7. 7. 7. 7. ... 7. 8. 8. 8. 8. 8. ... 8. 9. 9. 9. 9. 9. ... 9.] |