IBSet
Function Usage
When multiple AI Cores operate on the same global memory block, this function can be called to synchronize the cores and avoid data-dependency hazards such as write-after-read, read-after-write, and write-after-write. IBSet sets the synchronization flag of a core. IBSet and IBWait are used in pairs to implement inter-core synchronization: IBWait blocks until the corresponding core operation has completed.
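The typical pairing pattern is sketched below: the producer core writes its data to global memory and then calls IBSet, while the consumer core calls IBWait with the same blockIdx and eventID before reading that data. This is a minimal sketch mirroring the full example later in this page; sync_gm (a GlobalTensor&lt;int32_t&gt; workspace) and sync_buf (a LocalTensor&lt;int32_t&gt; scratch buffer) are placeholders that must be prepared by the caller.

```cpp
if (AscendC::GetBlockIdx() == 0) {
    // ... core 0 writes its results to global memory ...
    AscendC::IBSet(sync_gm, sync_buf, 0, 0);  // set flag: blockIdx 0, eventID 0
} else if (AscendC::GetBlockIdx() == 1) {
    AscendC::IBWait(sync_gm, sync_buf, 0, 0); // wait for the matching IBSet
    // ... core 1 can now safely read core 0's results ...
}
```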
Prototype
```cpp
template <bool isAIVOnly = true>
__aicore__ inline void IBSet(const GlobalTensor<int32_t>& gmWorkspace, const LocalTensor<int32_t>& ubWorkspace,
                             int32_t blockIdx, int32_t eventID)
```
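Since isAIVOnly defaults to true, a plain call uses AIVOnly mode; for MIX mode, the template parameter would presumably be set explicitly, as in the hedged sketch below (gmWorkspace, ubWorkspace, and the index values are placeholders):

```cpp
// Default: AIVOnly mode (isAIVOnly = true).
AscendC::IBSet(gmWorkspace, ubWorkspace, 0, 0);
// MIX mode: pass isAIVOnly = false explicitly.
AscendC::IBSet<false>(gmWorkspace, ubWorkspace, 0, 0);
```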
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| gmWorkspace | Output | Public buffer for storing the external core status. The type is GlobalTensor. For details about the definition of the GlobalTensor data structure, see GlobalTensor. |
| ubWorkspace | Input | Public buffer that stores the current core status. The type is LocalTensor; the supported TPosition values are VECIN, VECCALC, and VECOUT. |
| blockIdx | Input | Index of the waiting core. The value ranges from 0 to (number of cores - 1). |
| eventID | Input | ID of the set/wait event of the current core. An IBSet call is matched with the IBWait call that uses the same blockIdx and eventID. |
| isAIVOnly | Input | Whether AIVOnly mode is used. The default value is true. |
Returns
None
Availability
Constraints
- The minimum space allocated for gmWorkspace is: (number of cores) * 32 bytes * eventID_max + blockIdx_max * 32 bytes + 32 bytes, where eventID_max and blockIdx_max are the maximum values of eventID and blockIdx, respectively. (See the worked calculation after this list.)
- In AIVOnly mode, the number of cores is GetBlockNum(); in MIX mode, it is GetBlockNum() × 2.
- The minimum size of ubWorkspace is 32 bytes.
- The gmWorkspace buffer must be initialized to 0 before use.
- When this API is used for multi-core control, the logical blockDim specified when the operator is called must be less than or equal to the number of physical cores actually running the operator. Otherwise, the framework inserts abnormal synchronization during multi-round scheduling, causing the kernel to hang.
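As a worked illustration of the size formula above, the snippet below computes the minimum gmWorkspace size for an assumed configuration of 8 cores in AIVOnly mode with eventID_max = 3 and blockIdx_max = 7 (all three values are example assumptions, not requirements):

```cpp
// Assumed configuration: 8 cores in AIVOnly mode, eventID in 0..3, blockIdx in 0..7.
constexpr int32_t coreNum = 8;     // GetBlockNum() in AIVOnly mode
constexpr int32_t eventIdMax = 3;  // maximum eventID used by the kernel
constexpr int32_t blockIdxMax = 7; // maximum blockIdx used by the kernel
// cores * 32 B * eventID_max + blockIdx_max * 32 B + 32 B
constexpr int32_t minGmWorkspaceBytes =
    coreNum * 32 * eventIdMax + blockIdxMax * 32 + 32; // 768 + 224 + 32 = 1024 bytes
```

Remember that this buffer must also be zero-initialized before the kernel runs, per the constraint above.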
Example
```cpp
#include "kernel_operator.h"

constexpr int32_t TOTAL_LENGTH = 2 * 256;
constexpr int32_t USE_CORE_NUM = 2;
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* sync, __gm__ uint8_t* z)
    {
        blockIdx = AscendC::GetBlockIdx();
        xGm.SetGlobalBuffer((__gm__ half*)x);
        yGm.SetGlobalBuffer((__gm__ half*)y);
        sync_gm.SetGlobalBuffer((__gm__ int32_t*)(sync), 256); // inter-core sync workspace
        zGm.SetGlobalBuffer((__gm__ half*)z);
        pipe.InitBuffer(inQueueX, 1, BLOCK_LENGTH * sizeof(half));
        pipe.InitBuffer(inQueueY, 1, BLOCK_LENGTH * sizeof(half));
        pipe.InitBuffer(vecIn, 1, 8 * sizeof(int32_t)); // 32-byte UB scratch for IBSet/IBWait
        pipe.InitBuffer(outQueueZ, 1, BLOCK_LENGTH * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        if (blockIdx == 1) {
            // Core 1 consumes core 0's output, so it waits for the matching IBSet.
            auto sync_buf = vecIn.AllocTensor<int32_t>();
            AscendC::IBWait(sync_gm, sync_buf, 0, 0);
            vecIn.FreeTensor(sync_buf);
        }
        CopyIn();
        Compute();
        CopyOut();
        if (blockIdx == 0) {
            // Core 0 signals that its result has been written to zGm.
            auto sync_buf = vecIn.AllocTensor<int32_t>();
            AscendC::IBSet(sync_gm, sync_buf, 0, 0);
            vecIn.FreeTensor(sync_buf);
        }
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        if (blockIdx == 1) {
            // Core 1 reads core 0's result block from zGm as its first operand.
            AscendC::DataCopy(xLocal, zGm[0 * BLOCK_LENGTH], BLOCK_LENGTH);
            AscendC::DataCopy(yLocal, yGm[1 * BLOCK_LENGTH], BLOCK_LENGTH);
        } else {
            AscendC::DataCopy(xLocal, xGm[0], BLOCK_LENGTH);
            AscendC::DataCopy(yLocal, yGm[0], BLOCK_LENGTH);
        }
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        AscendC::Add(zLocal, xLocal, yLocal, BLOCK_LENGTH);
        outQueueZ.EnQue<half>(zLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        AscendC::DataCopy(zGm[blockIdx * BLOCK_LENGTH], zLocal, BLOCK_LENGTH);
        outQueueZ.FreeTensor(zLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX, inQueueY, vecIn;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueZ;
    AscendC::GlobalTensor<half> xGm, yGm, zGm;
    AscendC::GlobalTensor<int32_t> sync_gm;
    int32_t blockIdx = 0;
};

extern "C" __global__ __aicore__ void add_simple_kernel(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* sync, __gm__ uint8_t* z)
{
    KernelAdd op;
    op.Init(x, y, sync, z);
    op.Process();
}
```
Input:
x: [1, 1, 1, ..., 1] // 512 ones
y: [1, 1, 1, ..., 1] // 512 ones
Output (zGm):
[2, 2, ..., 2, 3, 3, ..., 3] // The first 256 values are 2 (core 0 computes x + y), and the last 256 values are 3 (core 1 computes core 0's result + y).
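The sync buffer passed to this kernel must satisfy the constraints above: it must be large enough and zero-initialized before launch. A minimal host-side sketch using the ACL runtime follows; the size and the ACL calls reflect the generic CANN host flow and are assumptions here, not part of this API.

```cpp
#include "acl/acl.h"

// Hypothetical host-side preparation of the sync workspace for add_simple_kernel.
void* syncDevice = nullptr;
size_t syncSize = 256 * sizeof(int32_t); // matches SetGlobalBuffer(..., 256) in the kernel
aclrtMalloc(&syncDevice, syncSize, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMemset(syncDevice, syncSize, 0, syncSize); // gmWorkspace must start as all zeros
// ... launch add_simple_kernel with syncDevice as the sync argument,
// then release it with aclrtFree(syncDevice) after the kernel completes ...
```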