SyncAll
Function Usage
When different AI Cores operate on the same global memory block, this function can be called to synchronize the AI Cores and avoid data dependency hazards such as read-after-write, write-after-read, and write-after-write. Currently, multi-core synchronization is classified into hardware synchronization and software synchronization. Hardware synchronization uses the hardware's all-core synchronization instruction to synchronize the cores; software synchronization is implemented by simulating the synchronization in software.
Prototype
- Soft synchronization:
```cpp
template <bool isAIVOnly = true>
__aicore__ inline void SyncAll(const GlobalTensor<int32_t>& gmWorkspace, const LocalTensor<int32_t>& ubWorkspace, const int32_t usedCores = 0)
```
- Hard synchronization:
```cpp
template <bool isAIVOnly = true>
__aicore__ inline void SyncAll()
```
Parameters
| Parameter | Input/Output | Meaning |
|---|---|---|
| gmWorkspace | Input | User-defined global memory space that serves as the cache shared by all cores; it stores the status flag of each core. The type is GlobalTensor and the supported data type is int32_t. For the definition of the GlobalTensor data structure, see GlobalTensor. For the required space and precautions, see Constraints. The hardware synchronization API does not take this parameter. |
| ubWorkspace | Input | User-defined local memory space, used independently by each core to mark the status of the current core. Type: LocalTensor. Supported TPosition: VECIN, VECCALC, and VECOUT. Supported data type: int32_t. For the required space, see Constraints. The hardware synchronization API does not take this parameter. |
| usedCores | Input | Number of cores to be synchronized. The value cannot exceed the logical blockDim specified when the operator is called. This parameter is optional; if it is not passed in, soft synchronization across all cores is performed. This parameter is supported only by the soft synchronization API. |
| isAIVOnly | Input | Template parameter indicating whether synchronization is performed only among vector cores. The default value is true. To enable MIXCORE, set this parameter to false. |
Returns
None
Availability
Soft synchronization:
Atlas Training Series Product
Hard synchronization:
Constraints
- The size of the gmWorkspace cache must be greater than or equal to the number of cores multiplied by 32 bytes, and the cache must be initialized to 0. There are currently two common initialization approaches.
- Initialize on the host, so that the gmWorkspace cache is already zeroed when it is passed to this API.
- Initialize the gmWorkspace cache during kernel initialization. Note that the entire gmWorkspace cache must be initialized on each core.
- The size of the space allocated for ubWorkspace must be greater than or equal to the number of cores multiplied by 32 bytes.
- Currently, the hardware synchronization API cannot be used in the kernel launch project and can be used only in the custom operator project. In addition, the workspace size in the Tiling function cannot be set to 0.
- When this API is used for multi-core control, the logical blockDim specified when the operator is called must be less than or equal to the number of cores actually available to run the operator. Otherwise, the framework inserts unexpected synchronization during multi-round scheduling, causing the kernel to hang.
Example
```cpp
#include "kernel_operator.h"

const int32_t DEFAULT_SYNCALL_NEED_SIZE = 8;

class KernelSyncAll {
public:
    __aicore__ inline KernelSyncAll() {}
    __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm,
                                __gm__ uint8_t* workGm, __gm__ uint8_t* syncGm)
    {
        blockNum = AscendC::GetBlockNum();      // Obtain the total number of cores.
        perBlockSize = srcDataSize / blockNum;  // Each core processes the same amount of data.
        blockIdx = AscendC::GetBlockIdx();      // Obtain the ID of the current working core.

        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(srcGm + blockIdx * perBlockSize * sizeof(float)), perBlockSize);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(dstGm + blockIdx * perBlockSize * sizeof(float)), perBlockSize);
        workGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(workGm), srcDataSize);
        syncGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ int32_t*>(syncGm), blockNum * DEFAULT_SYNCALL_NEED_SIZE);

        pipe.InitBuffer(inQueueSrc1, 1, perBlockSize * sizeof(float));
        pipe.InitBuffer(inQueueSrc2, 1, perBlockSize * sizeof(float));
        pipe.InitBuffer(workQueue, 1, blockNum * DEFAULT_SYNCALL_NEED_SIZE * sizeof(int32_t));
        pipe.InitBuffer(outQueueDst, 1, perBlockSize * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        FirstCompute();
        CopyToWorkGlobal();  // Save the data computed by the current core to the external workspace.
        // Wait until all cores complete the computation.
        AscendC::LocalTensor<int32_t> workLocal = workQueue.AllocTensor<int32_t>();
        AscendC::SyncAll(syncGlobal, workLocal);
        workQueue.FreeTensor(workLocal);
        // The final addition result is computed after computation on all cores is complete.
        AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.DeQue<float>();
        AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        AscendC::DataCopy(dstLocal, srcLocal2, perBlockSize);  // Save the current core's result to the destination space.
        inQueueSrc2.FreeTensor(srcLocal2);
        for (int i = 0; i < blockNum; i++) {
            if (i != blockIdx) {
                CopyFromOtherCore(i);  // Read data from the external workspace.
                Accumulate(dstLocal);  // Add all data into the destination space.
            }
        }
        outQueueDst.EnQue(dstLocal);
        CopyOut();
    }

private:
    __aicore__ inline void CopyToWorkGlobal()
    {
        AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        AscendC::DataCopy(workGlobal[blockIdx * perBlockSize], dstLocal, perBlockSize);
        outQueueDst.FreeTensor(dstLocal);
    }
    __aicore__ inline void CopyFromOtherCore(int index)
    {
        AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>();
        AscendC::DataCopy(srcLocal, workGlobal[index * perBlockSize], perBlockSize);
        inQueueSrc1.EnQue(srcLocal);
    }
    __aicore__ inline void Accumulate(const AscendC::LocalTensor<float>& dstLocal)
    {
        AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>();
        AscendC::Add(dstLocal, dstLocal, srcLocal1, perBlockSize);
        inQueueSrc1.FreeTensor(srcLocal1);
    }
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>();
        AscendC::DataCopy(srcLocal, srcGlobal, perBlockSize);
        inQueueSrc1.EnQue(srcLocal);
    }
    __aicore__ inline void FirstCompute()
    {
        AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>();
        AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.AllocTensor<float>();
        AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        float scalarValue(2.0);
        AscendC::Muls(dstLocal, srcLocal1, scalarValue, perBlockSize);
        AscendC::PipeBarrier<PIPE_V>();
        AscendC::DataCopy(srcLocal2, dstLocal, perBlockSize);
        inQueueSrc1.FreeTensor(srcLocal1);
        inQueueSrc2.EnQue(srcLocal2);
        outQueueDst.EnQue(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        AscendC::DataCopy(dstGlobal, dstLocal, perBlockSize);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc2;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<float> srcGlobal;
    AscendC::GlobalTensor<float> dstGlobal;
    AscendC::GlobalTensor<float> workGlobal;
    AscendC::GlobalTensor<int32_t> syncGlobal;
    int srcDataSize = 256;
    int32_t blockNum = 0;
    int32_t blockIdx = 0;
    uint32_t perBlockSize = 0;
};

extern "C" __global__ __aicore__ void kernel_syncAll_float(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm,
                                                           __gm__ uint8_t* workGm, __gm__ uint8_t* syncGm)
{
    KernelSyncAll op;
    op.Init(srcGm, dstGm, workGm, syncGm);
    op.Process();
}
```
Input (srcGm): [1, 1, 1, 1, 1, ..., 1]
Output (dstGm): [16, 16, 16, 16, 16, ..., 16]