SyncAll
Supported Products
Product |
Supported (Soft Synchronization Prototype) |
Supported (Hard Synchronization Prototype) |
|---|---|---|
√ |
√ |
|
√ |
√ |
|
x |
x |
|
√ |
x |
|
x |
x |
|
√ |
x |
Functions
When different AI Cores operate the same global memory block, this function can be called to synchronize the AI Cores to avoid data dependency problems such as write-after-read, read-after-write, and write-after-write. Currently, multi-core synchronization is classified into hardware synchronization and software synchronization. Hardware synchronization uses the full-core synchronization instruction of the hardware to ensure multi-core synchronization. Software synchronization is implemented through software algorithm simulation.
Prototype
- Soft synchronization:
1 2
template <bool isAIVOnly = true> __aicore__ inline void SyncAll(const GlobalTensor<int32_t>& gmWorkspace, const LocalTensor<int32_t>& ubWorkspace, const int32_t usedCores = 0)
- Hard synchronization:
1 2
template <bool isAIVOnly = true> __aicore__ inline void SyncAll()
Parameters
Parameter |
Description |
|---|---|
isAIVOnly |
Controls whether SyncAll is applied to pure Vector operators or fused (Cube and Vector) operators. The values are as follows:
|
Parameter |
Input/Output |
Meaning |
|---|---|---|
gmWorkspace |
Input |
gmWorkspace is a user-defined global space and serves as the cache shared by all cores. It is used to store the status flag of each core. The type is GlobalTensor and the supported data type is int32_t. For details about the definition of the GlobalTensor data structure, see GlobalTensor. For details about the required space and precautions, see Constraints. The hardware synchronization API does not support this parameter. |
ubWorkspace |
Input |
ubWorkspace is user-defined local space. It is used by each core independently to mark the status of the current core. Type: LocalTensor, and the supported TPosition is VECIN/VECCALC/VECOUT. Supported data type: int32_t. For details about the required space, see Constraints. The hardware synchronization API does not support this parameter. |
usedCores |
Input |
Number of cores to be synchronized. The input value cannot exceed the logical blockDim value specified during operator calling. This parameter is used by default. If this parameter is not passed in, full-core soft synchronization is enabled. This parameter is supported only in the soft synchronization API. |
Returns
None
Constraints
- The space allocated for the gmWorkspace cache must be greater than or equal to the number of cores multiplied by 32 bytes, and the cache value must be initialized to 0. Currently, there are two common initialization modes.
- Perform initialization on the host to ensure that the gmWorkspace cache has been initialized to 0 when this API is transferred.
- Initialize the gmWorkspace cache during kernel initialization. Note that all gmWorkspace cache space needs to be initialized on each core.
- The space allocated for ubWorkspace must be greater than or equal to the number of cores multiplied by 32 bytes.
- When this API is used for multi-core control, the logical blockDim specified during operator calling must be less than or equal to the number of cores for running the operator. Otherwise, the framework inserts abnormal synchronization during multi-round scheduling, causing the kernel to stop responding.
- In separated mode, you are advised to use the hard synchronization API instead of the soft synchronization API. The soft synchronization API is applicable only to pure vector scenarios and has low performance. When using the hard synchronization API, you need to set the kernel type based on the scenario.
- In pure vector/cube scenarios, set the kernel type to KERNEL_TYPE_MIX_AIV_1_0 or KERNEL_TYPE_MIX_AIC_1_0.
- In the scenario where Vector and Cube are used together, the kernel type needs to be configured based on the actual situation.
Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 | #include "kernel_operator.h" const int32_t DEFAULT_SYNCALL_NEED_SIZE = 8; class KernelSyncAll { public: __aicore__ inline KernelSyncAll() {} __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm, __gm__ uint8_t* workGm, __gm__ uint8_t* syncGm) { blockNum = AscendC::GetBlockNum(); // Obtain the total number of cores. perBlockSize = srcDataSize / blockNum; // Each core evenly processes the same number of pieces of data. blockIdx = AscendC::GetBlockIdx(); // Obtain the ID of the current working core. srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(srcGm + blockIdx * perBlockSize * sizeof(float)), perBlockSize); dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(dstGm + blockIdx * perBlockSize * sizeof(float)), perBlockSize); workGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(workGm), srcDataSize); syncGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ int32_t*>(syncGm), blockNum * DEFAULT_SYNCALL_NEED_SIZE); pipe.InitBuffer(inQueueSrc1, 1, perBlockSize * sizeof(float)); pipe.InitBuffer(inQueueSrc2, 1, perBlockSize * sizeof(float)); pipe.InitBuffer(workQueue, 1, blockNum * DEFAULT_SYNCALL_NEED_SIZE * sizeof(int32_t)); pipe.InitBuffer(outQueueDst, 1, perBlockSize * sizeof(float)); } __aicore__ inline void Process() { CopyIn(); FirstCompute(); CopyToWorkGlobal(); // Save the data computed by the current working core to the external workspace. // Wait until all cores complete the computation. AscendC::LocalTensor<int32_t> workLocal = workQueue.AllocTensor<int32_t>(); AscendC::SyncAll(syncGlobal, workLocal); workQueue.FreeTensor(workLocal); // The final addition result needs to be computed after all cores complete the computation. AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.DeQue<float>(); AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>(); AscendC::DataCopy(dstLocal,srcLocal2,perBlockSize); // Save the data computed by the current working core to the destination space. inQueueSrc2.FreeTensor(srcLocal2); for (int i = 0; i < blockNum; i++) { if (i != blockIdx) { CopyFromOtherCore(i); // Read data from the external workspace. Accumulate(dstLocal); // All data is added to the destination space. } } outQueueDst.EnQue(dstLocal); CopyOut(); } private: __aicore__ inline void CopyToWorkGlobal() { AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>(); AscendC::DataCopy(workGlobal[blockIdx * perBlockSize], dstLocal, perBlockSize); outQueueDst.FreeTensor(dstLocal); } __aicore__ inline void CopyFromOtherCore(int index) { AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>(); AscendC::DataCopy(srcLocal, workGlobal[index * perBlockSize], perBlockSize); inQueueSrc1.EnQue(srcLocal); } __aicore__ inline void Accumulate(const AscendC::LocalTensor<float> &dstLocal) { AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>(); AscendC::Add(dstLocal, dstLocal, srcLocal1, perBlockSize); inQueueSrc1.FreeTensor(srcLocal1); } __aicore__ inline void CopyIn() { AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>(); AscendC::DataCopy(srcLocal, srcGlobal, perBlockSize); inQueueSrc1.EnQue(srcLocal); } __aicore__ inline void FirstCompute() { AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>(); AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.AllocTensor<float>(); AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>(); float scalarValue(2.0); AscendC::Muls(dstLocal, srcLocal1, scalarValue, perBlockSize); AscendC::PipeBarrier<PIPE_V>(); AscendC::DataCopy(srcLocal2,dstLocal,perBlockSize); inQueueSrc1.FreeTensor(srcLocal1); inQueueSrc2.EnQue(srcLocal2); outQueueDst.EnQue(dstLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>(); AscendC::DataCopy(dstGlobal, dstLocal, perBlockSize); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc1; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc2; AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueue; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<float> srcGlobal; AscendC::GlobalTensor<float> dstGlobal; AscendC::GlobalTensor<float> workGlobal; AscendC::GlobalTensor<int32_t> syncGlobal; int srcDataSize = 256; int32_t blockNum = 0; int32_t blockIdx = 0; uint32_t perBlockSize = 0; }; extern "C" __global__ __aicore__ void kernel_syncAll_float(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm, __gm__ uint8_t* workGm, __gm__ uint8_t* syncGm) { KernelSyncAll op; op.Init(srcGm, dstGm, workGm, syncGm); op.Process(); } |
Input (srcGm): [1,1,1,1,1,...,1] Output (dstGm): [16,16,16,16,16,...,16]