InitDetermineComputeWorkspace
Supported Products
Product |
Supported/Unsupported |
|---|---|
x |
|
√ |
|
x |
|
√ |
|
x |
|
x |
Function Usage
Initializes the value of the GM shared memory. WaitPreBlock and NotifyNextBlock can be called only after the initialization is complete.
Prototype
1 | __aicore__ inline void InitDetermineComputeWorkspace(GlobalTensor<int32_t>& gmWorkspace, LocalTensor<int32_t>& ubWorkspace) |
Parameters
Parameter |
Input/Output |
Meaning |
|---|---|---|
gmWorkspace |
Input |
Temporary space, which is used to initialize the shared memory for inter-core synchronization. The type is GlobalTensor. |
ubWorkspace |
Input |
Temporary space, which is used to operate gmWorkspace. The type is LocalTensor. |
Returns
None
Constraints
- The minimum space allocated to gmWorkspace is blockNum * 32 bytes. The minimum space allocated to ubWorkspace is (blockNum * 32 + 32 bytes). blockNum indicates the number of called cores, which can be obtained by calling GetBlockNum.
- When this API is used for multi-core control, the logical blockDim specified during operator calling must be less than or equal to the number of cores for running the operator. Otherwise, the framework inserts abnormal synchronization during multi-round scheduling, causing the kernel to stop responding.
Example
The following example simulates eight cores for data processing. The deterministic computing API is used to ensure the inter-core running sequence and perform atomic accumulation.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | #include "kernel_operator.h" template <typename T> class SyncTest { public: __aicore__ inline SyncTest() {} __aicore__ inline void Init(GM_ADDR dstGm, GM_ADDR srcGm, GM_ADDR gmWorkspace, const DetermineComputeSyncTilingData& tiling_data) { m_elementCount = tiling_data.size; m_tileNum = tiling_data.tileNum; m_tileCount = m_elementCount / m_tileNum; m_dstGlobal.SetGlobalBuffer((__gm__ T*)dstGm); m_srcGlobal.SetGlobalBuffer((__gm__ T*)srcGm); m_gmWorkspace.SetGlobalBuffer((__gm__ int32_t*)gmWorkspace); m_pipe.InitBuffer(m_que, 1, m_elementCount * sizeof(T)); m_pipe.InitBuffer(m_queTmp, 1, 8 * sizeof(int32_t)); } __aicore__ inline void Process() { AscendC::LocalTensor<int32_t> ubWorkspace = m_queTmp.AllocTensor<int32_t>(); AscendC::InitDetermineComputeWorkspace(m_gmWorkspace, ubWorkspace); for(int64_t i = 0; i < m_tileNum; i++) { // copy in AscendC::LocalTensor<T> srcLocal = m_que.AllocTensor<T>(); AscendC::DataCopy(srcLocal, m_srcGlobal[i * m_tileCount], m_tileCount); // copy out AscendC::WaitPreBlock(m_gmWorkspace, ubWorkspace); AscendC::SetAtomicAdd<T>(); AscendC::DataCopy(m_dstGlobal[i * m_tileCount], srcLocal, m_tileCount); AscendC::SetAtomicNone(); AscendC::NotifyNextBlock(m_gmWorkspace, ubWorkspace); m_que.FreeTensor(srcLocal); } m_queTmp.FreeTensor(ubWorkspace); } private: AscendC::TPipe m_pipe; int64_t m_elementCount; int64_t m_tileNum; int64_t m_tileCount; AscendC::GlobalTensor<T> m_srcGlobal; AscendC::GlobalTensor<T> m_dstGlobal; AscendC::GlobalTensor<int32_t> m_gmWorkspace; AscendC::TQue<AscendC::TPosition::VECIN, 1> m_que; AscendC::TQue<AscendC::TPosition::VECIN, 1> m_queTmp; }; // class SyncTest extern "C" __global__ __aicore__ void determine_compute_sync(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling) { GET_TILING_DATA(tiling_data, tiling); GM_ADDR usrWorkspace = AscendC::GetUserWorkspace(workspace); // Obtain pointer to the user workspace. SyncTest<float> op; op.Init(y, x, usrWorkspace, tiling_data); op.Process(); } |
// Input data of each core: [1,1,1,1,1,...,1] // 1 × 256 // Final output data: [8,8,8,8,8,...,8] // 8 × 256