InitDetermineComputeWorkspace

Product Support

Product	Supported
Atlas A3 training products/Atlas A3 inference products	x
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
AI Core of Atlas inference products	√
Vector Core of Atlas inference products	x
Atlas training products	x

Function

Initializes the shared memory in global memory (GM) that is used for inter-core synchronization. WaitPreBlock and NotifyNextBlock can be called only after this initialization is complete.

Prototype

__aicore__ inline void InitDetermineComputeWorkspace(GlobalTensor<int32_t>& gmWorkspace, LocalTensor<int32_t>& ubWorkspace)

Parameters

**Table 1** Parameters
Parameter	Input/Output	Meaning
gmWorkspace	Input	Temporary space, which is used to initialize the shared memory for inter-core synchronization. The type is GlobalTensor.
ubWorkspace	Input	Temporary space, which is used to operate gmWorkspace. The type is LocalTensor.

Returns

None

Restrictions

gmWorkspace must be at least blockNum * 32 bytes, and ubWorkspace must be at least blockNum * 32 + 32 bytes. blockNum is the number of cores used by the operator, which can be obtained by calling GetBlockNum.
When this API is used for multi-core control, the logical blockDim specified when the operator is called must be less than or equal to the number of cores that actually run the operator. Otherwise, the framework inserts abnormal synchronization during multi-round scheduling, and the kernel stops responding.
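The size and blockDim rules above can be checked on the host before the operator is launched. The following is a minimal sketch in plain C++, not part of the Ascend C API: the helper names MinGmWorkspaceBytes, MinUbWorkspaceBytes, MinUbWorkspaceElems, and IsBlockDimValid are hypothetical, and blockNum is assumed to equal the value that GetBlockNum returns inside the kernel.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side helpers (not Ascend C APIs) that restate the
// documented minimum workspace sizes.
constexpr std::size_t kSlotBytes = 32;  // 32 bytes per core, per the restriction

// Minimum gmWorkspace size: blockNum * 32 bytes.
constexpr std::size_t MinGmWorkspaceBytes(std::size_t blockNum) {
    return blockNum * kSlotBytes;
}

// Minimum ubWorkspace size: blockNum * 32 + 32 bytes.
constexpr std::size_t MinUbWorkspaceBytes(std::size_t blockNum) {
    return blockNum * kSlotBytes + kSlotBytes;
}

// The same ubWorkspace minimum expressed as a number of int32_t elements.
constexpr std::size_t MinUbWorkspaceElems(std::size_t blockNum) {
    return MinUbWorkspaceBytes(blockNum) / sizeof(std::int32_t);
}

// The logical blockDim must not exceed the number of cores running the operator.
constexpr bool IsBlockDimValid(std::size_t blockDim, std::size_t coreCount) {
    return blockDim <= coreCount;
}
```

For eight cores, these helpers give 256 bytes for gmWorkspace and 288 bytes (72 int32_t elements) for ubWorkspace.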

Example

The following example simulates eight cores for data processing. The deterministic computing APIs are used to enforce the execution order between cores while the results are atomically accumulated.

#include "kernel_operator.h"

template <typename T>
class SyncTest {
public:
    __aicore__ inline SyncTest() {}
    __aicore__ inline void Init(GM_ADDR dstGm, GM_ADDR srcGm, GM_ADDR gmWorkspace,
                                const DetermineComputeSyncTilingData& tiling_data)
    {
        m_elementCount = tiling_data.size;
        m_tileNum = tiling_data.tileNum;
        m_tileCount = m_elementCount / m_tileNum;

        m_dstGlobal.SetGlobalBuffer((__gm__ T*)dstGm);
        m_srcGlobal.SetGlobalBuffer((__gm__ T*)srcGm);
        m_gmWorkspace.SetGlobalBuffer((__gm__ int32_t*)gmWorkspace);

        m_pipe.InitBuffer(m_que, 1, m_elementCount * sizeof(T));
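        // ubWorkspace needs at least blockNum * 32 + 32 bytes (see Restrictions).
        m_pipe.InitBuffer(m_queTmp, 1, (AscendC::GetBlockNum() * 8 + 8) * sizeof(int32_t));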
    }

    __aicore__ inline void Process()
    {
        AscendC::LocalTensor<int32_t> ubWorkspace = m_queTmp.AllocTensor<int32_t>();
        AscendC::InitDetermineComputeWorkspace(m_gmWorkspace, ubWorkspace);
        for (int64_t i = 0; i < m_tileNum; i++) {
            // copy in
            AscendC::LocalTensor<T> srcLocal = m_que.AllocTensor<T>();
            AscendC::DataCopy(srcLocal, m_srcGlobal[i * m_tileCount], m_tileCount);

            // copy out
            AscendC::WaitPreBlock(m_gmWorkspace, ubWorkspace);
            AscendC::SetAtomicAdd<T>();
            AscendC::DataCopy(m_dstGlobal[i * m_tileCount], srcLocal, m_tileCount);
            AscendC::SetAtomicNone();
            AscendC::NotifyNextBlock(m_gmWorkspace, ubWorkspace);
            m_que.FreeTensor(srcLocal);
        }
        m_queTmp.FreeTensor(ubWorkspace);
    }

private:
    AscendC::TPipe m_pipe;
    int64_t m_elementCount;
    int64_t m_tileNum;
    int64_t m_tileCount;
    AscendC::GlobalTensor<T> m_srcGlobal;
    AscendC::GlobalTensor<T> m_dstGlobal;
    AscendC::GlobalTensor<int32_t> m_gmWorkspace;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> m_que;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> m_queTmp;
}; // class SyncTest

extern "C" __global__ __aicore__ void determine_compute_sync(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tiling_data, tiling);
    GM_ADDR usrWorkspace = AscendC::GetUserWorkspace(workspace); // Obtain pointer to the user workspace.

    SyncTest<float> op;
    op.Init(y, x, usrWorkspace, tiling_data);
    op.Process();
}

// Input data of each core (256 elements, all equal to 1):
[1,1,1,1,1,...,1]
// Final output data (256 elements, all equal to 8, the sum over the 8 cores):
[8,8,8,8,8,...,8]
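The ordering that the example relies on can be modeled on the host. The following is a rough analogy in standard C++, not Ascend C code and not a description of how the APIs are implemented: each thread plays the role of one core, and a shared counter named turn (an assumption made for illustration) stands in for the GM workspace. The function name SimulateOrderedAccumulate is hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Host-side analogy only: threads stand in for cores and `turn` stands in
// for the shared GM workspace. This is not the Ascend C implementation.
std::vector<float> SimulateOrderedAccumulate(int blockNum, int elementCount) {
    std::vector<float> dst(elementCount, 0.0f);
    std::atomic<int> turn{0};  // index of the core that may accumulate next
    std::vector<std::thread> cores;

    for (int blockIdx = 0; blockIdx < blockNum; ++blockIdx) {
        cores.emplace_back([&dst, &turn, blockIdx, elementCount] {
            // Each simulated core holds an input of all ones.
            std::vector<float> src(elementCount, 1.0f);

            // Analogue of WaitPreBlock: wait until the previous core is done.
            while (turn.load(std::memory_order_acquire) != blockIdx) {
                std::this_thread::yield();
            }

            // Accumulate into the shared output while holding the turn.
            for (int i = 0; i < elementCount; ++i) {
                dst[i] += src[i];
            }

            // Analogue of NotifyNextBlock: hand the turn to the next core.
            turn.store(blockIdx + 1, std::memory_order_release);
        });
    }

    for (auto& core : cores) {
        core.join();
    }
    return dst;
}
```

Calling SimulateOrderedAccumulate(8, 256) returns 256 elements that are all 8, matching the output listed above. Because the cores accumulate in a fixed order (0, 1, ..., 7), the floating-point additions happen in the same sequence on every run.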
