InitDetermineComputeWorkspace

Supported Products

Product

Supported/Unsupported

Atlas A3 training products/Atlas A3 inference products

x

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product's AI Core

Atlas inference product's Vector Core

x

Atlas training products

x

Function Usage

Initializes the value of the GM shared memory. WaitPreBlock and NotifyNextBlock can be called only after the initialization is complete.

Prototype

1
__aicore__ inline void InitDetermineComputeWorkspace(GlobalTensor<int32_t>& gmWorkspace, LocalTensor<int32_t>& ubWorkspace)

Parameters

Table 1 Parameters

Parameter

Input/Output

Meaning

gmWorkspace

Input

Temporary space, which is used to initialize the shared memory for inter-core synchronization. The type is GlobalTensor.

ubWorkspace

Input

Temporary space, which is used to operate gmWorkspace. The type is LocalTensor.

Returns

None

Constraints

  • The minimum space allocated to gmWorkspace is blockNum * 32 bytes. The minimum space allocated to ubWorkspace is (blockNum * 32 + 32 bytes). blockNum indicates the number of called cores, which can be obtained by calling GetBlockNum.
  • When this API is used for multi-core control, the logical blockDim specified during operator calling must be less than or equal to the number of cores for running the operator. Otherwise, the framework inserts abnormal synchronization during multi-round scheduling, causing the kernel to stop responding.

Example

The following example simulates eight cores for data processing. The deterministic computing API is used to ensure the inter-core running sequence and perform atomic accumulation.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
#include "kernel_operator.h"

template <typename T>
class SyncTest {
public:
    __aicore__ inline SyncTest() {}
    __aicore__ inline void Init(GM_ADDR dstGm, GM_ADDR srcGm, GM_ADDR gmWorkspace,
    const DetermineComputeSyncTilingData& tiling_data)
    {
        m_elementCount = tiling_data.size;
        m_tileNum = tiling_data.tileNum;
        m_tileCount = m_elementCount / m_tileNum;

        m_dstGlobal.SetGlobalBuffer((__gm__ T*)dstGm);
        m_srcGlobal.SetGlobalBuffer((__gm__ T*)srcGm);
        m_gmWorkspace.SetGlobalBuffer((__gm__ int32_t*)gmWorkspace);

        m_pipe.InitBuffer(m_que, 1, m_elementCount * sizeof(T));
        m_pipe.InitBuffer(m_queTmp, 1, 8 * sizeof(int32_t));
    }

    __aicore__ inline void Process()
    {
        AscendC::LocalTensor<int32_t> ubWorkspace = m_queTmp.AllocTensor<int32_t>();
        AscendC::InitDetermineComputeWorkspace(m_gmWorkspace, ubWorkspace);
        for(int64_t i = 0; i < m_tileNum; i++) {
            // copy in
            AscendC::LocalTensor<T> srcLocal = m_que.AllocTensor<T>();
            AscendC::DataCopy(srcLocal, m_srcGlobal[i * m_tileCount], m_tileCount);

            // copy out
            AscendC::WaitPreBlock(m_gmWorkspace, ubWorkspace);
            AscendC::SetAtomicAdd<T>();
            AscendC::DataCopy(m_dstGlobal[i * m_tileCount], srcLocal, m_tileCount);
            AscendC::SetAtomicNone();
            AscendC::NotifyNextBlock(m_gmWorkspace, ubWorkspace);
            m_que.FreeTensor(srcLocal);
        }
        m_queTmp.FreeTensor(ubWorkspace);
    }

private:
    AscendC::TPipe m_pipe;
    int64_t m_elementCount;
    int64_t m_tileNum;
    int64_t m_tileCount;
    AscendC::GlobalTensor<T> m_srcGlobal;
    AscendC::GlobalTensor<T> m_dstGlobal;
    AscendC::GlobalTensor<int32_t> m_gmWorkspace;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> m_que;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> m_queTmp;
}; // class SyncTest

extern "C" __global__ __aicore__ void determine_compute_sync(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tiling_data, tiling);
    GM_ADDR usrWorkspace = AscendC::GetUserWorkspace(workspace); // Obtain pointer to the user workspace.

    SyncTest<float> op;
    op.Init(y, x, usrWorkspace, tiling_data);
    op.Process();
}
// Input data of each core:
[1,1,1,1,1,...,1] // 1 × 256
// Final output data:
[8,8,8,8,8,...,8] // 8 × 256