IBSet
Function Usage
When multiple AI Cores operate on the same global memory block, this function can be called to synchronize the cores and avoid data-dependency hazards such as write-after-read, read-after-write, and write-after-write. IBSet sets the synchronization flag of a core. IBSet and IBWait are used in pairs to implement inter-core synchronization: IBWait blocks until the corresponding core operation has completed.
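The typical pairing pattern is sketched below: the producer core writes its data to global memory and then calls IBSet, while the consumer core calls IBWait with the same blockIdx and eventID before reading that data. This is a minimal sketch mirroring the full example later in this page; sync_gm (a GlobalTensor&lt;int32_t&gt; workspace) and sync_buf (a LocalTensor&lt;int32_t&gt; scratch buffer) are placeholders that must be prepared by the caller.

```cpp
if (AscendC::GetBlockIdx() == 0) {
    // ... core 0 writes its results to global memory ...
    AscendC::IBSet(sync_gm, sync_buf, 0, 0);  // set flag: blockIdx 0, eventID 0
} else if (AscendC::GetBlockIdx() == 1) {
    AscendC::IBWait(sync_gm, sync_buf, 0, 0); // wait for the matching IBSet
    // ... core 1 can now safely read core 0's results ...
}
```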
Prototype
```cpp
template <bool isAIVOnly = true>
__aicore__ inline void IBSet(const GlobalTensor<int32_t>& gmWorkspace, const LocalTensor<int32_t>& ubWorkspace,
                             int32_t blockIdx, int32_t eventID)
```
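Since isAIVOnly defaults to true, a plain call uses AIVOnly mode; for MIX mode, the template parameter would presumably be set explicitly, as in the hedged sketch below (gmWorkspace, ubWorkspace, and the index values are placeholders):

```cpp
// Default: AIVOnly mode (isAIVOnly = true).
AscendC::IBSet(gmWorkspace, ubWorkspace, 0, 0);
// MIX mode: pass isAIVOnly = false explicitly.
AscendC::IBSet<false>(gmWorkspace, ubWorkspace, 0, 0);
```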
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| gmWorkspace | Output | Public buffer for storing the external core status. The type is GlobalTensor. For details about the definition of the GlobalTensor data structure, see GlobalTensor. |
| ubWorkspace | Input | Public buffer that stores the current core status. The type is LocalTensor; the supported TPosition values are VECIN, VECCALC, and VECOUT. |
| blockIdx | Input | Index of the waiting core. The value ranges from 0 to (number of cores - 1). |
| eventID | Input | ID of the set/wait event of the current core. An IBSet call is matched with the IBWait call that uses the same blockIdx and eventID. |
| isAIVOnly | Input | Whether AIVOnly mode is used. The default value is true. |
Returns
None
Availability
Constraints
- The minimum space allocated for gmWorkspace is: (number of cores) * 32 bytes * eventID_max + blockIdx_max * 32 bytes + 32 bytes, where eventID_max and blockIdx_max are the maximum values of eventID and blockIdx, respectively. (See the worked calculation after this list.)
- In AIVOnly mode, the number of cores is GetBlockNum(); in MIX mode, it is GetBlockNum() × 2.
- The minimum size of ubWorkspace is 32 bytes.
- The gmWorkspace buffer must be initialized to 0 before use.
- When this API is used for multi-core control, the logical blockDim specified when the operator is called must be less than or equal to the number of physical cores actually running the operator. Otherwise, the framework inserts abnormal synchronization during multi-round scheduling, causing the kernel to hang.
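As a worked illustration of the size formula above, the snippet below computes the minimum gmWorkspace size for an assumed configuration of 8 cores in AIVOnly mode with eventID_max = 3 and blockIdx_max = 7 (all three values are example assumptions, not requirements):

```cpp
// Assumed configuration: 8 cores in AIVOnly mode, eventID in 0..3, blockIdx in 0..7.
constexpr int32_t coreNum = 8;     // GetBlockNum() in AIVOnly mode
constexpr int32_t eventIdMax = 3;  // maximum eventID used by the kernel
constexpr int32_t blockIdxMax = 7; // maximum blockIdx used by the kernel
// cores * 32 B * eventID_max + blockIdx_max * 32 B + 32 B
constexpr int32_t minGmWorkspaceBytes =
    coreNum * 32 * eventIdMax + blockIdxMax * 32 + 32; // 768 + 224 + 32 = 1024 bytes
```

Remember that this buffer must also be zero-initialized before the kernel runs, per the constraint above.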
Example
```cpp
#include "kernel_operator.h"

constexpr int32_t TOTAL_LENGTH = 2 * 256;
constexpr int32_t USE_CORE_NUM = 2;
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* sync, __gm__ uint8_t* z)
    {
        blockIdx = AscendC::GetBlockIdx();
        xGm.SetGlobalBuffer((__gm__ half*)x);
        yGm.SetGlobalBuffer((__gm__ half*)y);
        sync_gm.SetGlobalBuffer((__gm__ int32_t*)(sync), 256); // inter-core sync workspace
        zGm.SetGlobalBuffer((__gm__ half*)z);
        pipe.InitBuffer(inQueueX, 1, BLOCK_LENGTH * sizeof(half));
        pipe.InitBuffer(inQueueY, 1, BLOCK_LENGTH * sizeof(half));
        pipe.InitBuffer(vecIn, 1, 8 * sizeof(int32_t)); // 32-byte UB scratch for IBSet/IBWait
        pipe.InitBuffer(outQueueZ, 1, BLOCK_LENGTH * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        if (blockIdx == 1) {
            // Core 1 consumes core 0's output, so it waits for the matching IBSet.
            auto sync_buf = vecIn.AllocTensor<int32_t>();
            AscendC::IBWait(sync_gm, sync_buf, 0, 0);
            vecIn.FreeTensor(sync_buf);
        }
        CopyIn();
        Compute();
        CopyOut();
        if (blockIdx == 0) {
            // Core 0 signals that its result has been written to zGm.
            auto sync_buf = vecIn.AllocTensor<int32_t>();
            AscendC::IBSet(sync_gm, sync_buf, 0, 0);
            vecIn.FreeTensor(sync_buf);
        }
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        if (blockIdx == 1) {
            // Core 1 reads core 0's result block from zGm as its first operand.
            AscendC::DataCopy(xLocal, zGm[0 * BLOCK_LENGTH], BLOCK_LENGTH);
            AscendC::DataCopy(yLocal, yGm[1 * BLOCK_LENGTH], BLOCK_LENGTH);
        } else {
            AscendC::DataCopy(xLocal, xGm[0], BLOCK_LENGTH);
            AscendC::DataCopy(yLocal, yGm[0], BLOCK_LENGTH);
        }
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        AscendC::Add(zLocal, xLocal, yLocal, BLOCK_LENGTH);
        outQueueZ.EnQue<half>(zLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        AscendC::DataCopy(zGm[blockIdx * BLOCK_LENGTH], zLocal, BLOCK_LENGTH);
        outQueueZ.FreeTensor(zLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX, inQueueY, vecIn;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueZ;
    AscendC::GlobalTensor<half> xGm, yGm, zGm;
    AscendC::GlobalTensor<int32_t> sync_gm;
    int32_t blockIdx = 0;
};

extern "C" __global__ __aicore__ void add_simple_kernel(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* sync, __gm__ uint8_t* z)
{
    KernelAdd op;
    op.Init(x, y, sync, z);
    op.Process();
}
```
Input:
x: [1, 1, 1, ..., 1] // 512 ones
y: [1, 1, 1, ..., 1] // 512 ones
Output (zGm):
[2, 2, ..., 2, 3, 3, ..., 3] // The first 256 values are 2 (core 0 computes x + y), and the last 256 values are 3 (core 1 computes core 0's result + y).
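The sync buffer passed to this kernel must satisfy the constraints above: it must be large enough and zero-initialized before launch. A minimal host-side sketch using the ACL runtime follows; the size and the ACL calls reflect the generic CANN host flow and are assumptions here, not part of this API.

```cpp
#include "acl/acl.h"

// Hypothetical host-side preparation of the sync workspace for add_simple_kernel.
void* syncDevice = nullptr;
size_t syncSize = 256 * sizeof(int32_t); // matches SetGlobalBuffer(..., 256) in the kernel
aclrtMalloc(&syncDevice, syncSize, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMemset(syncDevice, syncSize, 0, syncSize); // gmWorkspace must start as all zeros
// ... launch add_simple_kernel with syncDevice as the sync argument,
// then release it with aclrtFree(syncDevice) after the kernel completes ...
```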