SyncAll

Product Support

Product	Supported (Soft Synchronization Prototype)	Supported (Hard Synchronization Prototype)
Atlas A3 training products/Atlas A3 inference products	√	√
Atlas A2 training products/Atlas A2 inference products	√	√
Atlas 200I/500 A2 inference products	x	x
Atlas inference product's AI Core	√	x
Atlas inference product's Vector Core	x	x
Atlas training products	√	x

Function

When different cores operate on the same global memory and data dependency issues such as read-after-write, write-after-read, or write-after-write may occur, call this function to insert synchronization statements to avoid data read/write errors caused by such data dependencies. Currently, multi-core synchronization is classified into hardware synchronization and software synchronization. Hardware synchronization uses hardware-provided all-core synchronization instructions to ensure multi-core synchronization. Software synchronization is implemented through software algorithms.

Prototype

Soft synchronization:

template <bool isAIVOnly = true>
__aicore__ inline void SyncAll(const GlobalTensor<int32_t>& gmWorkspace, const LocalTensor<int32_t>& ubWorkspace, const int32_t usedCores = 0)

Hard synchronization:

template <bool isAIVOnly = true>
__aicore__ inline void SyncAll()

Parameters

**Table 1** Template parameters
Parameter	Description
isAIVOnly	Indicates whether SyncAll is applied to pure vector operators or fused (cube and vector) operators. The values are as follows: true (default value): full-core synchronization of pure vector operators. Only full-core synchronization of vector cores is performed. false: full-core synchronization of fused operators. The full-core synchronization of vector cores and cube cores is performed separately first, and then the synchronization between the two types of cores is performed. (The soft synchronization API does not support this function.)

**Table 2** Parameters
Parameter	Input/Output	Meaning
gmWorkspace	Input	gmWorkspace is a user-defined global space and serves as the cache shared by all cores. It is used to store the status flag of each core. The type is GlobalTensor and the supported data type is int32_t. For details about the definition of the GlobalTensor data structure, see GlobalTensor. For details about the required space and precautions, see Restrictions. The hardware synchronization API does not support this parameter.
ubWorkspace	Input	ubWorkspace is user-defined local space. It is used by each core independently to mark the status of the current core. Type: LocalTensor, and the supported TPosition is VECIN/VECCALC/VECOUT. Supported data type: int32_t. For details about the required space, see Restrictions. The hardware synchronization API does not support this parameter.
usedCores	Input	Number of cores to be synchronized. The input value cannot exceed the logical blockDim value specified during operator calling. This parameter is used by default. If this parameter is not passed in, full-core soft synchronization is enabled. This parameter is supported only in the soft synchronization API.

Returns

None

Restrictions

The space allocated for the gmWorkspace cache must be greater than or equal to the number of cores multiplied by 32 bytes, and the cache value must be initialized to 0. Currently, there are two common initialization modes.
- Perform initialization on the host to ensure that the gmWorkspace cache has been initialized to 0 when this API is transferred.
- Initialize the gmWorkspace cache during kernel initialization. Note that all gmWorkspace cache space needs to be initialized on each core.
The space allocated for ubWorkspace must be greater than or equal to the number of cores multiplied by 32 bytes.
When this API is used for multi-core control, the logical blockDim specified during operator calling must be less than or equal to the number of cores for running the operator. Otherwise, the framework inserts abnormal synchronization during multi-round scheduling, causing the kernel to stop responding.
In separate mode, you are advised to use the hard synchronization API instead of the soft synchronization API. The soft synchronization API is applicable only to pure vector scenarios and has low performance. When using the hard synchronization API, you need to set the kernel type based on the scenario.
- In pure vector or cube scenarios, set the kernel type to KERNEL_TYPE_MIX_AIV_1_0 or KERNEL_TYPE_MIX_AIC_1_0.
- In vector and cube fused scenarios, set the kernel type based on the actual situation.

Example

In this example, eight cores are used for data processing. Each core processes 32 pieces of float-type data. The data is multiplied by 2 and then added to the data on other cores that are multiplied by 2 in the same way. The intermediate result is saved to workGm. Therefore, data needs to be synchronized between multiple cores. In this example, when software synchronization is used, the value of syncGm passed by the entrypoint function has been initialized to 0 on the host. If hardware synchronization is used in the following test cases, syncGm and workQueue do not need to be transferred.

#include "kernel_operator.h"

const int32_t DEFAULT_SYNCALL_NEED_SIZE = 8;

class KernelSyncAll {
public:
    __aicore__ inline KernelSyncAll() {}
    __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm, __gm__ uint8_t* workGm,
        __gm__ uint8_t* syncGm)
    {
        blockNum = AscendC::GetBlockNum(); // Obtain the total number of cores.
        perBlockSize = srcDataSize / blockNum; // Each core evenly processes the same number of pieces of data.
        blockIdx = AscendC::GetBlockIdx(); // Obtain the ID of the current working core.
        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(srcGm + blockIdx * perBlockSize * sizeof(float)),
            perBlockSize);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(dstGm + blockIdx * perBlockSize * sizeof(float)),
            perBlockSize);
        workGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(workGm), srcDataSize);
        syncGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ int32_t*>(syncGm), blockNum * DEFAULT_SYNCALL_NEED_SIZE);
        pipe.InitBuffer(inQueueSrc1, 1, perBlockSize * sizeof(float));
        pipe.InitBuffer(inQueueSrc2, 1, perBlockSize * sizeof(float));
        pipe.InitBuffer(workQueue, 1, blockNum * DEFAULT_SYNCALL_NEED_SIZE * sizeof(int32_t));
        pipe.InitBuffer(outQueueDst, 1, perBlockSize * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        FirstCompute();
        CopyToWorkGlobal(); // Save the data computed by the current working core to the external workspace.
        // Wait until all cores complete the computation.
        AscendC::LocalTensor<int32_t> workLocal = workQueue.AllocTensor<int32_t>();
        AscendC::SyncAll(syncGlobal, workLocal);
        workQueue.FreeTensor(workLocal);
        // The final addition result needs to be computed after all cores complete the computation.
        AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.DeQue<float>();
        AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        AscendC::DataCopy(dstLocal,srcLocal2,perBlockSize); // Save the data computed by the current working core to the destination space.
        inQueueSrc2.FreeTensor(srcLocal2);
        for (int i = 0; i < blockNum; i++) {
            if (i != blockIdx) {
                CopyFromOtherCore(i); // Read data from the external workspace.
                Accumulate(dstLocal); // All data is added to the destination space.
            }
        }
        outQueueDst.EnQue(dstLocal);
        CopyOut();
    }
private:
    __aicore__ inline void CopyToWorkGlobal()
    {
        AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        AscendC::DataCopy(workGlobal[blockIdx * perBlockSize], dstLocal, perBlockSize);
        outQueueDst.FreeTensor(dstLocal);
    }
    __aicore__ inline void CopyFromOtherCore(int index)
    {
        AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>();
        AscendC::DataCopy(srcLocal, workGlobal[index * perBlockSize], perBlockSize);
        inQueueSrc1.EnQue(srcLocal);
    }
    __aicore__ inline void Accumulate(const AscendC::LocalTensor<float> &dstLocal)
    {
        AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>();
        AscendC::Add(dstLocal, dstLocal, srcLocal1, perBlockSize);
        inQueueSrc1.FreeTensor(srcLocal1);
    }
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>();
        AscendC::DataCopy(srcLocal, srcGlobal, perBlockSize);
        inQueueSrc1.EnQue(srcLocal);
    }
    __aicore__ inline void FirstCompute()
    {
        AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>();
        AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.AllocTensor<float>();
        AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        float scalarValue(2.0);
        AscendC::Muls(dstLocal, srcLocal1, scalarValue, perBlockSize);
        AscendC::PipeBarrier<PIPE_V>();
        AscendC::DataCopy(srcLocal2,dstLocal,perBlockSize);
        inQueueSrc1.FreeTensor(srcLocal1);
        inQueueSrc2.EnQue(srcLocal2);
        outQueueDst.EnQue(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        AscendC::DataCopy(dstGlobal, dstLocal, perBlockSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc2;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueue;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<float> srcGlobal;
    AscendC::GlobalTensor<float> dstGlobal;
    AscendC::GlobalTensor<float> workGlobal;
    AscendC::GlobalTensor<int32_t> syncGlobal;
    int srcDataSize = 256;
    int32_t blockNum = 0;
    int32_t blockIdx = 0;
    uint32_t perBlockSize = 0;
};

extern "C" __global__ __aicore__ void kernel_syncAll_float(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm,
    __gm__ uint8_t* workGm, __gm__ uint8_t* syncGm)
{
    KernelSyncAll op;
    op.Init(srcGm, dstGm, workGm, syncGm);
    op.Process();
}

Input (srcGm):
[1,1,1,1,1,...,1]
Output (dstGm):
[16,16,16,16,16,...,16]

Parent topic: Inter-Core Synchronization