Automatic Synchronization

When compiling operators with Ascend C, you can set the BiSheng Compiler option --cce-auto-sync to enable automatic synchronization. This option is enabled by default in kernel launch projects and custom operator projects. With it, the compiler automatically inserts synchronization instructions between the following execution units in the AI Core:

Between MTE2 and the Scalar Unit
Between MTE3 and the Scalar Unit
Between the Vector and Scalar Units
Between the Vector Units

The Ascend C programming framework and the compiler together provide the following automatic synchronization capabilities. For details, see "Introduction to Synchronization Control".

Single-pipeline synchronization: PIPE_V synchronization is automatically inserted by the compiler. If the transfer addresses of PIPE_MTE2 and PIPE_MTE3 overlap, you must insert synchronization manually.
Multi-pipeline synchronization: The multi-pipeline synchronization between PIPE_V, PIPE_MTE2, PIPE_MTE3, and PIPE_S is bidirectional. As shown in the following figure, the yellow line indicates that synchronization is automatically inserted by the compiler, and the remaining synchronization is completed by the Ascend C framework.

Restrictions on Automatic Synchronization

Automatic synchronization works only if both of the following restrictions are met:

All functions called in the kernel function are inline functions.
The Ascend C programming model is used properly.

All functions called in the kernel function must be inline functions.

The functions in the following example are not inline, so automatic synchronization is not supported.

...
// Process function in the operator class implementation
__aicore__ void Process()
{
    CopyIn();
    Compute();
    CopyOut();
}
__aicore__ void CopyIn()
{
    LocalTensor<int32_t> srcLocal = inQueueSrc.AllocTensor<int32_t>();
    DataCopy(srcLocal, srcGlobal, 512);
    inQueueSrc.EnQue(srcLocal);
}
__aicore__ void Compute()
{
    LocalTensor<int32_t> srcLocal = inQueueSrc.DeQue<int32_t>();
    LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>();
    uint64_t mask = 64;
    Copy(dstLocal, srcLocal, mask, 4, { 1, 1, 8, 8 });
    outQueueDst.EnQue<int32_t>(dstLocal);
    inQueueSrc.FreeTensor(srcLocal);
}
__aicore__ void CopyOut()
{
    LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>();
    DataCopy(dstGlobal, dstLocal, 512);
    outQueueDst.FreeTensor(dstLocal);
}
...

To enable automatic synchronization, declare every function in the preceding example as inline:

...
// Process function in the operator class implementation
__aicore__ inline void Process()
{
    CopyIn();
    Compute();
    CopyOut();
}
__aicore__ inline void CopyIn()
{
    LocalTensor<int32_t> srcLocal = inQueueSrc.AllocTensor<int32_t>();
    DataCopy(srcLocal, srcGlobal, 512);
    inQueueSrc.EnQue(srcLocal);
}

__aicore__ inline void Compute()
{
    LocalTensor<int32_t> srcLocal = inQueueSrc.DeQue<int32_t>();
    LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>();
    uint64_t mask = 64;
    Copy(dstLocal, srcLocal, mask, 4, { 1, 1, 8, 8 });
    outQueueDst.EnQue<int32_t>(dstLocal);
    inQueueSrc.FreeTensor(srcLocal);
}

__aicore__ inline void CopyOut()
{
    LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>();
    DataCopy(dstGlobal, dstLocal, 512);
    outQueueDst.FreeTensor(dstLocal);
}
...
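
The inline restriction is easy to violate by accident, so a quick text scan of the operator source can help before compiling. The following is a minimal sketch, assuming that each function declaration places its qualifiers on a single line. FindNonInlineAicore is a hypothetical helper and is not part of Ascend C or the BiSheng Compiler. It skips kernel entry functions, which carry __global__ and are not expected to be inline.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not an Ascend C or BiSheng Compiler API).
// Returns the 1-based numbers of lines that declare an __aicore__ function
// without the inline keyword. This is a plain substring scan, so any line
// that contains the text "inline" anywhere is treated as inline.
std::vector<int> FindNonInlineAicore(const std::string& source)
{
    std::vector<int> hits;
    std::istringstream in(source);
    std::string line;
    int lineNo = 0;
    while (std::getline(in, line)) {
        ++lineNo;
        // Ignore blank lines and lines that are only a // comment.
        const std::size_t first = line.find_first_not_of(" \t");
        if (first == std::string::npos) {
            continue;
        }
        if (line.compare(first, 2, "//") == 0) {
            continue;
        }
        const bool isAicore = line.find("__aicore__") != std::string::npos;
        // Kernel entry functions are marked __global__ and are skipped.
        const bool isKernelEntry = line.find("__global__") != std::string::npos;
        const bool isInline = line.find("inline") != std::string::npos;
        if (isAicore && !isKernelEntry && !isInline) {
            hits.push_back(lineNo);
        }
    }
    return hits;
}
```

Passing the text of the first example in this section to FindNonInlineAicore would report the lines that declare Process, CopyIn, Compute, and CopyOut. For the corrected version, it would return an empty list.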

The Ascend C programming model must be used properly.

The following example does not use the Ascend C programming model (such as the EnQue(), DeQue(), AllocTensor(), and FreeTensor() APIs), so automatic synchronization is not supported.

...
// The Ascend C programming model is not used.
__aicore__ inline void CopyIn()
{
    DataCopy(srcLocal, srcGlobal, 512);
}
__aicore__ inline void Compute()
{
    for (int i = 0; i < dstDataSize; i++) {
        dstLocal.SetValue(i, srcLocal.GetValue(i));
    }
}
__aicore__ inline void CopyOut()
{
    DataCopy(dstGlobal, dstLocal, 512);
}
private:
    TPipe pipe;
    LocalTensor<int32_t> srcLocal, dstLocal;
    GlobalTensor<int32_t> srcGlobal, dstGlobal;
    int dstDataSize = 512;
...

To enable automatic synchronization, rewrite the code as follows:

// Properly use memory management and synchronization control APIs such as EnQue(), DeQue(), AllocTensor(), and FreeTensor() according to the programming paradigm.
...
__aicore__ inline void CopyIn()
{
    LocalTensor<int32_t> srcLocal = inQueueSrc.AllocTensor<int32_t>();
    DataCopy(srcLocal, srcGlobal, 512);
    inQueueSrc.EnQue(srcLocal);
}
__aicore__ inline void Compute()
{
    LocalTensor<int32_t> srcLocal = inQueueSrc.DeQue<int32_t>();
    LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>();
    for (int i = 0; i < dstDataSize; i++) {
        dstLocal.SetValue(i, srcLocal.GetValue(i));
    }
    outQueueDst.EnQue<int32_t>(dstLocal);
    inQueueSrc.FreeTensor(srcLocal);
}
__aicore__ inline void CopyOut()
{
    LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>();
    DataCopy(dstGlobal, dstLocal, 512);
    outQueueDst.FreeTensor(dstLocal);
}
private:
    TPipe pipe;
    TQue<QuePosition::VECIN, 1> inQueueSrc;
    TQue<QuePosition::VECOUT, 1> outQueueDst;
    GlobalTensor<int32_t> srcGlobal, dstGlobal;
    int dstDataSize = 512;
...

Debug Logs for Automatic Synchronization

The BiSheng Compiler provides the --cce-auto-sync-log=<file> compilation option, which writes synchronization insertion information to <file> so that you can see exactly which synchronization instructions the compiler inserted into the operator file. To have the log report line numbers in the operator code file, compile the operator in debug mode (add the -g compilation option).

If you invoke the BiSheng Compiler directly, add this compilation option to the compilation command.
If you use the Ascend C kernel launch project, you can add this compilation option by using ascendc_compile_options.
If you use an Ascend C custom operator project, you can add this compilation option by using add_ops_compile_options.

The code file sync_log_test.h is as follows:

LocalTensor<T> dstLocal;
T ave_tmp = 0;
Vector_OP1(dstLocal, params); 
ave_tmp = dstLocal.GetValue(0);
Vector_OP2(dstLocal, params); 
for (int i = 0; i < ave_tmp; ++i) {
    dstLocal.SetValue(i,0);
}

With automatic synchronization enabled, synchronization instructions are inserted at the following positions:

LocalTensor<T> dstLocal;
T ave_tmp = 0;
Vector_OP1(dstLocal, params); 
SetFlag<HardEvent::V_S>(EVENT_ID0);
WaitFlag<HardEvent::V_S>(EVENT_ID0);
ave_tmp = dstLocal.GetValue(0);
PipeBarrier<PIPE_V>();
SetFlag<HardEvent::S_V>(EVENT_ID0);
WaitFlag<HardEvent::S_V>(EVENT_ID0);
Vector_OP2(dstLocal, params); 
SetFlag<HardEvent::V_S>(EVENT_ID0);
WaitFlag<HardEvent::V_S>(EVENT_ID0);
for (int i = 0; i < ave_tmp; ++i) {
    dstLocal.SetValue(i,0);
}
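
The insertion pattern above can be read as three rules: a Vector-to-Scalar handoff gets a V_S flag pair, a Scalar-to-Vector handoff gets an S_V flag pair, and a Vector operation that follows an earlier Vector operation gets a PipeBarrier<PIPE_V>() first. The sketch below is a toy model that reproduces this pattern for a list of operations that all touch the same tensor. It is an assumption made for illustration and does not describe the actual analysis performed by the BiSheng Compiler. Pipe, Op, and InsertSync are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only: every name here is hypothetical, and the real compiler
// analysis is not described in this document.
enum class Pipe { Vector, Scalar };

struct Op {
    std::string text;  // source line as written in the operator
    Pipe pipe;         // execution unit assumed to run this line
};

// Returns the operations interleaved with synchronization lines, assuming
// every operation reads or writes the same LocalTensor.
std::vector<std::string> InsertSync(const std::vector<Op>& ops)
{
    std::vector<std::string> out;
    bool seenVector = false;  // true once any Vector operation has been emitted
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (i > 0) {
            const Pipe prev = ops[i - 1].pipe;
            const Pipe cur = ops[i].pipe;
            // Vector after an earlier Vector operation: barrier within PIPE_V.
            if (cur == Pipe::Vector && seenVector) {
                out.push_back("PipeBarrier<PIPE_V>();");
            }
            // Vector result consumed by the Scalar Unit.
            if (prev == Pipe::Vector && cur == Pipe::Scalar) {
                out.push_back("SetFlag<HardEvent::V_S>(EVENT_ID0);");
                out.push_back("WaitFlag<HardEvent::V_S>(EVENT_ID0);");
            }
            // Scalar access must finish before the Vector Unit continues.
            if (prev == Pipe::Scalar && cur == Pipe::Vector) {
                out.push_back("SetFlag<HardEvent::S_V>(EVENT_ID0);");
                out.push_back("WaitFlag<HardEvent::S_V>(EVENT_ID0);");
            }
        }
        if (ops[i].pipe == Pipe::Vector) {
            seenVector = true;
        }
        out.push_back(ops[i].text);
    }
    return out;
}
```

Feeding the four operations of sync_log_test.h (Vector, Scalar, Vector, Scalar) into InsertSync yields the same sequence of seven synchronization lines in the same positions as the listing above.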

With debug logging for automatic synchronization enabled, the following log is generated:

The BiSheng Auto Sync log of sync_log_test :  
Position: absolute-path/sync_log_test.h:4 : line before insert sync : SetFlag<HardEvent::V_S>(EVENT_ID0);
Position: absolute-path/sync_log_test.h:4 : line before insert sync : WaitFlag<HardEvent::V_S>(EVENT_ID0);
Position: absolute-path/sync_log_test.h:5 : line before insert sync : PipeBarrier<PIPE_V>();
Position: absolute-path/sync_log_test.h:5 : line before insert sync : SetFlag<HardEvent::S_V>(EVENT_ID0);
Position: absolute-path/sync_log_test.h:5 : line before insert sync : WaitFlag<HardEvent::S_V>(EVENT_ID0);
Position: absolute-path/sync_log_test.h:6 : line before insert sync : SetFlag<HardEvent::V_S>(EVENT_ID0);
Position: absolute-path/sync_log_test.h:6 : line before insert sync : WaitFlag<HardEvent::V_S>(EVENT_ID0);

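"line before insert sync" indicates that the synchronization instruction is inserted immediately before the code on the reported line. For example, the two V_S entries for sync_log_test.h:4 are inserted just before line 4, ave_tmp = dstLocal.GetValue(0);.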
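
When a log grows long, it can help to load the entries into a small structure and group them by line number. The following is a minimal sketch that parses one entry, assuming that every entry has exactly the shape shown in the sample log above: the prefix "Position: ", a file:line location, and fields separated by " : ". SyncLogEntry and ParseSyncLogLine are hypothetical names and are not provided by the BiSheng Compiler.

```cpp
#include <cstddef>
#include <string>

// Hypothetical container for one log entry (not a BiSheng Compiler type).
struct SyncLogEntry {
    std::string file;         // path reported in the log
    int line = 0;             // line number before which the sync is inserted
    std::string instruction;  // inserted synchronization instruction
};

// Parses one log line of the assumed form
//   Position: <file>:<line> : line before insert sync : <instruction>
// Returns false for lines that do not match, such as the log header.
bool ParseSyncLogLine(const std::string& text, SyncLogEntry& entry)
{
    const std::string prefix = "Position: ";
    const std::string sep = " : ";
    if (text.compare(0, prefix.size(), prefix) != 0) {
        return false;
    }
    const std::size_t firstSep = text.find(sep, prefix.size());
    const std::size_t lastSep = text.rfind(sep);
    if (firstSep == std::string::npos || lastSep == firstSep) {
        return false;  // fewer than three fields
    }
    // The location field is "<file>:<line>"; split at the last colon.
    const std::string location =
        text.substr(prefix.size(), firstSep - prefix.size());
    const std::size_t colon = location.rfind(':');
    if (colon == std::string::npos || colon + 1 >= location.size()) {
        return false;
    }
    const std::string digits = location.substr(colon + 1);
    if (digits.find_first_not_of("0123456789") != std::string::npos) {
        return false;
    }
    entry.file = location.substr(0, colon);
    entry.line = std::stoi(digits);
    entry.instruction = text.substr(lastSep + sep.size());
    return true;
}
```

Calling ParseSyncLogLine on each line of the log file and collecting the entries for which it returns true gives a list that can be sorted or filtered by line to see every instruction inserted before a given statement.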
