SyncAll

Function Usage

When multiple AI Cores operate on the same block of global memory, this function can be called to synchronize the cores and avoid data dependency hazards such as read-after-write, write-after-read, and write-after-write. Multi-core synchronization is currently provided in two forms: hard (hardware) synchronization and soft (software) synchronization. Hard synchronization relies on the hardware's full-core synchronization instruction, whereas soft synchronization simulates the synchronization in software.

Prototype

  • Soft synchronization:
    template <bool isAIVOnly = true>
    __aicore__ inline void SyncAll(const GlobalTensor<int32_t>& gmWorkspace, const LocalTensor<int32_t>& ubWorkspace, const int32_t usedCores = 0)
    
  • Hard synchronization:
    template<bool isAIVOnly = true>
    __aicore__ inline void SyncAll()
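
The hard-synchronization overload takes no workspace arguments; the template parameter only selects which cores take part. As a minimal sketch (the placement inside a kernel is illustrative, not taken from this page's example):

    AscendC::SyncAll();        // synchronize vector (AIV) cores only; isAIVOnly defaults to true
    AscendC::SyncAll<false>(); // synchronize cube and vector cores together (MIXCORE scenario)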
    

Parameters

Table 1 Parameters of the multi-core synchronization API

  • gmWorkspace (Input): A user-defined global memory buffer shared by all cores as a cache, used to store the status flag of each core. The type is GlobalTensor and the supported data type is int32_t. For the definition of the GlobalTensor data structure, see GlobalTensor. For the required space and precautions, see Constraints. The hardware synchronization API does not take this parameter.
  • ubWorkspace (Input): A user-defined local memory buffer used independently by each core to mark the status of the current core. The type is LocalTensor; the supported TPosition values are VECIN, VECCALC, and VECOUT; the supported data type is int32_t. For the required space, see Constraints. The hardware synchronization API does not take this parameter.
  • usedCores (Input): Number of cores to be synchronized. The value cannot exceed the logical blockDim specified when the operator is called. This parameter is optional; if it is not passed in, soft synchronization is performed across all cores. Only the soft synchronization API supports this parameter.
  • isAIVOnly (Input): Whether synchronization is performed only among vector (AIV) cores. The default value is true. To synchronize in the MIXCORE scenario, set this parameter to false.
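
For orientation, the following minimal sketch shows how these parameters are typically prepared for the soft-synchronization overload. It follows the pattern of the full example at the end of this page; the names syncGm, pipe, workQueue, and usedCores are assumptions made for illustration, and the buffer sizes follow the Constraints section:

    const int32_t SYNC_INT32_PER_CORE = 8;                        // 8 x int32_t = 32 bytes of flags per core
    AscendC::GlobalTensor<int32_t> syncGlobal;                    // gmWorkspace: shared by all cores
    syncGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ int32_t*>(syncGm),
        AscendC::GetBlockNum() * SYNC_INT32_PER_CORE);
    pipe.InitBuffer(workQueue, 1, AscendC::GetBlockNum() * SYNC_INT32_PER_CORE * sizeof(int32_t));
    AscendC::LocalTensor<int32_t> ubWorkspace = workQueue.AllocTensor<int32_t>(); // per-core status flag
    AscendC::SyncAll(syncGlobal, ubWorkspace);                    // full-core soft synchronization
    // AscendC::SyncAll(syncGlobal, ubWorkspace, usedCores);      // or synchronize only usedCores cores
    workQueue.FreeTensor(ubWorkspace);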

Returns

None

Availability

Soft synchronization:

  • Atlas Training Series Product

Hard synchronization:

Constraints

  • The gmWorkspace buffer must be at least the number of cores multiplied by 32 bytes, and its contents must be initialized to 0. There are currently two common initialization approaches (see the sketch after this list):
    • Initialize on the host, so that the gmWorkspace buffer is already zeroed when it is passed to this API.
    • Initialize the gmWorkspace buffer during kernel initialization. Note that each core needs to initialize the entire gmWorkspace buffer.
  • The space allocated for ubWorkspace must be at least the number of cores multiplied by 32 bytes.
  • Currently, the hardware synchronization API cannot be used in the kernel launch project; it can be used only in the custom operator project. In addition, the workspace size in the Tiling function cannot be set to 0.
  • When this API is used for multi-core control, the logical blockDim specified when the operator is called must be less than or equal to the number of cores that actually run the operator. Otherwise, the synchronization inserted by the framework during multi-round scheduling behaves abnormally and the kernel stops responding.
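
As a hedged illustration of the first initialization approach, the host side could zero the device buffer that backs gmWorkspace before launching the kernel. aclrtMalloc and aclrtMemset are standard AscendCL runtime calls (declared in "acl/acl.h"); blockDim and syncDevice are names assumed here for illustration:

    size_t syncSize = blockDim * 32;                               // >= number of cores x 32 bytes
    void *syncDevice = nullptr;
    aclrtMalloc(&syncDevice, syncSize, ACL_MEM_MALLOC_HUGE_FIRST); // allocate the gmWorkspace buffer on the device
    aclrtMemset(syncDevice, syncSize, 0, syncSize);                // initialize it to 0 before launching the kernel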

Example

In this example, eight cores are used for data processing and each core processes 32 float elements. Every core multiplies its own data by 2 and saves the intermediate result to workGm; it then adds the corresponding doubled results produced by the other cores, so the cores must be synchronized before the accumulation starts. The example uses soft synchronization, and the syncGm buffer passed to the kernel entry function has already been initialized to 0 on the host. If hardware synchronization were used instead, syncGm and the workQueue would not need to be passed in.
#include "kernel_operator.h"

const int32_t DEFAULT_SYNCALL_NEED_SIZE = 8;

class KernelSyncAll {
public:
    __aicore__ inline KernelSyncAll() {}
    __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm, __gm__ uint8_t* workGm,
        __gm__ uint8_t* syncGm)
    {
        blockNum = AscendC::GetBlockNum(); // Obtain the total number of cores.
        perBlockSize = srcDataSize / blockNum; // Each core evenly processes the same number of pieces of data.
        blockIdx = AscendC::GetBlockIdx(); // Obtain the ID of the current working core.
        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(srcGm + blockIdx * perBlockSize * sizeof(float)),
            perBlockSize);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(dstGm + blockIdx * perBlockSize * sizeof(float)),
            perBlockSize);
        workGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(workGm), srcDataSize);
        syncGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ int32_t*>(syncGm), blockNum * DEFAULT_SYNCALL_NEED_SIZE);
        pipe.InitBuffer(inQueueSrc1, 1, perBlockSize * sizeof(float));
        pipe.InitBuffer(inQueueSrc2, 1, perBlockSize * sizeof(float));
        pipe.InitBuffer(workQueue, 1, blockNum * DEFAULT_SYNCALL_NEED_SIZE * sizeof(int32_t));
        pipe.InitBuffer(outQueueDst, 1, perBlockSize * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        FirstCompute();
        CopyToWorkGlobal(); // Save the data computed by the current working core to the external workspace.
        // Wait until all cores complete the computation.
        AscendC::LocalTensor<int32_t> workLocal = workQueue.AllocTensor<int32_t>();
        AscendC::SyncAll(syncGlobal, workLocal);
        workQueue.FreeTensor(workLocal);
        // The final addition result can be computed only after the computation on all cores is complete.
        AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.DeQue<float>();
        AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        AscendC::DataCopy(dstLocal, srcLocal2, perBlockSize); // Initialize the destination tensor with the data computed by the current core.
        inQueueSrc2.FreeTensor(srcLocal2);
        for (int i = 0; i < blockNum; i++) {
            if (i != blockIdx) {
                CopyFromOtherCore(i); // Read data from the external workspace.
                Accumulate(dstLocal); // All data is added to the destination space.
            }
        }
        outQueueDst.EnQue(dstLocal);
        CopyOut();
    }
private:
    __aicore__ inline void CopyToWorkGlobal()
    {
        AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        AscendC::DataCopy(workGlobal[blockIdx * perBlockSize], dstLocal, perBlockSize);
        outQueueDst.FreeTensor(dstLocal);
    }
    __aicore__ inline void CopyFromOtherCore(int index)
    {
        AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>();
        AscendC::DataCopy(srcLocal, workGlobal[index * perBlockSize], perBlockSize);
        inQueueSrc1.EnQue(srcLocal);
    }
    __aicore__ inline void Accumulate(const AscendC::LocalTensor<float> &dstLocal)
    {
        AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>();
        AscendC::Add(dstLocal, dstLocal, srcLocal1, perBlockSize);
        inQueueSrc1.FreeTensor(srcLocal1);
    }
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>();
        AscendC::DataCopy(srcLocal, srcGlobal, perBlockSize);
        inQueueSrc1.EnQue(srcLocal);
    }
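    // Multiply the data of the current core by 2, keep a copy in inQueueSrc2 for the final accumulation,
    // and enqueue the doubled data to outQueueDst so that it can be written to workGm.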
    __aicore__ inline void FirstCompute()
    {
        AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>();
        AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.AllocTensor<float>();
        AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        float scalarValue(2.0);
        AscendC::Muls(dstLocal, srcLocal1, scalarValue, perBlockSize);
        AscendC::PipeBarrier<PIPE_V>();
        AscendC::DataCopy(srcLocal2, dstLocal, perBlockSize); // Keep a copy of the doubled data for the final accumulation.
        inQueueSrc1.FreeTensor(srcLocal1);
        inQueueSrc2.EnQue(srcLocal2);
        outQueueDst.EnQue(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        AscendC::DataCopy(dstGlobal, dstLocal, perBlockSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc2;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<float> srcGlobal;
    AscendC::GlobalTensor<float> dstGlobal;
    AscendC::GlobalTensor<float> workGlobal;
    AscendC::GlobalTensor<int32_t> syncGlobal;
    int srcDataSize = 256;
    int32_t blockNum = 0;
    int32_t blockIdx = 0;
    uint32_t perBlockSize = 0;
};

extern "C" __global__ __aicore__ void kernel_syncAll_float(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm,
    __gm__ uint8_t* workGm, __gm__ uint8_t* syncGm)
{
    KernelSyncAll op;
    op.Init(srcGm, dstGm, workGm, syncGm);
    op.Process();
}
Input (srcGm):
[1,1,1,1,1,...,1]
Output (dstGm):
[16,16,16,16,16,...,16]
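Each input element is multiplied by 2 on its own core, and the eight doubled values (one per core) are then accumulated, so every output element equals 2 × 8 = 16.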