SyncAll

Supported Products

Product

Supported (Soft Synchronization Prototype)

Supported (Hard Synchronization Prototype)

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

x

Atlas inference product's AI Core

x

Atlas inference product's Vector Core

x

x

Atlas training products

x

Functions

When different AI Cores operate the same global memory block, this function can be called to synchronize the AI Cores to avoid data dependency problems such as write-after-read, read-after-write, and write-after-write. Currently, multi-core synchronization is classified into hardware synchronization and software synchronization. Hardware synchronization uses the full-core synchronization instruction of the hardware to ensure multi-core synchronization. Software synchronization is implemented through software algorithm simulation.

Prototype

  • Soft synchronization:
    1
    2
    template <bool isAIVOnly = true>
    __aicore__ inline void SyncAll(const GlobalTensor<int32_t>& gmWorkspace, const LocalTensor<int32_t>& ubWorkspace, const int32_t usedCores = 0)
    
  • Hard synchronization:
    1
    2
    template <bool isAIVOnly = true>
    __aicore__ inline void SyncAll()
    

Parameters

Table 1 Parameters in the template

Parameter

Description

isAIVOnly

Controls whether SyncAll is applied to pure Vector operators or fused (Cube and Vector) operators. The values are as follows:

  • true (default value): Full-core synchronization of pure Vector operators. Only full-core synchronization of Vector cores is performed.
  • false: full-core synchronization of fused operators. The full-core synchronization of vector cores and cube cores is performed separately, and then the synchronization between the two cores is performed. (This function is not supported by the soft synchronization API.)
Table 2 Parameters

Parameter

Input/Output

Meaning

gmWorkspace

Input

gmWorkspace is a user-defined global space and serves as the cache shared by all cores. It is used to store the status flag of each core. The type is GlobalTensor and the supported data type is int32_t. For details about the definition of the GlobalTensor data structure, see GlobalTensor.

For details about the required space and precautions, see Constraints.

The hardware synchronization API does not support this parameter.

ubWorkspace

Input

ubWorkspace is user-defined local space. It is used by each core independently to mark the status of the current core.

Type: LocalTensor, and the supported TPosition is VECIN/VECCALC/VECOUT. Supported data type: int32_t.

For details about the required space, see Constraints.

The hardware synchronization API does not support this parameter.

usedCores

Input

Number of cores to be synchronized. The input value cannot exceed the logical blockDim value specified during operator calling. This parameter is used by default. If this parameter is not passed in, full-core soft synchronization is enabled.

This parameter is supported only in the soft synchronization API.

Returns

None

Constraints

  • The space allocated for the gmWorkspace cache must be greater than or equal to the number of cores multiplied by 32 bytes, and the cache value must be initialized to 0. Currently, there are two common initialization modes.
    • Perform initialization on the host to ensure that the gmWorkspace cache has been initialized to 0 when this API is transferred.
    • Initialize the gmWorkspace cache during kernel initialization. Note that all gmWorkspace cache space needs to be initialized on each core.
  • The space allocated for ubWorkspace must be greater than or equal to the number of cores multiplied by 32 bytes.
  • When this API is used for multi-core control, the logical blockDim specified during operator calling must be less than or equal to the number of cores for running the operator. Otherwise, the framework inserts abnormal synchronization during multi-round scheduling, causing the kernel to stop responding.
  • In separated mode, you are advised to use the hard synchronization API instead of the soft synchronization API. The soft synchronization API is applicable only to pure vector scenarios and has low performance. When using the hard synchronization API, you need to set the kernel type based on the scenario.
    • In pure vector/cube scenarios, set the kernel type to KERNEL_TYPE_MIX_AIV_1_0 or KERNEL_TYPE_MIX_AIC_1_0.
    • In the scenario where Vector and Cube are used together, the kernel type needs to be configured based on the actual situation.

Example

In this example, eight cores are used for data processing. Each core processes 32 pieces of float-type data. The data is multiplied by 2 and then added to the data on other cores that are multiplied by 2 in the same way. The intermediate result is saved to workGm. Therefore, data needs to be synchronized between multiple cores. In this example, when software synchronization is used, the value of syncGm passed by the entrypoint function has been initialized to 0 on the host. If hardware synchronization is used in the following test cases, syncGm and workQueue do not need to be transferred.
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
#include "kernel_operator.h"

const int32_t DEFAULT_SYNCALL_NEED_SIZE = 8;

class KernelSyncAll {
public:
    __aicore__ inline KernelSyncAll() {}
    __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm, __gm__ uint8_t* workGm,
        __gm__ uint8_t* syncGm)
    {
        blockNum = AscendC::GetBlockNum(); // Obtain the total number of cores.
        perBlockSize = srcDataSize / blockNum; // Each core evenly processes the same number of pieces of data.
        blockIdx = AscendC::GetBlockIdx(); // Obtain the ID of the current working core.
        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(srcGm + blockIdx * perBlockSize * sizeof(float)),
            perBlockSize);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(dstGm + blockIdx * perBlockSize * sizeof(float)),
            perBlockSize);
        workGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float*>(workGm), srcDataSize);
        syncGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ int32_t*>(syncGm), blockNum * DEFAULT_SYNCALL_NEED_SIZE);
        pipe.InitBuffer(inQueueSrc1, 1, perBlockSize * sizeof(float));
        pipe.InitBuffer(inQueueSrc2, 1, perBlockSize * sizeof(float));
        pipe.InitBuffer(workQueue, 1, blockNum * DEFAULT_SYNCALL_NEED_SIZE * sizeof(int32_t));
        pipe.InitBuffer(outQueueDst, 1, perBlockSize * sizeof(float));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        FirstCompute();
        CopyToWorkGlobal(); // Save the data computed by the current working core to the external workspace.
        // Wait until all cores complete the computation.
        AscendC::LocalTensor<int32_t> workLocal = workQueue.AllocTensor<int32_t>();
        AscendC::SyncAll(syncGlobal, workLocal);
        workQueue.FreeTensor(workLocal);
        // The final addition result needs to be computed after all cores complete the computation.
        AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.DeQue<float>();
        AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        AscendC::DataCopy(dstLocal,srcLocal2,perBlockSize); // Save the data computed by the current working core to the destination space.
        inQueueSrc2.FreeTensor(srcLocal2);
        for (int i = 0; i < blockNum; i++) {
            if (i != blockIdx) {
                CopyFromOtherCore(i); // Read data from the external workspace.
                Accumulate(dstLocal); // All data is added to the destination space.
            }
        }
        outQueueDst.EnQue(dstLocal);
        CopyOut();
    }
private:
    __aicore__ inline void CopyToWorkGlobal()
    {
        AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        AscendC::DataCopy(workGlobal[blockIdx * perBlockSize], dstLocal, perBlockSize);
        outQueueDst.FreeTensor(dstLocal);
    }
    __aicore__ inline void CopyFromOtherCore(int index)
    {
        AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>();
        AscendC::DataCopy(srcLocal, workGlobal[index * perBlockSize], perBlockSize);
        inQueueSrc1.EnQue(srcLocal);
    }
    __aicore__ inline void Accumulate(const AscendC::LocalTensor<float> &dstLocal)
    {
        AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>();
        AscendC::Add(dstLocal, dstLocal, srcLocal1, perBlockSize);
        inQueueSrc1.FreeTensor(srcLocal1);
    }
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<float> srcLocal = inQueueSrc1.AllocTensor<float>();
        AscendC::DataCopy(srcLocal, srcGlobal, perBlockSize);
        inQueueSrc1.EnQue(srcLocal);
    }
    __aicore__ inline void FirstCompute()
    {
        AscendC::LocalTensor<float> srcLocal1 = inQueueSrc1.DeQue<float>();
        AscendC::LocalTensor<float> srcLocal2 = inQueueSrc2.AllocTensor<float>();
        AscendC::LocalTensor<float> dstLocal = outQueueDst.AllocTensor<float>();
        float scalarValue(2.0);
        AscendC::Muls(dstLocal, srcLocal1, scalarValue, perBlockSize);
        AscendC::PipeBarrier<PIPE_V>();
        AscendC::DataCopy(srcLocal2,dstLocal,perBlockSize);
        inQueueSrc1.FreeTensor(srcLocal1);
        inQueueSrc2.EnQue(srcLocal2);
        outQueueDst.EnQue(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> dstLocal = outQueueDst.DeQue<float>();
        AscendC::DataCopy(dstGlobal, dstLocal, perBlockSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc2;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueue;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<float> srcGlobal;
    AscendC::GlobalTensor<float> dstGlobal;
    AscendC::GlobalTensor<float> workGlobal;
    AscendC::GlobalTensor<int32_t> syncGlobal;
    int srcDataSize = 256;
    int32_t blockNum = 0;
    int32_t blockIdx = 0;
    uint32_t perBlockSize = 0;
};

extern "C" __global__ __aicore__ void kernel_syncAll_float(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm,
    __gm__ uint8_t* workGm, __gm__ uint8_t* syncGm)
{
    KernelSyncAll op;
    op.Init(srcGm, dstGm, workGm, syncGm);
    op.Process();
}
Input (srcGm):
[1,1,1,1,1,...,1]
Output (dstGm):
[16,16,16,16,16,...,16]