Introduction to Synchronization Control

The synchronization control APIs let you manage synchronization manually. In most cases you do not need to: when you program with the programming model and paradigms described in Programming Model, the model handles synchronization for you. That approach is recommended. Manual synchronization control increases programming complexity and is not recommended. The TQueSync APIs differ from the synchronization control APIs in SetFlag/WaitFlag(ISASI) in one respect: the APIs in SetFlag/WaitFlag(ISASI) are of the ISASI type and are not guaranteed to be compatible across hardware versions, whereas the TQueSync APIs are.

Overview

To understand synchronization control, first become familiar with the architecture shown in Figure 1.

Asynchronous parallel computing process of the AI Core: The Scalar Unit reads the instruction sequence and issues the vector computation, matrix computation, and data transfer instructions to the instruction queues of other units. The Vector Unit, Cube Unit, and DMA units asynchronously execute the received instructions in parallel. See the instruction streams shown in blue arrows in Figure 1.

Different instructions may depend on each other. To ensure that instructions in different instruction queues are executed based on the correct logic, the Scalar Unit also issues synchronization instructions to other units. For details about the synchronization process between units, see the synchronization signal streams shown in green arrows in Figure 1.

The internal data processing of the AI Core is as follows: The DMA copy-in unit moves data to Local Memory. The Vector/Cube Units perform the computation and write the result back to Local Memory. The DMA copy-out unit then moves the processed data back to Global Memory. See the red data flows in Figure 1.

Figure 1 Abstraction of the AI Core's internal parallel computing architecture

As Figure 1 shows, the execution units in the AI Core run asynchronously and in parallel. When they read or write the same local memory, their accesses can depend on one another, and synchronization control is required.
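The issue model described above can be sketched on the host in plain C++. This is only an illustrative model, not Ascend C code: `Unit`, `Instr`, `Dispatch`, and `SampleProgram` are hypothetical names, and the instruction labels are made up. It shows the Scalar Unit's role of walking the program in order and appending each instruction to the queue of the unit that executes it.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical host-side model; none of these names are Ascend C APIs.
enum class Unit { Vector, Cube, DmaIn, DmaOut };

struct Instr {
    Unit unit;         // execution unit whose queue receives the instruction
    std::string name;  // label used for illustration only
};

// Models the Scalar Unit: walk the program in order and append each
// instruction to the queue of the unit that will execute it.
std::map<Unit, std::vector<std::string>> Dispatch(const std::vector<Instr>& program) {
    std::map<Unit, std::vector<std::string>> queues;
    for (const Instr& instr : program) {
        queues[instr.unit].push_back(instr.name);
    }
    return queues;
}

// A two-tile vector program: copy in, compute, copy out, once per tile.
std::vector<Instr> SampleProgram() {
    return {
        {Unit::DmaIn, "copy_in_0"}, {Unit::Vector, "add_0"}, {Unit::DmaOut, "copy_out_0"},
        {Unit::DmaIn, "copy_in_1"}, {Unit::Vector, "add_1"}, {Unit::DmaOut, "copy_out_1"},
    };
}
```

Calling `Dispatch(SampleProgram())` yields three queues, each holding its two instructions in program order. Nothing in the queues records that `add_0` must wait for `copy_in_0`. That cross-queue dependency is what the synchronization instructions express.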

The following figure shows a common vector compute data flow:

1. The DMA execution unit transfers data from the global memory to the local memory.
2. Computation is performed.
3. The DMA execution unit transfers the compute result from the local memory to the global memory.

The four execution units, Scalar, Vector, DMA (VECIN), and DMA (VECOUT), run in parallel. When they access the same local memory, a synchronization mechanism must control the order of access: data must be moved into the local memory before it is computed on, and moved out only after the computation finishes.
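As a rough host-side analogy for this ordering requirement, the sketch below runs the three steps on separate threads that share one buffer. It is not Ascend C code. The threads stand in for the DMA (VECIN), Vector, and DMA (VECOUT) units, `std::promise`/`std::future` stand in for the synchronization signals, the function name `RunVectorFlow` is hypothetical, and doubling each element is an arbitrary placeholder computation.

```cpp
#include <future>
#include <thread>
#include <vector>

// Hypothetical host-side analogy; not Ascend C API.
std::vector<int> RunVectorFlow(const std::vector<int>& globalIn) {
    std::vector<int> local(globalIn.size());      // shared "local memory"
    std::vector<int> globalOut(globalIn.size());  // destination "global memory"

    std::promise<void> copyInDone;   // fulfilled once the input is in local memory
    std::promise<void> computeDone;  // fulfilled once the result is in local memory
    std::future<void> dataReady = copyInDone.get_future();
    std::future<void> resultReady = computeDone.get_future();

    std::thread dmaIn([&] {
        local = globalIn;             // step 1: global -> local
        copyInDone.set_value();
    });
    std::thread vec([&] {
        dataReady.wait();             // never read local before copy-in finishes
        for (int& x : local) x *= 2;  // step 2: compute in place (placeholder)
        computeDone.set_value();
    });
    std::thread dmaOut([&] {
        resultReady.wait();           // never copy out before compute finishes
        globalOut = local;            // step 3: local -> global
    });

    dmaIn.join();
    vec.join();
    dmaOut.join();
    return globalOut;
}
```

All three threads start at once, as the execution units do, yet `RunVectorFlow({1, 2, 3})` always returns `{2, 4, 6}` because each stage waits for the signal from the stage before it. Without the two waits, the threads would race on `local` and the result would be undefined.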

Pipelines

The following describes the parallel instruction pipelines in the AI Core.

The specific pipelines contained in each hardware pipeline type vary according to the hardware architecture. For details, see Hardware Implementation.

**Table 1** Description of instruction pipelines
| Type | Meaning |
| --- | --- |
| PIPE_S | Scalar computation pipeline. Used when Tensor GetValue is called. |
| PIPE_V | Vector computation pipeline. On some hardware architectures, also the data transfer pipeline from L0C Buffer to UB. |
| PIPE_M | Matrix computation pipeline. |
| PIPE_MTE1 | Data transfer pipeline from L1 Buffer to L0A Buffer and from L1 Buffer to L0B Buffer. |
| PIPE_MTE2 | Data transfer pipeline from GM to L1 Buffer, from GM to UB, and other paths. |
| PIPE_MTE3 | Data transfer pipeline from UB to GM and other paths. |
| PIPE_FIX | Data transfer pipeline from L0C Buffer to GM, from L0C Buffer to L1 Buffer, and other paths. |

Synchronization Control Classification

Synchronization control falls into two types:

Multi-pipeline synchronization: Call SetFlag/WaitFlag of the TQueSync APIs, or the APIs in SetFlag/WaitFlag(ISASI), to synchronize different pipelines.
- SetFlag: After all read and write operations of the preceding instructions are completed, this instruction is executed and sets the corresponding flag bit in hardware to 1.
- WaitFlag: When this instruction is executed, if the corresponding flag bit is 0, the subsequent instructions in the queue are blocked. If the flag bit is 1, it is reset to 0 and the subsequent instructions start to execute.
Single-pipeline synchronization: Call PipeBarrier(ISASI) to constrain the execution order within a single pipeline. Instructions after the barrier cannot be issued until all instructions before the barrier have been committed.
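The flag semantics can be modeled on the host as a minimal sketch. This is not the TQueSync or ISASI API. `FlagBit`, `StreamTiles`, `dataReady`, and `bufferFree` are hypothetical names, a mutex and condition variable stand in for the hardware flag bit, and two threads stand in for a copy-in pipeline and a compute pipeline. The example assumes a single one-slot buffer, so it needs a flag in each direction: one to signal that data has arrived and one to signal that the buffer may be overwritten.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical host-side model of one hardware flag bit; not Ascend C API.
class FlagBit {
public:
    // Model of SetFlag: called after the caller's earlier work is done; sets the bit to 1.
    void Set() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            bit_ = true;
        }
        cv_.notify_one();
    }
    // Model of WaitFlag: block while the bit is 0; once it is 1, reset it to 0 and continue.
    void Wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return bit_; });
        bit_ = false;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool bit_ = false;
};

// Streams every tile through a single shared one-slot buffer.
std::vector<int> StreamTiles(const std::vector<int>& tiles) {
    int local = 0;       // one-slot "local memory"
    FlagBit dataReady;   // copy-in pipeline -> compute pipeline
    FlagBit bufferFree;  // compute pipeline -> copy-in pipeline (reverse direction)
    std::vector<int> out;

    bufferFree.Set();    // the buffer starts out free
    std::thread copyIn([&] {
        for (int tile : tiles) {
            bufferFree.Wait();  // do not overwrite a tile that is still in use
            local = tile;
            dataReady.Set();
        }
    });
    std::thread compute([&] {
        for (std::size_t i = 0; i < tiles.size(); ++i) {
            dataReady.Wait();          // do not read before the tile has arrived
            out.push_back(local + 1);  // placeholder computation
            bufferFree.Set();
        }
    });
    copyIn.join();
    compute.join();
    return out;
}
```

`StreamTiles({1, 2, 3, 4})` returns `{2, 3, 4, 5}`. Because `Wait` resets the bit to 0, the same two flags can be reused on every loop iteration. Removing the reverse flag would let the copy-in thread overwrite `local` before the compute thread has read it.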

Requirements on Enabling Manual Synchronization Control

Vector unit
- Single-pipeline synchronization: For PIPE_V, the compiler inserts synchronization automatically. If the addresses of data transfers on PIPE_MTE2 or PIPE_MTE3 overlap, you must insert synchronization manually. For details, see the precautions.
- Multi-pipeline synchronization: Synchronization between PIPE_V, PIPE_MTE2, PIPE_MTE3, and PIPE_S is bidirectional. As shown in the following figure, the yellow arrows indicate synchronization that the compiler inserts automatically. The remaining synchronization is handled by the Ascend C framework.
Cube unit
All pipeline synchronization on the Cube is completed by the Ascend C framework. You do not need to manually insert synchronization.

The synchronization that the compiler inserts automatically depends strongly on local tensors. In some scenarios, you must insert the synchronization manually. For details about the restrictions, see the BiSheng Compiler User Guide.
