Introduction to Synchronization Control
The synchronization control APIs let you implement synchronization control manually. When you program according to the programming model and paradigm described in Programming Model, you generally do not need to handle synchronization yourself, because the programming model completes synchronization control for you. Using the programming model and paradigm is recommended. Manual synchronization control increases programming complexity and is not recommended.
The TQueSync APIs differ from the synchronization control APIs provided in Inter-Core Synchronization: the APIs in Inter-Core Synchronization are of the ISASI type, which cannot guarantee compatibility across hardware versions, whereas the TQueSync APIs can.
Overview
Before getting to know synchronization control, you need to be familiar with the content in Figure 1.
Asynchronous parallel computing process of the AI Core: The Scalar Unit reads the instruction sequence and issues the vector computation, matrix computation, and data transfer instructions to the instruction queues of other units. The Vector Unit, Cube Unit, and DMA units asynchronously execute the received instructions in parallel. See the instruction streams shown in blue arrows in Figure 1.
Different instructions may depend on each other. To ensure that instructions in different instruction queues are executed based on the correct logic, the Scalar Unit also issues synchronization instructions to other units. For details about the synchronization process between units, see the synchronization signal streams shown in green arrows in Figure 1.
The internal data processing of the AI Core is as follows: The DMA copy-in unit moves data to Local Memory. The Vector/Cube Units complete data computation and write the computation result back to Local Memory. Then the DMA copy-out unit moves the processed data back to Global Memory. See the red data flows in Figure 1.
As shown in the preceding figure, the execution units in the AI Core run asynchronously in parallel. Reads and writes to the local memory may depend on each other, in which case synchronization control is required.
The following figure shows a common vector computing data flow.
- The DMA execution unit transfers data from the global memory to the local memory.
- The Vector unit performs the computation on the data in the local memory.
- The DMA execution unit then moves the computation result from the local memory to the global memory.

The four execution units, Scalar, Vector, DMA (VECIN), and DMA (VECOUT), run in parallel. When they access the same local memory, a synchronization mechanism is required to control the access order: data must be moved into the local memory before it is computed on, and moved out only after the computation is complete.
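The ordering that synchronization must preserve can be sketched on the host with plain C++. This is a minimal illustration, not Ascend C code: `AddOneTiled` is a hypothetical function, ordinary vectors stand in for Global Memory and Local Memory, and the three stages run sequentially here, whereas on the AI Core they run on separate units and need synchronization to achieve the same order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: plain vectors stand in for Global Memory and Local Memory.
// Adds 1 to every element, processing the input one tile at a time.
std::vector<float> AddOneTiled(const std::vector<float>& globalIn, std::size_t tileLen) {
    assert(tileLen > 0);  // a zero-length tile would never make progress
    std::vector<float> globalOut(globalIn.size());
    std::vector<float> local(tileLen);  // stands in for one Local Memory buffer
    for (std::size_t base = 0; base < globalIn.size(); base += tileLen) {
        const std::size_t len = std::min(tileLen, globalIn.size() - base);
        // 1. Copy-in (DMA, VECIN): Global Memory -> Local Memory.
        for (std::size_t i = 0; i < len; ++i) local[i] = globalIn[base + i];
        // 2. Compute (Vector): must not start before the copy-in has finished.
        for (std::size_t i = 0; i < len; ++i) local[i] += 1.0f;
        // 3. Copy-out (DMA, VECOUT): must not start before the compute has finished.
        for (std::size_t i = 0; i < len; ++i) globalOut[base + i] = local[i];
    }
    return globalOut;
}
```

Because the same `local` buffer is reused for every tile, the next copy-in must also wait until the previous copy-out has read the buffer. In this sequential sketch that holds trivially; with parallel units it is one of the dependencies that synchronization has to enforce.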

Pipelines
The AI Core has the following types of parallel instruction pipelines:
| Type | Meaning |
|---|---|
| PIPE_S | Scalar computation pipeline, which is used when Tensor GetValue is used. |
| PIPE_V | Vector compute pipeline and the data transfer pipeline from the L0C Buffer to the UB in some hardware architectures. |
| PIPE_M | Matrix computation pipeline. |
| PIPE_MTE1 | Data transfer pipeline from the L1 Buffer to the L0A Buffer and from the L1 Buffer to the L0B Buffer. |
| PIPE_MTE2 | Data transfer pipeline from the GM->L1 Buffer and GM->UB. |
| PIPE_MTE3 | Data transfer pipeline from the UB->GM. |
| PIPE_FIX | Data transfer pipeline from the L0C Buffer to the GM and from the L0C Buffer to the L1. |
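
As a quick reference, the data transfer rows of the table can be written as a lookup from a transfer path to the pipeline that carries it. This is only a sketch: `PipeForTransfer` is a hypothetical helper, not an Ascend C API, the pipeline names are returned as strings, and the architecture-dependent L0C Buffer to UB path on PIPE_V is left out.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper (not an Ascend C API): returns the pipeline that, according
// to the table, performs a transfer from src to dst, or "" if the path is not listed.
std::string PipeForTransfer(const std::string& src, const std::string& dst) {
    static const std::map<std::string, std::string> kTable = {
        {"L1->L0A", "PIPE_MTE1"},
        {"L1->L0B", "PIPE_MTE1"},
        {"GM->L1", "PIPE_MTE2"},
        {"GM->UB", "PIPE_MTE2"},
        {"UB->GM", "PIPE_MTE3"},
        {"L0C->GM", "PIPE_FIX"},
        {"L0C->L1", "PIPE_FIX"},
    };
    const auto it = kTable.find(src + "->" + dst);
    return it == kTable.end() ? std::string() : it->second;
}
```

For a vector operator, the copy-in and copy-out stages therefore run on different pipelines (GM to UB and UB to GM), which is why they can overlap and why their accesses to the same buffer need ordering.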
Synchronization Control Classification
Synchronization control is classified into two types:
- Multi-pipeline synchronization: Call SetFlag/WaitFlag of the TQueSync APIs or the SetFlag/WaitFlag(ISASI) APIs to control synchronization between different pipelines.
  - SetFlag: After all read and write operations of the preceding instructions are complete, this instruction is executed and sets the corresponding flag bit in hardware to 1.
  - WaitFlag: When this instruction is executed, if the corresponding flag bit is 0, the subsequent instructions in the queue are blocked. If the corresponding flag bit is 1, the bit is reset to 0 and the subsequent instructions are executed.
- Single-pipeline synchronization: Call PipeBarrier(ISASI) to constrain the execution order within a single pipeline. Instructions after the barrier cannot be issued until all instructions before the barrier have been committed.
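
The flag behavior described above can be sketched with a small host-side simulation. This is an illustrative model in standard C++, not Ascend C code: `Kind`, `Instr`, `RunPipelines`, and `DemoOrder` are hypothetical names, each queue is assumed to execute its instructions in order, and a round-robin scheduler stands in for the parallel execution units.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Hypothetical instruction model; these names are illustrative, not Ascend C APIs.
enum class Kind { Op, SetFlag, WaitFlag };

struct Instr {
    Kind kind;
    std::string name;  // label, used only by Op instructions
    int flagId;        // flag index, used only by SetFlag / WaitFlag
};

// Executes the per-pipeline queues round-robin, at most one instruction per
// pipeline per pass, and returns the order in which Op instructions ran.
std::vector<std::string> RunPipelines(std::vector<std::deque<Instr>> queues) {
    std::map<int, bool> flags;  // every flag bit starts at 0
    std::vector<std::string> log;
    bool progressed = true;
    while (progressed) {  // stops when all queues are empty or all are blocked
        progressed = false;
        for (auto& queue : queues) {
            if (queue.empty()) continue;
            const Instr head = queue.front();
            if (head.kind == Kind::WaitFlag) {
                if (!flags[head.flagId]) continue;  // bit is 0: this queue stays blocked
                flags[head.flagId] = false;         // bit is 1: reset it and proceed
            } else if (head.kind == Kind::SetFlag) {
                // In-order queue: all earlier instructions of this queue have completed.
                flags[head.flagId] = true;
            } else {
                log.push_back(head.name);
            }
            queue.pop_front();
            progressed = true;
        }
    }
    return log;
}

// Example: the vector queue is scheduled first, yet it must wait for the copy-in queue.
std::vector<std::string> DemoOrder() {
    const std::deque<Instr> vectorQueue = {{Kind::WaitFlag, "", 0}, {Kind::Op, "Add", 0}};
    const std::deque<Instr> copyInQueue = {{Kind::Op, "CopyIn", 0}, {Kind::SetFlag, "", 0}};
    return RunPipelines({vectorQueue, copyInQueue});
}
```

In `DemoOrder`, `Add` runs only after `CopyIn`, because the `WaitFlag` at the head of the vector queue blocks that queue until the copy-in queue reaches its `SetFlag`. The model does not cover PipeBarrier, since each simulated queue already executes strictly in order.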
Requirements on Enabling Manual Synchronization Control
- Vector unit
  - Single-pipeline synchronization: Synchronization within PIPE_V is inserted automatically by the compiler. If the transfer addresses of PIPE_MTE2 and PIPE_MTE3 overlap, you need to insert synchronization manually. For details, see Precautions.
  - Multi-pipeline synchronization: The multi-pipeline synchronization between PIPE_V, PIPE_MTE2, PIPE_MTE3, and PIPE_S is bidirectional. As shown in the following figure, the yellow arrows indicate synchronization that is automatically inserted by the compiler, and the remaining synchronization is completed by the Ascend C framework. Exception for

- Cube unit
All pipeline synchronization on the Cube is completed by the Ascend C framework. You do not need to manually insert synchronization.
