Synchronization Control
The TQueSync APIs provide synchronization control. In general, you do not need to manage synchronization yourself when programming against the model and paradigms described in Programming Model, because the programming model handles synchronization control for you; using the programming model and paradigms is recommended. Manual synchronization control increases programming complexity and is therefore discouraged. The TQueSync APIs differ from the synchronization control APIs in Inter-Core Synchronization in that the latter are ISASI-type APIs, which cannot guarantee compatibility across hardware versions, whereas the TQueSync APIs can.
Overview
Before diving into synchronization control, familiarize yourself with the content in Figure 1.
Asynchronous parallel computing process of the AI Core: The Scalar Unit reads the instruction sequence and issues the vector computation, matrix computation, and data transfer instructions to the instruction queues of other units. The Vector Unit, Cube Unit, and DMA units asynchronously execute the received instructions in parallel. See the instruction streams shown in blue arrows in Figure 1.
Different instructions may depend on each other. To ensure that instructions in different instruction queues are executed based on the correct logic, the Scalar Unit also issues synchronization instructions to other units. For details about the synchronization process between units, see the synchronization signal streams shown in green arrows in Figure 1.
The internal data processing of the AI Core is as follows: The DMA copy-in unit moves data into the local memory, the Vector/Cube Units perform the computation and write the result back to the local memory, and the DMA copy-out unit then moves the processed data back to the global memory. See the data streams shown in red arrows in Figure 1.
As shown in the preceding figure, the execution units in the AI Core run asynchronously and in parallel. Dependencies can arise when they read or write the local memory, which requires synchronization control.
The following figure shows common data streams for Vector computation. The DMA unit moves data from the global memory to the local memory for computation, and then moves the computation result from the local memory to the global memory.

The four execution units (Scalar, Vector, DMA (VECIN), and DMA (VECOUT)) execute in parallel. If they access the same local memory, a synchronization mechanism is required to control the access timing: data must be moved into the local memory before computation, and the result must be moved out only after computation completes.

Hardware Pipelines
The following describes the parallel instruction pipelines in the AI Core. For details about the hardware pipelines, see Architecture.
| Type | Description |
|---|---|
| PIPE_S | Scalar computation pipeline, which is used when Tensor GetValue is used. |
| PIPE_V | Vector computation pipeline and data movement pipeline for L0C -> UB. |
| PIPE_M | Matrix computation pipeline. |
| PIPE_MTE1 | L1 -> L0A and L1 -> L0B data movement pipelines. |
| PIPE_MTE2 | Data movement pipelines for GM -> L1, GM -> L0A, GM -> L0B, and GM -> UB. |
| PIPE_MTE3 | Data movement pipelines for UB -> GM and UB -> L1. |
| PIPE_FIX | Data movement pipelines for L0C -> GM and L0C -> L1 (not supported in the current version). |
Synchronization Control Classification
Synchronization control is classified into two types based on the pipelines involved:
- Multi-pipeline synchronization: call SetFlag/WaitFlag of the TQueSync APIs or the SetFlag/WaitFlag (ISASI) APIs to synchronize instructions across different pipelines.
- SetFlag: After all read and write operations preceding this instruction in its pipeline are complete, the corresponding flag bit in hardware is set to 1.
- WaitFlag: When this instruction executes, if the corresponding flag bit is 0, subsequent instructions in the queue are blocked; once the bit becomes 1, it is reset to 0 and the subsequent instructions proceed.
- Single-pipeline synchronization: call PipeBarrier (ISASI) to restrict the execution order within a single pipeline. Instructions after the barrier do not issue until all instructions before the barrier have committed.
Requirements for Enabling Manual Synchronization Control
- Vector Unit
- Single-pipeline synchronization: synchronization within PIPE_V is inserted automatically by the compiler. If the movement addresses of PIPE_MTE2 and PIPE_MTE3 overlap, you need to insert the synchronization yourself. For details, see Precautions.
- Multi-pipeline synchronization: the multi-pipeline synchronization among PIPE_V, PIPE_MTE2, PIPE_MTE3, and PIPE_S is bidirectional. As shown in the following figure, the yellow arrows indicate synchronization inserted automatically by the compiler; the remaining synchronization is handled by the Ascend C framework.

- Cube Unit
All pipeline synchronization for the Cube Unit is handled by the Ascend C framework; you do not need to insert synchronization manually.
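For the Vector Unit case above, where the movement addresses of PIPE_MTE2 and PIPE_MTE3 overlap, the manual synchronization can be sketched as follows. This is illustrative pseudocode, not a complete kernel: the event ID, buffer names, and copy parameters are assumptions, and the exact SetFlag/WaitFlag template signatures and HardEvent enumerators should be checked against the API reference for your CANN version.

```cpp
// Copy-out writes ub_buf contents to GM; the next copy-in reuses ub_buf.
// Without a sync, the MTE2 copy-in could overwrite ub_buf too early.
AscendC::DataCopy(gm_out, ub_buf, copy_params);             // PIPE_MTE3: UB -> GM
AscendC::SetFlag<AscendC::HardEvent::MTE3_MTE2>(event_id);  // copy-out issued
AscendC::WaitFlag<AscendC::HardEvent::MTE3_MTE2>(event_id); // hold back copy-in
AscendC::DataCopy(ub_buf, gm_in, copy_params);              // PIPE_MTE2: GM -> UB
```

The Set/Wait pair guarantees that the copy-in does not start until the copy-out has finished reading the overlapping UB address range.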

