Basic Architecture
As shown in the following figure, operators developed with Ascend C run on the AI Core. The Ascend C programming model is an abstraction of the AI Core hardware architecture, so understanding the hardware helps you understand the programming model. This is even more important for experienced developers who need to write high-performance code. Much of the content in the operator practice reference builds on this chapter.

The AI Core executes cube- and vector-compute-intensive tasks. It consists of the following components:
- Compute units: include the Cube Unit, Vector Unit, and Scalar Unit.
- Storage units: include the L1 Buffer, L0A Buffer, L0B Buffer, L0C Buffer, Unified Buffer, BiasTable Buffer, and FixPipe Buffer, which are designed for efficient computation.
- Movement units: include the MTE1, MTE2, MTE3, and FixPipe, which are used for efficient data transmission between different storage units.
Take

This section first introduces the key concepts and terms related to the hardware architecture and the working mode of the AI Core, laying a foundation for understanding the subsequent content. Then, using
Key Concepts and Terms
- Core
A compute core with an independent Scalar Unit. The Scalar Unit handles functions such as instruction issue within the core and is also called the in-core scheduling unit.
- AI Core
A compute core of an Ascend AI Processor, which executes cube- and vector-compute-intensive tasks.
- Cube Core
A core dedicated to cube computation. It consists of the Scalar Unit, Cube Unit, and DMA Unit, but does not include the Vector Unit.
- Vector Core
A core dedicated to vector computation. It consists of the Scalar Unit, Vector Unit, and DMA Unit, but does not include the Cube Unit.
- AIC
In the separation mode of the AI Core, a Cube Core in a combination of Cube Cores and Vector Cores.
- AIV
In the separation mode of the AI Core, a Vector Core in a combination of Cube Cores and Vector Cores.
AI Core Working Modes
- Separation mode
A working mode of the AI Core in which the Cube Unit and the Vector Unit each have their own Scalar Unit and are deployed on the Cube Core and the Vector Core, respectively. Cube Cores and Vector Cores are combined in a fixed ratio of 1:N. Each such combination is regarded as one AI Core, and the number of AI Cores equals the number of Cube Cores.
Figure 1 Separation mode diagram (N is subject to the value obtained by the hardware platform information acquisition API.)
- Coupling mode
A working mode of AI Core. The Cube Unit and Vector Unit share the same Scalar Unit, which is deployed on an AI Core.
Figure 2 Coupling mode diagram
The working mode of each product is as follows:
- Atlas inference products: coupling mode
- Atlas training products: coupling mode
- Atlas A2 training products/Atlas A2 inference products: separation mode
- Atlas A3 training products/Atlas A3 inference products: separation mode
- Atlas 200I/500 A2 inference products: coupling mode
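The 1:N relationship in separation mode can be illustrated with a small host-side sketch. The struct and function names below are hypothetical and are not part of any Ascend C API. On a real platform, N comes from the hardware platform information acquisition API rather than from a hard-coded value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a separation-mode platform (illustrative only).
struct SeparationModeInfo {
    std::uint32_t cubeCoreCount;   // number of Cube Cores (AIC)
    std::uint32_t vectorsPerCube;  // N in the 1:N Cube Core to Vector Core ratio
};

// In separation mode, the AI Core count equals the Cube Core count.
inline std::uint32_t AiCoreCount(const SeparationModeInfo& info) {
    return info.cubeCoreCount;
}

// Each Cube Core is combined with N Vector Cores (AIV).
inline std::uint32_t VectorCoreCount(const SeparationModeInfo& info) {
    assert(info.vectorsPerCube > 0);
    return info.cubeCoreCount * info.vectorsPerCube;
}
```

For example, with made-up values of 4 Cube Cores and N = 2, `AiCoreCount` yields 4 and `VectorCoreCount` yields 8.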
Note: For the
Compute Units
The compute units provide the compute capability of the AI Core. They comprise the Cube Unit, Vector Unit, and Scalar Unit, each of which processes a different type of data in the AI Core.
- Cube
The Cube Unit performs cube (matrix) operations. Taking float16 as an example, the Cube Unit can multiply two 16×16 float16 matrices in a single execution. In the following figure, the highlighted boxes show the Cube Unit and the storage units it accesses: L0A stores the left matrix, L0B stores the right matrix, and L0C stores the matrix multiplication results and intermediate results.
Figure 3 Data access of the Cube Unit
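As a rough functional model of one Cube Unit execution, the sketch below multiplies two 16×16 tiles and accumulates into a result tile, mirroring the roles of L0A (left input), L0B (right input), and L0C (result and intermediate result). It is plain host-side C++ and makes several assumptions: `float` stands in for float16, tiles are stored row-major (the actual on-chip layout is not described here), and `CubeTileMmad` is a hypothetical name, not an Ascend C API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTile = 16;               // tile edge length from the text
using Tile = std::array<float, kTile * kTile>;  // row-major; float stands in for float16

// Row-major index into a 16x16 tile.
inline std::size_t At(std::size_t row, std::size_t col) {
    assert(row < kTile && col < kTile);
    return row * kTile + col;
}

// Hypothetical helper: c += a * b for one 16x16 tile.
// a plays the role of L0A, b of L0B, and c of L0C.
void CubeTileMmad(const Tile& a, const Tile& b, Tile& c) {
    for (std::size_t i = 0; i < kTile; ++i) {
        for (std::size_t j = 0; j < kTile; ++j) {
            float acc = c[At(i, j)];  // start from the intermediate result
            for (std::size_t k = 0; k < kTile; ++k) {
                acc += a[At(i, k)] * b[At(k, j)];
            }
            c[At(i, j)] = acc;
        }
    }
}
```

Calling `CubeTileMmad` repeatedly with the same `c` accumulates partial products, which is how a larger matrix multiplication can be built from 16×16 tiles.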
- Vector
The Vector Unit performs vector computation. It executes vector instructions, which are similar to conventional single instruction multiple data (SIMD) instructions: each vector instruction performs the same operation on multiple operands. For example, the Vector Unit can quickly add or multiply two float16 vectors. Vector instructions support multiple iterations and can operate directly on vectors whose data is stored at intervals.
As shown in the following figure, all source and destination data of vector computations must reside in the Unified Buffer. The start address and the operation length of vector instructions must be 32-byte aligned. For details, see the API constraints.
Figure 4 Data access of the Vector Unit
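The 32-byte alignment requirement can be checked or satisfied on the host side when sizing buffers. The following is a minimal sketch in plain C++. The helper names are hypothetical and not part of Ascend C, and the exact constraints of each vector API should still be taken from the API reference.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorAlignBytes = 32;  // alignment stated in the text

// True if an address (or byte offset) is 32-byte aligned.
inline bool IsAligned32(std::uintptr_t addr) {
    return addr % kVectorAlignBytes == 0;
}

// Rounds a byte count up to the next multiple of align (a power of two).
inline std::size_t AlignUp(std::size_t bytes, std::size_t align = kVectorAlignBytes) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (bytes + align - 1) & ~(align - 1);
}

// Number of elements after padding the operation length to 32 bytes.
// elemBytes must divide 32 (for example, 2 for float16 or 4 for float32).
inline std::size_t PaddedElementCount(std::size_t count, std::size_t elemBytes) {
    assert(elemBytes != 0 && kVectorAlignBytes % elemBytes == 0);
    return AlignUp(count * elemBytes) / elemBytes;
}
```

For example, 100 float16 elements occupy 200 bytes, which is not a multiple of 32. Padding to 224 bytes gives 112 elements.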
- Scalar
The Scalar Unit performs scalar computation and controls the program flow. It can be regarded as a mini CPU that handles iteration control, branch judgment, address and parameter computation for Cube and Vector instructions, and basic arithmetic operations. In addition, it can control the pipelines of the other execution units in the AI Core by inserting synchronization instructions through the event synchronization module. Compared with the host CPU, the scalar compute capability of the AI Core is relatively weak, and the Scalar Unit is mainly used to dispatch instructions. Therefore, minimize scalar computation in practice. For example, during performance optimization, reduce branch judgments such as if/else statements and reduce variable computation.
As shown in the following figure, the Scalar Unit executes standard arithmetic logic unit (ALU) instructions when it performs scalar operations. The code segment and data segment (stack space) required by the ALU come from the GM. The instruction cache (iCache) caches code segments, and its size depends on the hardware specifications. With a 16 KB or 32 KB iCache, code segments are loaded in units of 2 KB. The data cache (dCache) caches data segments, and its size also depends on the hardware specifications. With a 16 KB dCache, data segments are loaded in units of cache lines (64 bytes). Because access inside the core is the most efficient, try to keep code segments and data segments cached in the iCache and dCache to avoid access outside the core. You can also adjust the amount of data loaded at a time according to the loading unit to improve loading efficiency. For example, loading data to the dCache is most efficient when the start address of the data is aligned to the cache line size (64 bytes).
Figure 5 Instruction and data access of the Scalar Unit
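To see why a cache-line-aligned start address loads more efficiently, the sketch below counts how many cache lines a contiguous access touches. It is a host-side illustration only. `CacheLinesTouched` is a hypothetical helper, and the default 64-byte line size is the dCache value quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Counts the cache lines covered by the byte range [start, start + bytes).
// lineBytes must be a power of two (64 for the dCache case described above).
inline std::size_t CacheLinesTouched(std::uintptr_t start, std::size_t bytes,
                                     std::size_t lineBytes = 64) {
    assert(lineBytes != 0 && (lineBytes & (lineBytes - 1)) == 0);
    if (bytes == 0) {
        return 0;
    }
    const std::uintptr_t firstLine = start / lineBytes;
    const std::uintptr_t lastLine = (start + bytes - 1) / lineBytes;
    return static_cast<std::size_t>(lastLine - firstLine + 1);
}
```

A 64-byte read starting at offset 0 touches one line, while the same read starting at offset 32 straddles two lines and therefore needs two line loads.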
The hardware provides an L2 cache for data (including code segments and data segments) accessed from the GM, which speeds up access and improves access efficiency. The L2 cache, which is outside the core, loads data in units of cache lines. The cache line size (128, 256, or 512 bytes) depends on the hardware specifications.
Storage Units and DMA Units
To fully exploit the compute capability of the AI processor, input data must be fed to the compute units promptly and accurately. The memory system is therefore carefully designed to keep the compute units supplied with data.
As shown in the following figure, an AI Core includes multiple levels of local memory (internal storage) and loads data from the global memory (external storage) to the local memory for computation. The local memory of an AI Core includes the L1 Buffer, L0 Buffer, Unified Buffer, and so on. An AI Core also contains memory transfer engines (MTEs), which move and copy data and can convert the data format or type during movement.
For details about the internal storage units and DMA units, see Table 1 and Table 2.
Table 1 Storage units

| Storage Unit | Description |
|---|---|
| L1 Buffer | As a general part of the local memory, the L1 Buffer offers a large data buffer in the AI Core. It can temporarily store data that the Cube Unit needs to reuse, reducing how often data is read from or written to the bus. |
| L0A Buffer/L0B Buffer | They store the inputs of Cube instructions. |
| L0C Buffer | It stores the outputs of Cube instructions. During accumulation, it also serves as part of the inputs. |
| Unified Buffer | The Unified Buffer stores the inputs and outputs of vector and scalar computations. |
| BT Buffer | The BiasTable (BT) Buffer stores the bias in cube computations. |
| FP Buffer | The FixPipe (FP) Buffer stores quantization parameters and ReLU parameters. |

Table 2 DMA units

| DMA Unit | Description |
|---|---|
| MTE1 | Moves data in the following paths: |
| MTE2 | Moves data in the following paths: |
| MTE3 | Moves data in the following paths: |
| FixPipe | Moves data in the following paths. (The data format or type can be converted during data movement.) |

- The size of a storage unit varies with the AI processor type. You can call GetCoreMemSize to get the size.
- By default, all data read from or written to the GM through the DMA units is cached in the L2 cache, which speeds up access and improves access efficiency. The L2 cache, which is outside the core, loads data in units of cache lines. The cache line size (128, 256, or 512 bytes) depends on the hardware specifications.
Typical Data Flows
Typical Instruction Flows
Instructions enter the iCache from the system memory through the bus. What happens next depends on the instruction type:
- Scalar instructions are executed immediately by the Scalar Unit.
- Other instructions are dispatched by the Scalar Unit to independent queues (the vector instruction queue, the cube instruction queue, and the MTE1/MTE2/MTE3 instruction queues) and then executed by the corresponding execution units.

Synchronization between instructions is controlled as follows:
- PipeBarrier is an instruction that controls the execution order within a queue. Instructions in a queue are executed in sequence, but this does not mean that the previous instruction has finished when the next one starts. PipeBarrier ensures that all data read and write operations of the preceding instructions are completed before the next instruction is executed.
- SetFlag and WaitFlag are a pair of inter-queue synchronization instructions.
  - SetFlag: This instruction starts to be executed after all read and write operations of the preceding instructions are completed, and it sets the corresponding flag bit in hardware to 1.
  - WaitFlag: When this instruction is executed, if the corresponding flag bit is 0, the subsequent instructions in the queue are blocked. If the flag bit is 1, it is reset to 0 and the subsequent instructions are then executed.
Ascend C provides synchronization control APIs for this purpose. In general, you do not need to handle synchronization yourself when you follow the programming model and paradigm described in Programming Model, because the programming model implements synchronization control for you. Using the programming model and paradigm is recommended, because manual synchronization control complicates programming.
Nevertheless, understanding the basic principles of synchronization helps you understand and design parallel programs. In a few cases, you must insert synchronization manually. For details, see Requirements on Enabling Manual Synchronization Control.
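The SetFlag/WaitFlag semantics described above can be explored with a small single-threaded model. The sketch below is a host-side simulation in plain C++, not Ascend C code. The `Instr` type, the `RunQueues` function, and the flag numbering are all hypothetical. The model also simplifies the hardware: every instruction completes before the next one in the same queue starts, so the overlap that PipeBarrier guards against is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Op { Work, SetFlag, WaitFlag };

// One queue entry. name is used by Work; flag is used by SetFlag/WaitFlag.
struct Instr {
    Op op;
    std::string name;
    std::size_t flag;
};

// Steps every in-order queue round-robin, one instruction per queue per pass,
// and returns the names of Work instructions in the order they executed.
// Stops when all queues are drained or no queue can make progress.
std::vector<std::string> RunQueues(const std::vector<std::vector<Instr>>& queues,
                                   std::size_t flagCount) {
    std::vector<int> flags(flagCount, 0);
    std::vector<std::size_t> next(queues.size(), 0);
    std::vector<std::string> trace;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t q = 0; q < queues.size(); ++q) {
            if (next[q] >= queues[q].size()) continue;  // queue drained
            const Instr& in = queues[q][next[q]];
            if (in.op == Op::WaitFlag) {
                assert(in.flag < flagCount);
                if (flags[in.flag] == 0) continue;  // flag is 0: queue stays blocked
                flags[in.flag] = 0;                 // flag is 1: reset it, then proceed
            } else if (in.op == Op::SetFlag) {
                assert(in.flag < flagCount);
                flags[in.flag] = 1;                 // earlier work in this queue is done
            } else {
                trace.push_back(in.name);
            }
            ++next[q];
            progressed = true;
        }
    }
    return trace;
}
```

For instance, suppose a vector queue holds `WaitFlag(0)` followed by an `add`, and an MTE queue holds a `copy_in` followed by `SetFlag(0)`. Even when the vector queue is stepped first, the trace is `copy_in` then `add`, because the vector queue stays blocked until the flag is set.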

