AI Core
Operators developed using TBE run on AI Cores. This topic describes the basics of the AI Core architecture.
Overall Architecture
Unlike conventional CPUs and GPUs, which target general-purpose computing, and application-specific integrated circuits (ASICs), which are dedicated to specific algorithms, the AI Core architecture targets the common applications and algorithms of specific fields and is therefore referred to as a domain-specific architecture (DSA).
The AI Core, the computing core of the Ascend AI Processor, can be regarded as a simplified version of a modern microprocessor architecture. It includes three basic compute units: the Cube Unit, Vector Unit, and Scalar Unit. Each compute unit plays a different role, forming three independent pipelines that work together under the unified scheduling of system software to achieve optimal compute efficiency. In addition, the Cube Unit and Vector Unit support computation in multiple precisions and data types.

The AI Core consists of compute units, storage units, and control units.
Compute Units
Compute units are the core units that provide powerful compute capabilities in the AI Core. The compute units include the Cube Unit, Vector Unit, and Scalar Unit, which are used to compute different types of data in the AI Core.
| Compute Unit | Description |
|---|---|
| Cube | Performs matrix computation. Each execution of the Cube completes one multiplication of a 16 x 16 matrix by a 16 x 16 matrix in fp16, for example, C = A x B. With int8 input, a 16 x 32 by 32 x 16 matrix multiplication is completed at once. A is read from L0A, B is read from L0B, and the result and intermediate results are stored in L0C. |
| Vector | Performs vector computation. Compared with the Cube Unit, the Vector Unit offers less compute power but more flexible computation (such as reciprocal and square root). All source and destination data for vector computation must be stored in the Unified Buffer and be 32-byte aligned. |
| Scalar | Performs scalar computation and controls the program flow. It can be regarded as a mini CPU that implements loop control, branching, address and parameter computation for Cube/Vector instructions, and basic arithmetic operations. |
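As a reference for the Cube step described above, the following pure-Python sketch multiplies a 16 x 16 matrix by a 16 x 16 matrix and accumulates into an output matrix, mirroring how L0C holds partial results. It is illustrative only; the real unit completes this in a single hardware instruction.

```python
# Illustrative sketch of one Cube step: C = A x B with A and B both 16 x 16
# (the fp16 case). The hardware reads A from L0A, B from L0B, and writes C
# (and intermediate accumulation results) to L0C.
M = K = N = 16

def cube_step(A, B, C_acc=None):
    """One Cube matrix multiply, optionally accumulating into C (as L0C does)."""
    C = [[0.0] * N for _ in range(M)] if C_acc is None else C_acc
    for i in range(M):
        for k in range(K):
            a = A[i][k]
            for j in range(N):
                C[i][j] += a * B[k][j]
    return C

A = [[1.0] * K for _ in range(M)]
B = [[1.0] * N for _ in range(K)]
C = cube_step(A, B)
print(C[0][0])              # each element is the sum of 16 products -> 16.0
C = cube_step(A, B, C)      # second step accumulates into the same output
print(C[0][0])              # -> 32.0
```

The accumulation path is why L0C stores both the output and part of the input: a large matrix multiply is tiled into many 16 x 16 Cube steps whose partial sums accumulate in place.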
Storage Units
The AI Core loads external data to the internal storage for computation. The internal storage of the AI Core includes the L1 Buffer, L0 Buffer, Unified Buffer, and Scalar Buffer.
To facilitate data transfer and movement in the AI Core, a bus interface unit (BIU) and memory transfer engines MTE1, MTE2, and MTE3 are provided. The BIU offers an interface for interaction between AI Core and the bus, while MTEs move data between different buffers.
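To illustrate the format conversion the MTEs perform, here is a minimal Img2Col sketch in plain Python. This is a simplified 2D version for intuition only; the hardware operates on 3D feature maps and performs the conversion during data movement.

```python
# Hedged illustration of Img2Col: unfold an image into rows so that
# convolution becomes a matrix multiply the Cube unit can consume.
def img2col(img, kh, kw):
    """img: 2D list; returns one row per sliding-window position."""
    h, w = len(img), len(img[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([img[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
cols = img2col(img, 2, 2)   # 4 window positions, each flattened to 4 values
print(cols[0])              # -> [1, 2, 4, 5]
```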
The size of a storage unit depends on the Ascend AI Processor version. You can obtain the size of a storage unit by using the get_soc_spec call.
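For example, the Unified Buffer and L1 Buffer sizes can be queried as follows. The exact import path depends on the installed CANN/TBE version, so the path below is an assumption; the stub fallback (with made-up sizes) only keeps the snippet self-contained when the toolkit is absent.

```python
# Hedged sketch: query buffer sizes with get_soc_spec. The module path
# tbe.common.platform is an assumption (it varies across CANN versions);
# the stub values below are placeholders, not real hardware sizes.
try:
    from tbe.common.platform import get_soc_spec  # real toolkit, if installed
except ImportError:
    _STUB = {"UB_SIZE": 262144, "L1_SIZE": 1048576}  # placeholder bytes
    def get_soc_spec(key):
        return _STUB[key]

ub_bytes = get_soc_spec("UB_SIZE")   # Unified Buffer size in bytes
l1_bytes = get_soc_spec("L1_SIZE")   # L1 Buffer size in bytes
print(ub_bytes, l1_bytes)
```

Querying sizes at schedule time rather than hard-coding them keeps an operator portable across Ascend AI Processor versions.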
Table 2 shows the internal storage list of the AI Core.
| Storage Unit | Description |
|---|---|
| MTE | The AI Core provides multiple MTEs, which manage reads and writes of internal data between different buffers, convert formats, and perform other operations such as padding, transposing, and Img2Col. |
| BIU | As the gate of the AI Core, the BIU handles interaction between the AI Core and the bus. It is the interface through which the AI Core reads data from and writes data to external storage; it converts the AI Core's read and write requests into bus requests and completes the protocol interaction. |
| L1 Buffer | As general internal storage, the L1 Buffer offers a large data buffer in the AI Core. It can temporarily store data that is repeatedly used in the AI Core, reducing the frequency at which data is read from or written to the bus. For the format conversion function of some MTEs, the source data must be stored in the L1 Buffer, for example, during the Img2Col operation where a 3D image is converted into a 2D matrix. |
| L0A Buffer / L0B Buffer | Store the inputs of Cube instructions. |
| L0C Buffer | Stores the output of Cube instructions, which also serves as part of the input for accumulation. |
| Unified Buffer | Stores the inputs and outputs of Vector and Scalar computation. |
| Scalar Buffer | General buffer for scalar computation, working with the general-purpose registers (GPRs) as the secondary storage unit. |
| GPR | General-purpose registers store the inputs and outputs of scalar computation. You do not need to pay attention to these registers, which are encapsulated by the system. When a program accesses the Scalar Buffer and performs scalar computation, the system automatically synchronizes the Scalar Buffer with the GPRs. |
| SPR | Special-purpose registers (SPRs) are a group of configuration registers of the AI Core. You can adjust part of the AI Core behavior by modifying SPR content. |
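Because all Vector source and destination data in the Unified Buffer must be 32-byte aligned, buffer footprints are typically rounded up to a multiple of 32 bytes. A minimal helper sketch (the names are illustrative, not part of any TBE API):

```python
# Round a tensor's byte footprint up to the 32-byte alignment the
# Unified Buffer requires for Vector source/destination data.
BLOCK = 32  # bytes per alignment unit

def align_up(nbytes, block=BLOCK):
    """Smallest multiple of `block` that is >= nbytes."""
    return (nbytes + block - 1) // block * block

# An fp16 tensor of 100 elements occupies 200 bytes -> padded to 224 bytes.
print(align_up(100 * 2))   # -> 224
```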
In this document, the storage units of AI Core are classified into the following types:
- Cache: invisible to programmers. When an instruction accesses a lower-level storage unit, the cache can hold a copy of the data to accelerate subsequent accesses.
- Buffer: visible to programmers and used to temporarily store data during vector or scalar computation.
- Register: visible to programmers and usually used for scalar computation.
The AI Core storage units can be accessed only by using specific instructions. Figure 2 shows the relationships between the storage units and the instructions exposed to developers.
The storage units in the preceding figure are software-level abstractions.
- The Scalar Buffer corresponds to the hardware-level Scalar Buffer.
- The Unified Buffer corresponds to the hardware-level Unified Buffer.
- The L1 Buffer corresponds to the hardware-level L1 Buffer.
- The L1Out Buffer is a storage unit abstracted from L0C for storing the output of the Cube Unit.
Currently, only vector operator development is supported, because high-performance Cube operators are difficult to develop.
Control Units
Control units provide instruction control for the entire computing process and are responsible for running the AI Core. Table 3 lists the control units of the AI Core. System Control is responsible for the operation of the AI Core, parameter configuration, and power consumption control. After instructions are issued in order by the Instruction Dispatch module, they are sent to the Cube Queue, Vector Queue, or MTE Queue by type.
A control unit prefetches subsequent instructions during instruction execution and reads multiple instructions into the cache at a time, improving instruction execution efficiency. Multiple instructions are transmitted from the system memory to the instruction cache module (Instruction Cache) of the AI Core through the BIU, waiting to be decoded or computed by hardware. After being decoded, instructions are imported to the scalar queue to implement address decoding and operation control.
| Control Unit | Description |
|---|---|
| System Control | Externally controls the Task Scheduler and sets parameters for AI Core initialization, for example, configuring the PC, Para_base, and BlockID information. It also controls block execution, reports the interrupt and status after block execution, and reports execution errors. |
| Instruction Cache | Internal instruction cache (I Cache) of the AI Core, which enables instruction prefetch. |
| Scalar PSQ | Holds scalar instructions for processing. |
| Instruction Dispatch | After the Scalar PSQ has processed Cube, Vector, and MTE instructions and configured elements such as addresses and parameters, Instruction Dispatch sends the instructions to the corresponding instruction pipes by type, where they wait for the corresponding execution units to schedule and execute them. |
| Cube Queue | Cube pipe. Instructions in the same pipe execute in sequence; instructions in different pipes can execute in parallel. |
| Vector Queue | Vector pipe. Instructions in the same pipe execute in sequence; instructions in different pipes can execute in parallel. |
| MTE Queue | MTE pipe. Instructions in the same pipe execute in sequence; instructions in different pipes can execute in parallel. |
| Event Sync | Controls the dependency and synchronization between instructions across pipes. |
Pipes
The AI Core reads instructions in order and executes them in parallel, as shown in the following figure.

The instructions are decoded in order and executed in either of the following modes:
- A Scalar instruction is executed immediately.
- Any other instruction is scheduled to the corresponding pipe and then allocated to an idle compute unit for execution.
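The decode-and-dispatch rule above can be sketched as a toy model. The pipe names come from this section; everything else (function names, instruction encoding) is illustrative.

```python
# Toy dispatcher sketch: Scalar instructions execute immediately in
# program order; all other instructions are routed to their pipe's queue,
# where they later run in parallel with other pipes.
from collections import deque

PIPES = {name: deque() for name in ("V", "M", "MTE1", "MTE2", "MTE3")}

def dispatch(pipe, instr):
    if pipe == "S":
        return f"executed {instr}"          # Scalar: run at once, in order
    PIPES[pipe].append(instr)               # others: queue on their pipe
    return f"queued {instr} on {pipe}"

print(dispatch("S", "add_s"))               # -> executed add_s
print(dispatch("V", "vadd"))                # -> queued vadd on V
print(dispatch("MTE2", "copy_gm_to_ub"))    # -> queued copy_gm_to_ub on MTE2
```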
The following table lists the pipes.
Table 4 Pipes

| Pipe Abbreviation | Pipe Name | Description |
|---|---|---|
| V | Vector pipe | Schedules Vector instructions. |
| M | Matrix pipe | Schedules Cube instructions. |
| MTE1 | MTE pipe 1 | Schedules MTE instructions of the following types: L1 to L0A/L0B/UB, or L0A/L0B Buffer initialization using the SPR. |
| MTE2 | MTE pipe 2 | Schedules MTE instructions of the following types: GM to L1/L0A/L0B/UB. |
| MTE3 | MTE pipe 3 | Schedules MTE instructions of the following types: UB to GM. |
Instructions are dispatched to different pipes. Counting the S pipe (for Scalar instructions), an AI Core has six pipes: S, V, M, MTE1, MTE2, and MTE3.
Except for the S pipe, instructions in different pipes can be executed out of order relative to each other, while instructions within a pipe are executed in order. That is, as long as data dependencies are met, instructions may execute regardless of the programmed order.
Hardware dispatches the instructions to different pipes based on the delivery sequence. The Ascend AI Processor provides the Barrier and set_flag/wait_flag instructions for intra- and inter-pipe synchronization.
- Barrier synchronizes instructions within the same pipe. Instructions after the barrier cannot issue until all instructions before the barrier have committed.
- set_flag and wait_flag are a pair of inter-pipe synchronization instructions.
  - set_flag: After all read and write operations of the current instruction are complete, the corresponding flag bit in hardware is set to 1.
  - wait_flag: When this instruction is executed, if the corresponding flag bit is 0, subsequent instructions in the pipe are blocked; if the flag bit is 1, it is cleared to 0 and subsequent instructions are executed.
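A minimal sketch of the flag semantics, modeling the hardware flag bit as shared state between a producer pipe and a consumer pipe. The pipe and event names are illustrative, and real hardware stalls the pipe instead of returning a boolean.

```python
# Not hardware-accurate: a toy model of set_flag/wait_flag between pipes.
# Flags are indexed by (source pipe, destination pipe, event id).
flags = {("MTE2", "V", 0): 0}

def set_flag(src, dst, event_id):
    # Producer pipe raises the flag once its reads/writes have committed.
    flags[(src, dst, event_id)] = 1

def wait_flag(src, dst, event_id):
    # Consumer pipe blocks while the flag is 0; on 1, clears it and proceeds.
    if flags[(src, dst, event_id)] == 0:
        return False          # would stall the pipe
    flags[(src, dst, event_id)] = 0
    return True               # dependency satisfied, continue

# A copy on MTE2 must finish before a Vector instruction consumes the data:
print(wait_flag("MTE2", "V", 0))   # -> False (data not ready yet: stall)
set_flag("MTE2", "V", 0)           # MTE2 commits and raises the flag
print(wait_flag("MTE2", "V", 0))   # -> True (Vector proceeds; flag cleared)
```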
Because TBE encapsulates this dependency handling, you do not need to program Barrier, set_flag, or wait_flag yourself. However, you still need to understand this basic principle to achieve better synchronization through proper code scheduling. DSL provides the Auto Schedule mechanism, freeing you from the hassles associated with code scheduling.
