AI Core
Operators developed using TBE run on AI Cores. This topic describes the basics of the AI Core architecture.
Overall Architecture
Unlike conventional CPUs and GPUs, which target general-purpose computing, and application-specific integrated circuits (ASICs), which are dedicated to specific algorithms, the AI Core architecture targets the common applications and algorithms of specific fields and is therefore referred to as a domain-specific architecture (DSA).
The AI Core, the computing core of the Ascend AI Processor, can be regarded as a simplified version of a modern microprocessor architecture. It includes three basic compute units: the Cube Unit, Vector Unit, and Scalar Unit. Each compute unit plays a different role, forming three independent pipelines that work together under the unified scheduling of system software to achieve optimal compute efficiency. In addition, the Cube Unit and Vector Unit support computation in multiple precisions and data types.

The AI Core consists of compute units, storage units, and control units.
Compute Units
Compute units are the core units that provide powerful compute capabilities in the AI Core. The compute units include the Cube Unit, Vector Unit, and Scalar Unit, which are used to compute different types of data in the AI Core.
| Compute Unit | Description |
|---|---|
| Cube | Performs matrix computation. Each execution of the Cube completes one multiplication of a 16 x 16 matrix by a 16 x 16 matrix in fp16, for example, C = A x B. With int8 input, a 16 x 32 by 32 x 16 matrix multiplication is completed at once. A is read from L0A, B is read from L0B, and the result and intermediate results are stored in L0C. |
| Vector | Performs vector computation. Compared with the Cube Unit, the Vector Unit offers less compute power but more flexible computation (such as reciprocal and square root). All source and destination data for vector computation must be stored in the Unified Buffer and be 32-byte aligned. |
| Scalar | Performs scalar computation and controls the program flow. It can be regarded as a mini CPU that implements loop control, branching, address and parameter computation for Cube/Vector instructions, and basic arithmetic operations. |
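As a reference for the Cube step described above, the following pure-Python sketch multiplies a 16 x 16 matrix by a 16 x 16 matrix and accumulates into an output matrix, mirroring how L0C holds partial results. It is illustrative only; the real unit completes this in a single hardware instruction.

```python
# Illustrative sketch of one Cube step: C = A x B with A and B both 16 x 16
# (the fp16 case). The hardware reads A from L0A, B from L0B, and writes C
# (and intermediate accumulation results) to L0C.
M = K = N = 16

def cube_step(A, B, C_acc=None):
    """One Cube matrix multiply, optionally accumulating into C (as L0C does)."""
    C = [[0.0] * N for _ in range(M)] if C_acc is None else C_acc
    for i in range(M):
        for k in range(K):
            a = A[i][k]
            for j in range(N):
                C[i][j] += a * B[k][j]
    return C

A = [[1.0] * K for _ in range(M)]
B = [[1.0] * N for _ in range(K)]
C = cube_step(A, B)
print(C[0][0])              # each element is the sum of 16 products -> 16.0
C = cube_step(A, B, C)      # second step accumulates into the same output
print(C[0][0])              # -> 32.0
```

The accumulation path is why L0C stores both the output and part of the input: a large matrix multiply is tiled into many 16 x 16 Cube steps whose partial sums accumulate in place.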
Storage Units
The AI Core loads external data to the internal storage for computation. The internal storage of the AI Core includes the L1 Buffer, L0 Buffer, Unified Buffer, and Scalar Buffer.
To facilitate data transfer and movement in the AI Core, a bus interface unit (BIU) and memory transfer engines MTE1, MTE2, and MTE3 are provided. The BIU offers an interface for interaction between AI Core and the bus, while MTEs move data between different buffers.
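To illustrate the format conversion the MTEs perform, here is a minimal Img2Col sketch in plain Python. This is a simplified 2D version for intuition only; the hardware operates on 3D feature maps and performs the conversion during data movement.

```python
# Hedged illustration of Img2Col: unfold an image into rows so that
# convolution becomes a matrix multiply the Cube unit can consume.
def img2col(img, kh, kw):
    """img: 2D list; returns one row per sliding-window position."""
    h, w = len(img), len(img[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([img[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
cols = img2col(img, 2, 2)   # 4 window positions, each flattened to 4 values
print(cols[0])              # -> [1, 2, 4, 5]
```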
The size of a storage unit depends on the Ascend AI Processor version. You can obtain the size of a storage unit by using the get_soc_spec call.
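For example, the Unified Buffer and L1 Buffer sizes can be queried as follows. The exact import path depends on the installed CANN/TBE version, so the path below is an assumption; the stub fallback (with made-up sizes) only keeps the snippet self-contained when the toolkit is absent.

```python
# Hedged sketch: query buffer sizes with get_soc_spec. The module path
# tbe.common.platform is an assumption (it varies across CANN versions);
# the stub values below are placeholders, not real hardware sizes.
try:
    from tbe.common.platform import get_soc_spec  # real toolkit, if installed
except ImportError:
    _STUB = {"UB_SIZE": 262144, "L1_SIZE": 1048576}  # placeholder bytes
    def get_soc_spec(key):
        return _STUB[key]

ub_bytes = get_soc_spec("UB_SIZE")   # Unified Buffer size in bytes
l1_bytes = get_soc_spec("L1_SIZE")   # L1 Buffer size in bytes
print(ub_bytes, l1_bytes)
```

Querying sizes at schedule time rather than hard-coding them keeps an operator portable across Ascend AI Processor versions.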
Table 2 shows the internal storage list of the AI Core.
| Storage Unit | Description |
|---|---|
| MTE | The AI Core provides multiple MTEs, which manage reads and writes of internal data between different buffers, convert formats, and perform other operations such as padding, transposing, and Img2Col. |
| BIU | As the gate of the AI Core, the BIU handles interaction between the AI Core and the bus. It is the interface through which the AI Core reads data from and writes data to external storage; it converts the AI Core's read and write requests into bus requests and completes the protocol interaction. |
| L1 Buffer | As general internal storage, the L1 Buffer offers a large data buffer in the AI Core. It can temporarily store data that is repeatedly used in the AI Core, reducing the frequency at which data is read from or written to the bus. For the format conversion function of some MTEs, the source data must be stored in the L1 Buffer, for example, during the Img2Col operation where a 3D image is converted into a 2D matrix. |
| L0A Buffer / L0B Buffer | Store the inputs of Cube instructions. |
| L0C Buffer | Stores the output of Cube instructions, which also serves as part of the input for accumulation. |
| Unified Buffer | Stores the inputs and outputs of Vector and Scalar computation. |
| Scalar Buffer | General buffer for scalar computation, working with the general-purpose registers (GPRs) as the secondary storage unit. |
| GPR | General-purpose registers store the inputs and outputs of scalar computation. You do not need to pay attention to these registers, which are encapsulated by the system. When a program accesses the Scalar Buffer and performs scalar computation, the system automatically synchronizes the Scalar Buffer with the GPRs. |
| SPR | Special-purpose registers (SPRs) are a group of configuration registers of the AI Core. You can adjust part of the AI Core behavior by modifying SPR content. |
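Because all Vector source and destination data in the Unified Buffer must be 32-byte aligned, buffer footprints are typically rounded up to a multiple of 32 bytes. A minimal helper sketch (the names are illustrative, not part of any TBE API):

```python
# Round a tensor's byte footprint up to the 32-byte alignment the
# Unified Buffer requires for Vector source/destination data.
BLOCK = 32  # bytes per alignment unit

def align_up(nbytes, block=BLOCK):
    """Smallest multiple of `block` that is >= nbytes."""
    return (nbytes + block - 1) // block * block

# An fp16 tensor of 100 elements occupies 200 bytes -> padded to 224 bytes.
print(align_up(100 * 2))   # -> 224
```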
In this document, the storage units of AI Core are classified into the following types:
- Cache: invisible to programmers. When an instruction accesses a lower-level storage unit, the cache can hold a copy of the data to accelerate subsequent accesses.
- Buffer: visible to programmers and used to temporarily store data during vector or scalar computation.
- Register: visible to programmers and usually used for scalar computation.
The AI Core storage units can be accessed only by using specific instructions. Figure 2 shows the relationships between the storage units and the instructions exposed to developers.
The storage units in the preceding figure are software-level abstractions.
- The Scalar Buffer corresponds to the hardware-level Scalar Buffer.
- The Unified Buffer corresponds to the hardware-level Unified Buffer.
- The L1 Buffer corresponds to the hardware-level L1 Buffer.
- The L1Out Buffer is a storage unit abstracted from L0C for storing the output of the Cube Unit.
Currently, only vector operator development is supported, because high-performance Cube operators are difficult to develop.
Control Units
Control units provide instruction control for the entire computing process and are responsible for running the AI Core. Table 3 lists the control units of the AI Core. System Control is responsible for the operation of the AI Core, parameter configuration, and power consumption control. After instructions are issued in order by the Instruction Dispatch module, they are sent to the Cube Queue, Vector Queue, or MTE Queue by type.
A control unit prefetches subsequent instructions during instruction execution and reads multiple instructions into the cache at a time, improving instruction execution efficiency. Multiple instructions are transmitted from the system memory to the instruction cache module (Instruction Cache) of the AI Core through the BIU, waiting to be decoded or computed by hardware. After being decoded, instructions are imported to the scalar queue to implement address decoding and operation control.
| Control Unit | Description |
|---|---|
| System Control | Externally controls the Task Scheduler and sets parameters for AI Core initialization, for example, configuring the PC, Para_base, and BlockID information. It also controls block execution, reports the interrupt and status after block execution, and reports execution errors. |
| Instruction Cache | Internal instruction cache (I Cache) of the AI Core, which enables instruction prefetch. |
| Scalar PSQ | Holds scalar instructions for processing. |
| Instruction Dispatch | After the Scalar PSQ has processed Cube, Vector, and MTE instructions and configured elements such as addresses and parameters, Instruction Dispatch sends the instructions to the corresponding instruction pipes by type, where they wait for the corresponding execution units to schedule and execute them. |
| Cube Queue | Cube pipe. Instructions in the same pipe execute in sequence; instructions in different pipes can execute in parallel. |
| Vector Queue | Vector pipe. Instructions in the same pipe execute in sequence; instructions in different pipes can execute in parallel. |
| MTE Queue | MTE pipe. Instructions in the same pipe execute in sequence; instructions in different pipes can execute in parallel. |
| Event Sync | Controls the dependency and synchronization between instructions across pipes. |
Pipes
The AI Core reads instructions in order and executes them in parallel, as shown in the following figure.

The instructions are decoded in order and executed in either of the following modes:
- A Scalar instruction is executed immediately.
- Any other instruction is scheduled to the corresponding pipe and then allocated to an idle compute unit for execution.
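The decode-and-dispatch rule above can be sketched as a toy model. The pipe names come from this section; everything else (function names, instruction encoding) is illustrative.

```python
# Toy dispatcher sketch: Scalar instructions execute immediately in
# program order; all other instructions are routed to their pipe's queue,
# where they later run in parallel with other pipes.
from collections import deque

PIPES = {name: deque() for name in ("V", "M", "MTE1", "MTE2", "MTE3")}

def dispatch(pipe, instr):
    if pipe == "S":
        return f"executed {instr}"          # Scalar: run at once, in order
    PIPES[pipe].append(instr)               # others: queue on their pipe
    return f"queued {instr} on {pipe}"

print(dispatch("S", "add_s"))               # -> executed add_s
print(dispatch("V", "vadd"))                # -> queued vadd on V
print(dispatch("MTE2", "copy_gm_to_ub"))    # -> queued copy_gm_to_ub on MTE2
```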
The following table lists the pipes.
Table 4 Pipes

| Pipe Abbreviation | Pipe Name | Description |
|---|---|---|
| V | Vector pipe | Schedules Vector instructions. |
| M | Matrix pipe | Schedules Cube instructions. |
| MTE1 | MTE pipe 1 | Schedules MTE instructions of the following types: L1 to L0A/L0B/UB, or L0A/L0B Buffer initialization using the SPR. |
| MTE2 | MTE pipe 2 | Schedules MTE instructions of the following types: GM to L1/L0A/L0B/UB. |
| MTE3 | MTE pipe 3 | Schedules MTE instructions of the following types: UB to GM. |
Instructions are dispatched to different pipes. Counting the S pipe (for Scalar instructions), an AI Core has six pipes: S, V, M, MTE1, MTE2, and MTE3.
Except for the S pipe, instructions in different pipes can be executed out of order relative to each other, while instructions within a pipe are executed in order. That is, as long as data dependencies are met, instructions may execute regardless of the programmed order.
Hardware dispatches the instructions to different pipes based on the delivery sequence. The Ascend AI Processor provides the Barrier and set_flag/wait_flag instructions for intra- and inter-pipe synchronization.
- Barrier synchronizes instructions within the same pipe. Instructions after the barrier cannot issue until all instructions before the barrier have committed.
- set_flag and wait_flag are a pair of inter-pipe synchronization instructions.
  - set_flag: After all read and write operations of the current instruction are complete, the corresponding flag bit in hardware is set to 1.
  - wait_flag: When this instruction is executed, if the corresponding flag bit is 0, subsequent instructions in the pipe are blocked; if the flag bit is 1, it is cleared to 0 and subsequent instructions are executed.
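A minimal sketch of the flag semantics, modeling the hardware flag bit as shared state between a producer pipe and a consumer pipe. The pipe and event names are illustrative, and real hardware stalls the pipe instead of returning a boolean.

```python
# Not hardware-accurate: a toy model of set_flag/wait_flag between pipes.
# Flags are indexed by (source pipe, destination pipe, event id).
flags = {("MTE2", "V", 0): 0}

def set_flag(src, dst, event_id):
    # Producer pipe raises the flag once its reads/writes have committed.
    flags[(src, dst, event_id)] = 1

def wait_flag(src, dst, event_id):
    # Consumer pipe blocks while the flag is 0; on 1, clears it and proceeds.
    if flags[(src, dst, event_id)] == 0:
        return False          # would stall the pipe
    flags[(src, dst, event_id)] = 0
    return True               # dependency satisfied, continue

# A copy on MTE2 must finish before a Vector instruction consumes the data:
print(wait_flag("MTE2", "V", 0))   # -> False (data not ready yet: stall)
set_flag("MTE2", "V", 0)           # MTE2 commits and raises the flag
print(wait_flag("MTE2", "V", 0))   # -> True (Vector proceeds; flag cleared)
```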
Because TBE encapsulates this dependency handling, you do not need to program Barrier, set_flag, or wait_flag yourself. However, you still need to understand this basic principle to achieve better synchronization through proper code scheduling. DSL provides the Auto Schedule mechanism, freeing you from the hassles associated with code scheduling.
