Architecture

As illustrated in the following figure, operators developed with Ascend C run on the AI Core, so beginners need a basic understanding of the hardware architecture. The programming model introduced later is an abstraction of this hardware architecture, and understanding the architecture will help you better understand the programming model. For experienced developers who need to write high-performance code, learning the hardware architecture is even more important: this section provides the basis for many of the best practices introduced in this document.

The AI Core executes compute-intensive operators involving scalars, vectors, and tensors. An AI Core contains compute units (the Cube Unit, Vector Unit, and Scalar Unit), storage units (hardware storage and the DMA units that move data between storage levels), and control units. The hardware architecture is classified as coupled or separated depending on whether the Cube Unit and Vector Unit are deployed on the same core. Product models and their processor architectures:

  • Atlas Training Series Product: coupled architecture

Coupled Architecture

In the coupled architecture, the Cube Unit and Vector Unit are deployed on the same core, as shown in the following architecture diagram. The diagram illustrates the storage units and compute units in the core; arrows indicate the data processing flow. MTE1, MTE2, and MTE3 are the DMA units for data transfer (MTE stands for Memory Transfer Engine).

The dotted arrow in the figure indicates the following:

  • For Atlas Training Series Product: the Scalar Unit cannot directly read from or write to the GM.

Separated Architecture

In the separated architecture, an AI Core is split into two independent cores: the AI Cube (AIC) and the AI Vector (AIV). Each has its own Scalar Unit and can independently load code segments. In this way, matrix computation and vector computation are decoupled, and compute efficiency is improved under unified scheduling by the system software. Data flows between AIVs and AICs through the Global Memory. Compared with the coupled architecture, two buffers are added: the BT Buffer (BT stands for BiasTable; it stores the bias) and the FP Buffer (FP stands for FixPipe; it stores quantization parameters and ReLU parameters).
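As a loose host-side analogy (not Ascend C code), the GM-mediated hand-off between an AIC and an AIV can be pictured as two threads sharing a buffer: one produces matrix results into "GM", the other consumes them. All names below are illustrative assumptions; on real hardware the hand-off is coordinated by system software and hardware synchronization, not by C++ atomics.

```cpp
#include <atomic>
#include <thread>

// Illustrative analogy only: two threads stand in for an AIC and an AIV,
// a shared array stands in for Global Memory (GM), and an atomic flag
// stands in for inter-core synchronization.
constexpr int kLen = 4;

void run_pipeline(int (&gm)[kLen]) {
    std::atomic<bool> cube_done{false};

    std::thread aic([&] {                    // "AIC": matrix compute
        for (int i = 0; i < kLen; ++i)
            gm[i] = i * i;                   // write result to GM
        cube_done.store(true, std::memory_order_release);
    });
    std::thread aiv([&] {                    // "AIV": vector post-processing
        while (!cube_done.load(std::memory_order_acquire)) {
            // spin until the AIC's result is visible in GM
        }
        for (int i = 0; i < kLen; ++i)
            gm[i] += 1;                      // e.g. an elementwise follow-up
    });
    aic.join();
    aiv.join();
}
```

The point of the sketch is only the data path: the two cores never exchange data directly; everything goes through GM, which is why the separated architecture needs explicit synchronization between the cube and vector stages.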

  • AIC architecture
    • Contains five parallel execution units (data transfer units and compute units): MTE1, MTE2, MTE3, Cube, and Scalar
    • Contains seven storage units: GM (outside the core), L1, L0A, L0B, L0C, BT Buffer, and FP Buffer
  • AIV architecture
    • Contains four parallel execution units: MTE2, MTE3, Vector, and Scalar
    • Contains two storage units: GM (outside the core) and UB
  • Typical compute data flow
    • Vector compute: GM-UB-Vector-UB-GM
    • Cube compute:
      • GM-L1-L0A/L0B-Cube-L0C-FixPipe-GM
      • GM-L1-L0A/L0B-Cube-L0C-FixPipe-L1
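The vector compute data flow above can be sketched as a host-side C++ analogy (not Ascend C API): plain arrays stand in for the Global Memory (GM) and the Unified Buffer (UB), `memcpy` stands in for the MTE2/MTE3 transfers, and a scalar loop stands in for the Vector Unit. All names are illustrative assumptions.

```cpp
#include <cstring>

// Host-side analogy of the vector data flow GM-UB-Vector-UB-GM.
// gm_* arrays stand in for Global Memory; ub_* arrays stand in for
// the Unified Buffer (UB). Names are illustrative only.
constexpr int kTileLen = 8;  // one tile of data processed per pass

void vector_add(const float* gm_x, const float* gm_y, float* gm_z) {
    float ub_x[kTileLen], ub_y[kTileLen], ub_z[kTileLen];
    std::memcpy(ub_x, gm_x, sizeof(ub_x));   // MTE2: GM -> UB
    std::memcpy(ub_y, gm_y, sizeof(ub_y));   // MTE2: GM -> UB
    for (int i = 0; i < kTileLen; ++i)       // Vector Unit: elementwise add
        ub_z[i] = ub_x[i] + ub_y[i];
    std::memcpy(gm_z, ub_z, sizeof(ub_z));   // MTE3: UB -> GM
}
```

Here the three stages run sequentially for clarity; on the real hardware the transfer and compute units are parallel execution units, so the load, compute, and store of successive tiles can overlap.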