Compute Units
The compute units are the core of the AI Core's compute capability and include the Cube Unit, Vector Unit, and Scalar Unit. These three units process different types of data in the AI Core.
Scalar
The Scalar Unit performs scalar computation and controls the program flow. It can be regarded as a mini CPU that handles loop control, branching, address and parameter computation for Cube/Vector instructions, and basic arithmetic operations in a program. It can also control the execution pipelines of the other functional units in the AI Core by inserting synchronization instructions into the event synchronization module. Compared with the host CPU, the Scalar Unit has weaker computing capability and is mainly used to issue instructions. During performance optimization, you are advised to minimize if/else statements and scalar variable computation.
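The effect of this advice can be sketched in plain C++ (this is not Ascend C kernel code): loop control, branching, and offset arithmetic all run on the Scalar Unit, so resolving a branch once and advancing offsets by a constant stride keeps the per-iteration scalar work small. processTile is a hypothetical stand-in for issuing a vector operation.

#include <cstddef>

// Hypothetical stand-in for a vector operation issued on one tile of data.
void processTile(float* tile, std::size_t len, float factor) {
    for (std::size_t i = 0; i < len; ++i) tile[i] *= factor;
}

// Scalar-heavy: the branch is re-evaluated and the offset recomputed on every iteration.
void runTilesNaive(float* data, std::size_t tileCount, std::size_t tileLen, bool scale) {
    for (std::size_t i = 0; i < tileCount; ++i) {
        float factor = scale ? 2.0f : 1.0f;
        processTile(data + i * tileLen, tileLen, factor);
    }
}

// Scalar-light: the branch is resolved once and the offset advances by a constant stride.
void runTilesHoisted(float* data, std::size_t tileCount, std::size_t tileLen, bool scale) {
    const float factor = scale ? 2.0f : 1.0f;
    float* tile = data;
    for (std::size_t i = 0; i < tileCount; ++i, tile += tileLen) {
        processTile(tile, tileLen, factor);
    }
}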
As shown in the following figure, the Scalar Unit executes standard Arithmetic Logic Unit (ALU) statements when executing scalar operation instructions. The code segment and data segment (stack space) required by the ALU come from GM. The I-cache caches code segments; its size depends on hardware specifications. If the size is 16 KB or 32 KB, code segments are loaded in units of 2 KB. The D-cache caches data segments; its size also depends on hardware specifications. If the size is 16 KB, data segments are loaded in units of cache lines (64 bytes). Because accesses inside the core are efficient, ensure that code segments and data segments are cached in the I-cache and D-cache to avoid accesses outside the core. In addition, based on the load unit, you can adjust the amount of data loaded at a time during programming to improve loading efficiency. For example, when data is loaded into the D-cache, loading efficiency is highest if the start address of the data in memory is aligned to the cache line size (64 bytes).

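A minimal C++ sketch of the 64-byte alignment mentioned above, assuming only the standard library; alignUp is an illustrative helper, not a CANN or Ascend C API.

#include <cstdint>
#include <cstddef>

constexpr std::size_t kDCacheLine = 64;  // D-cache line size stated in the text above

// Round an address up to the next multiple of the given power-of-two alignment.
inline std::uintptr_t alignUp(std::uintptr_t addr, std::size_t alignment) {
    return (addr + alignment - 1) & ~static_cast<std::uintptr_t>(alignment - 1);
}

// Example: data placed at address 0x1030 would be accessed from alignUp(0x1030, kDCacheLine)
// == 0x1040, so that each 64-byte load maps onto exactly one cache line.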
The hardware provides an L2 cache to cache the data (including code segments and data segments) accessed from GM, thereby accelerating access and improving access efficiency. The L2 cache outside the core loads data in units of cache lines. The cache line size (128 bytes, 256 bytes, or 512 bytes) depends on hardware specifications.
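A short, hardware-agnostic C++ sketch of how a GM transfer size can be padded to a multiple of the L2 cache line size so that every loaded cache line is fully used; the 256-byte value is only an example, since the text above states the line size can be 128, 256, or 512 bytes depending on hardware specifications.

#include <cstddef>

constexpr std::size_t kL2CacheLine = 256;  // example value; hardware dependent

// Round a byte count up to the next multiple of the cache line size.
inline std::size_t padToCacheLine(std::size_t bytes) {
    return (bytes + kL2CacheLine - 1) / kL2CacheLine * kL2CacheLine;
}

// padToCacheLine(1000) == 1024, i.e. exactly 4 full 256-byte cache lines.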
Vector
The Vector Unit performs vector computation by executing vector instructions, which are similar to conventional Single Instruction, Multiple Data (SIMD) instructions. Each vector instruction performs the same operation on multiple operands. As shown in the following figure, the Vector Unit can quickly add or multiply two FP16 vectors. Vector instructions support multiple iterations and direct computation on vectors with intervals (strides).

All source and destination data of vector computation must be stored in the Unified Buffer (UB), and both the start address and the operation length must be 32-byte aligned.
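A conceptual C++ sketch of what a single vector-add instruction accomplishes; this is not the Ascend C API. A plain float is used as a stand-in for FP16 so the code compiles everywhere; on the hardware the elements are 2-byte half-precision values held in UB, and both the start addresses and the operation length must be 32-byte aligned (a multiple of 16 FP16 elements).

#include <cstddef>

// Element-wise addition of two vectors; the Vector Unit performs these lanes in parallel
// within one instruction rather than one element at a time.
void vectorAdd(const float* src0, const float* src1, float* dst, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = src0[i] + src1[i];
    }
}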
Cube
The Cube Unit performs matrix operations. It can complete the multiplication of matrix A (M x K) and matrix B (K x N) in a single operation. As shown in the following figure, the red dotted box marks the Cube Unit and the storage units it accesses: the left matrix A comes from L0A, the right matrix B comes from L0B, and L0C stores the matrix multiplication result and intermediate results.

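For reference, the computation the Cube Unit completes in one matrix operation can be written out in plain C++ (this is not hardware or Ascend C code): C (M x N) = A (M x K) * B (K x N). On the hardware, A is read from L0A, B from L0B, and the result and intermediate accumulations are held in L0C.

#include <cstddef>

// Straightforward matrix multiplication: C[m][n] = sum over k of A[m][k] * B[k][n].
void matmul(const float* A, const float* B, float* C,
            std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;  // intermediate accumulation (kept in L0C on the hardware)
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            C[m * N + n] = acc;
        }
    }
}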