Storage Units

To fully exploit the computing power of the AI processor, input data must be fed to the compute units in a timely and accurate manner. The memory system is therefore carefully designed to keep the compute units supplied with the data they need.

An AI Core includes multiple levels of local memory (internal storage) and loads data from global memory (external storage) into local memory for computation. The local memory of an AI Core includes the L1 Buffer, L0 Buffers, Unified Buffer, and so on. An AI Core also contains MTEs (memory transfer engines) that move and copy data, and that can convert the data format or type during the move.

Table 1 lists the storage units in the local memory of an AI Core.

Table 1 Description of storage units

L1 Buffer: A general-purpose part of the local memory, the L1 Buffer provides a large data buffer inside the AI Core. It can temporarily hold data that is reused repeatedly within the AI Core, reducing how often data is read from or written to the bus.

L0A Buffer/L0B Buffer: Store the inputs of Cube instructions.

L0C Buffer: Stores the outputs of Cube instructions, which also serve as partial inputs for accumulation.

Unified Buffer: Stores the inputs and outputs of Vector and Scalar computations.

BT Buffer: BiasTable buffer, which stores biases.

FP Buffer: FixPipe buffer, which stores quantization parameters and ReLU parameters.

Table 2 DMA units

MTE1: Moves data along the following paths:

  • L1->L0A/L0B
  • L1->UB (coupled architecture only)
  • L1->BT Buffer (separated architecture only)

MTE2: Moves data along the following paths:

  • GM->{L1, L0A/L0B}. Data is moved at the granularity of the fractal size; transfers that meet the cache line (512-byte) alignment requirement perform better.
  • GM->UB. Transfers sized to the cache line achieve better performance.

MTE3: Moves data along the path UB->GM.

FixPipe: Moves data along the following paths (only the separated architecture supports FixPipe; the data format or type can be converted during the move):

  • L0C->{GM, L1}
  • L1->FP Buffer

The size of a storage unit varies depending on the AI processor. Call GetCoreMemSize to obtain the size.

By default, all data read from or written to the GM through the DMA units is cached in the L2 cache to accelerate access and improve access efficiency. The L2 cache outside the core loads data in units of cache lines; the cache line size (128, 256, or 512 bytes) varies according to the hardware specifications.