Storage Units

To fully exploit the computing power of the AI processor, input data must be fed to the compute units in a timely and accurate manner. The memory system is therefore carefully designed to keep the compute units supplied with the data they need.

An AI Core includes multiple levels of local memory (internal storage) and loads data from global memory (external storage) into local memory for computation. The local memory of an AI Core includes the L1 Buffer, L0 Buffers, Unified Buffer, and so on. An AI Core also contains MTEs (memory transfer engines) that move and copy data, and that can convert the data format or type during the move.

Table 1 lists the storage units in the local memory of an AI Core.

Table 1 Description of storage units

L1 Buffer: A general-purpose part of the local memory, the L1 Buffer provides a large data buffer inside the AI Core. It can temporarily hold data that is reused repeatedly within the AI Core, reducing how often data is read from or written to the bus.

L0A Buffer/L0B Buffer: Store the inputs of Cube instructions.

L0C Buffer: Stores the outputs of Cube instructions, which also serve as partial inputs for accumulation.

Unified Buffer: Stores the inputs and outputs of Vector and Scalar computations.

BT Buffer: BiasTable buffer, which stores biases.

FP Buffer: FixPipe buffer, which stores quantization parameters and ReLU parameters.

Table 2 DMA units

MTE1: Moves data along the following paths:

  • L1->L0A/L0B
  • L1->UB (coupled architecture only)
  • L1->BT Buffer (separated architecture only)

MTE2: Moves data along the following paths:

  • GM->{L1, L0A/L0B}. Data is moved at the granularity of the fractal size; transfers that meet the cache line (512-byte) alignment requirement perform better.
  • GM->UB. Transfers sized to the cache line achieve better performance.

MTE3: Moves data along the path UB->GM.

FixPipe: Moves data along the following paths (only the separated architecture supports FixPipe; the data format or type can be converted during the move):

  • L0C->{GM, L1}
  • L1->FP Buffer

The size of a storage unit varies depending on the AI processor. Call GetCoreMemSize to obtain the size.

By default, all data read from or written to the GM through the DMA units is cached in the L2 cache to accelerate access and improve access efficiency. The L2 cache outside the core loads data in units of cache lines; the cache line size (128, 256, or 512 bytes) varies according to the hardware specifications.