Storage Units
The compute units of an AI processor can run at full capacity only if input data reaches them promptly and accurately. The memory system is therefore carefully designed to sustain the data supply that the compute units require.
An AI Core contains multiple levels of local memory (internal storage) and loads data from global memory (GM, external storage) into local memory for computation. The local memory of an AI Core includes the L1 Buffer, the L0 Buffers, the Unified Buffer (UB), and others. An AI Core also contains MTEs, which move and copy data and can convert the data format or type during the transfer.
Table 1 shows the list of storage units in the local memory of an AI Core.
| Storage Unit | Description |
|---|---|
| L1 Buffer | As a general part of the local memory, the L1 Buffer offers a large data buffer in the AI Core. It can temporarily store data that needs to be repeatedly used in the AI Core, thereby reducing the frequency that data is read from or written to the bus. |
| L0A Buffer/L0B Buffer | They store the inputs of Cube instructions. |
| L0C Buffer | It stores the outputs of Cube instructions, which are part of the inputs for accumulation. |
| Unified Buffer | The Unified Buffer stores the inputs and outputs of Vector and Scalar computations. |
| BT Buffer | BiasTable Buffer, which stores biases. |
| FP Buffer | FixPipe buffer, which stores quantization parameters and ReLU parameters. |

| DMA Unit | Description |
|---|---|
| MTE1 | Moves data in the following paths: |
| MTE2 | Moves data in the following paths: |
| MTE3 | Moves data in the following path: UB->GM. |
| FixPipe | Moves data in the following paths: (Only the separated architecture supports FixPipe. The data format or type can be converted during data movement.) |
- The size of each storage unit varies with the AI processor. Call GetCoreMemSize to query the size.
- By default, all data read from or written to the GM through the DMA units is cached in the L2 cache to speed up access. The L2 cache, which sits outside the core, loads data in units of cache lines. The cache line size (128, 256, or 512 bytes) depends on the hardware specifications.