NPU Architecture Version 200x

This section describes the hardware architecture and functions of __NPU_ARCH__ version 200x. Here, 200 is the IP core ID and x is the configuration version of that IP core. The corresponding product model is Atlas inference products.

Hardware Architecture

Compute Units

Deployment of the Cube Unit and Vector Unit on the same core

In this architecture, the Cube Unit and Vector Unit are deployed on the same core and share the same Scalar Unit.

Vector Unit

  • The data of the Vector Unit comes from the Unified Buffer and must be 32-byte aligned.
  • Data is transferred from the L0C Buffer to the Unified Buffer through the Vector Unit.

Cube Unit

  • The Cube Unit can access the L0A Buffer, L0B Buffer, and L0C Buffer. The L0A Buffer stores the left matrix, the L0B Buffer stores the right matrix, and the L0C Buffer stores the matrix multiplication results and intermediate results.

Storage Units

Obtaining the memory size of a storage unit

You can query the memory size of each storage unit by calling the platform information acquisition API.

Minimum access granularities (alignment requirements) of each storage unit

Storage Unit      Alignment Requirement
Unified Buffer    32-byte aligned
L1 Buffer         32-byte aligned
L0A Buffer        512-byte aligned
L0B Buffer        512-byte aligned
L0C Buffer        64-byte aligned
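
The alignment requirements above can be captured in a small lookup helper. The following is a minimal sketch in plain C++, not part of any Ascend API: the names StorageUnit, AlignmentBytes, IsAligned, and AlignUp are hypothetical and introduced only for illustration. It shows how an offset could be checked or rounded up against the granularity of each storage unit.

```cpp
#include <cassert>
#include <cstdint>

// Storage units from the alignment table (hypothetical enum, for illustration only).
enum class StorageUnit { UnifiedBuffer, L1Buffer, L0ABuffer, L0BBuffer, L0CBuffer };

// Minimum access granularity in bytes, copied from the table.
constexpr std::uint32_t AlignmentBytes(StorageUnit unit) {
    switch (unit) {
        case StorageUnit::UnifiedBuffer: return 32;
        case StorageUnit::L1Buffer:      return 32;
        case StorageUnit::L0ABuffer:     return 512;
        case StorageUnit::L0BBuffer:     return 512;
        case StorageUnit::L0CBuffer:     return 64;
    }
    return 0;  // not reached for valid enumerators
}

// True if a byte offset inside the given unit meets its alignment requirement.
constexpr bool IsAligned(StorageUnit unit, std::uint32_t byteOffset) {
    return byteOffset % AlignmentBytes(unit) == 0;
}

// Rounds a byte offset up to the next aligned offset for the given unit.
constexpr std::uint32_t AlignUp(StorageUnit unit, std::uint32_t byteOffset) {
    return (byteOffset + AlignmentBytes(unit) - 1) / AlignmentBytes(unit) * AlignmentBytes(unit);
}
```

For instance, under this sketch an offset of 33 bytes in the Unified Buffer would be rounded up to 64, while an offset of 32 bytes in the L0C Buffer would be reported as misaligned.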

Recommended data layout formats for each storage unit

  • The following fractal formats are recommended for the L0A Buffer, L0B Buffer, and L0C Buffer:
    • L0A Buffer: FRACTAL_ZZ
    • L0B Buffer: FRACTAL_ZN
    • L0C Buffer: FRACTAL_NZ

    These formats are optimized for compute-intensive operations such as matrix multiplication on the Cube Unit and improve compute efficiency.

  • The FRACTAL_NZ format is recommended for the L1 Buffer. Data moved from the L1 Buffer to the L0A or L0B Buffer must be converted to the ZZ or ZN format, respectively, and keeping the data in NZ format in the L1 Buffer reduces this conversion overhead.
  • The Unified Buffer has no requirements on the data format.
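
To make the three fractal layouts concrete, the sketch below computes the element offset of a matrix entry under each format. It rests on assumptions that are not stated in this section: a fractal is taken to be 16 x 16 elements, the matrix dimensions are multiples of 16, the first letter of the format name gives the ordering across fractals, and the second letter gives the ordering inside a fractal (Z for row-major, N for column-major). All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kFractalDim = 16;                           // assumption: 16 x 16 fractal
constexpr std::size_t kFractalElems = kFractalDim * kFractalDim;  // elements per fractal

enum class Order { RowMajor, ColMajor };  // "Z" = RowMajor, "N" = ColMajor

// Linear index of (r, c) in a rows x cols grid traversed in the given order.
constexpr std::size_t Linear(Order order, std::size_t r, std::size_t c,
                             std::size_t rows, std::size_t cols) {
    return order == Order::RowMajor ? r * cols + c : c * rows + r;
}

// Element offset of (row, col) in a rows x cols matrix stored fractal by fractal.
// `outer` orders the fractals, `inner` orders the elements inside one fractal.
constexpr std::size_t FractalOffset(Order outer, Order inner, std::size_t row, std::size_t col,
                                    std::size_t rows, std::size_t cols) {
    return Linear(outer, row / kFractalDim, col / kFractalDim,
                  rows / kFractalDim, cols / kFractalDim) * kFractalElems +
           Linear(inner, row % kFractalDim, col % kFractalDim, kFractalDim, kFractalDim);
}

// FRACTAL_ZZ: fractals row-major, elements inside a fractal row-major.
constexpr std::size_t OffsetZZ(std::size_t row, std::size_t col, std::size_t rows, std::size_t cols) {
    return FractalOffset(Order::RowMajor, Order::RowMajor, row, col, rows, cols);
}

// FRACTAL_ZN: fractals row-major, elements inside a fractal column-major.
constexpr std::size_t OffsetZN(std::size_t row, std::size_t col, std::size_t rows, std::size_t cols) {
    return FractalOffset(Order::RowMajor, Order::ColMajor, row, col, rows, cols);
}

// FRACTAL_NZ: fractals column-major, elements inside a fractal row-major.
constexpr std::size_t OffsetNZ(std::size_t row, std::size_t col, std::size_t rows, std::size_t cols) {
    return FractalOffset(Order::ColMajor, Order::RowMajor, row, col, rows, cols);
}
```

Under these assumptions, element (17, 3) of a 32 x 32 matrix lands at a different offset in each layout, which is why moving data between buffers with different recommended formats involves a conversion.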

Resolving access conflicts of storage units to improve read/write performance

When multiple operations access the same bank or bank group of the Unified Buffer at the same time, a bank conflict may occur. Bank conflicts include read-write, write-write, and read-read conflicts. Conflicting accesses are queued, which degrades performance. Read/write performance can be improved by optimizing bank allocation. For details, see Avoiding Bank Conflicts in the Unified Buffer.

DMA Units

Alignment requirements during movement

Because moved data is consumed by computation, the size of each transfer is constrained. Data moved to the Unified Buffer must be aligned to the data block size. Data moved to other storage units must be aligned to the fractal size. For example, moving data from the L1 Buffer to the L0A Buffer converts the format from NZ to ZZ, so the amount of data moved must be a multiple of the fractal size. If the remaining size of the L1 Buffer is less than one fractal, an exception occurs during hardware execution.
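
The size rules above can be sketched as a few arithmetic helpers. This is an illustrative model only. It assumes a data block of 32 bytes (matching the Unified Buffer alignment) and a fractal of 512 bytes (16 x 16 half-precision elements, matching the L0A/L0B alignment). Neither constant nor any function name below comes from an Ascend API.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kDataBlockBytes = 32;  // assumption: one data block is 32 bytes
constexpr std::uint32_t kFractalBytes = 512;   // assumption: 16 x 16 half-precision fractal

constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b) {
    return (a + b - 1) / b;
}

// Number of whole data blocks needed to hold `bytes` in the Unified Buffer.
constexpr std::uint32_t DataBlocksForUb(std::uint32_t bytes) {
    return CeilDiv(bytes, kDataBlockBytes);
}

// Bytes occupied in the Unified Buffer once the tail is padded to a whole block.
constexpr std::uint32_t PaddedUbBytes(std::uint32_t bytes) {
    return DataBlocksForUb(bytes) * kDataBlockBytes;
}

// True if `fractalCount` whole fractals can be read starting at `l1Offset`
// in an L1 Buffer of `l1Size` bytes. A false result models the case where
// less than one full fractal remains, which the text says raises an exception.
constexpr bool CanMoveFractals(std::uint32_t l1Size, std::uint32_t l1Offset,
                               std::uint32_t fractalCount) {
    return fractalCount >= 1 && l1Offset <= l1Size &&
           (l1Size - l1Offset) / kFractalBytes >= fractalCount;
}
```

A check of this kind, run on the host before launching a kernel, is one way to catch a transfer that would otherwise read past the end of the L1 Buffer.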

Synchronization Control

Intra-core synchronization

The execution units (such as the MTE2 and Vector Unit) within the AI Core operate asynchronously and in parallel. As a result, data dependencies may arise when reading data from or writing data to the local memory (such as the Unified Buffer). To ensure data consistency and computational accuracy, synchronization control is required to coordinate the timing of operations.

For example, consider a flow in which the MTE2 moves data from the global memory (GM) to the Unified Buffer (UB), the Vector Unit performs Abs computation on the data, and the result is then moved back to the GM. The following synchronization conditions must be met:

  1. Data movement and computation sequence
    • Start the Abs computation on the Vector Unit only after the data has been completely moved from the GM to the UB. Otherwise, the computation reads incomplete data.
    • Move data from the UB to the GM only after the vector computation is complete, so that the result data is ready.
  2. Synchronization rules for cyclic data movement and computation scenarios
    • Start a new data movement into the UB only after the previous computation is complete. Otherwise, input data that is still in use in the UB is overwritten.
    • Start a new computation only after the previous result has been completely moved out of the UB. Otherwise, the result that is still being moved out is overwritten.
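
The four conditions above can be illustrated with a small single-threaded simulation. This is a sketch in plain C++, not Ascend C code: the Flag struct is a hypothetical stand-in for a hardware event, the UB is modeled as one input buffer and one output buffer, and the unit that moves results out is assumed to be a separate copy-out engine (labeled MTE3 in the comments). Each pass of the loop gives every unit one chance to run, and a unit proceeds only if the flags it depends on are raised.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for one hardware event: Set() raises it, TryWait() consumes it.
struct Flag {
    bool raised = false;
    void Set() { raised = true; }
    bool TryWait() {
        if (!raised) return false;
        raised = false;
        return true;
    }
};

// Computes |x| for every element of gmIn, tile by tile, through two simulated
// UB buffers. tileLen must be greater than zero.
std::vector<int> RunAbsPipeline(const std::vector<int>& gmIn, std::size_t tileLen) {
    const std::size_t tiles = (gmIn.size() + tileLen - 1) / tileLen;
    std::vector<int> gmOut(gmIn.size());
    std::vector<int> ubIn(tileLen), ubOut(tileLen);
    auto tileSize = [&](std::size_t t) { return std::min(tileLen, gmIn.size() - t * tileLen); };

    Flag inReady, inFree, outReady, outFree;
    inFree.Set();   // both UB buffers start out empty
    outFree.Set();

    std::size_t loaded = 0, computed = 0, stored = 0;
    while (stored < tiles) {
        // Copy-out unit (assumed MTE3): runs only after the computation has finished.
        if (outReady.TryWait()) {
            std::copy_n(ubOut.begin(), tileSize(stored), gmOut.begin() + stored * tileLen);
            ++stored;
            outFree.Set();  // the next computation may now overwrite ubOut
        }
        // Vector Unit: needs loaded input and an output buffer that was moved out.
        if (inReady.raised && outFree.raised) {
            inReady.TryWait();
            outFree.TryWait();
            for (std::size_t i = 0; i < tileSize(computed); ++i) ubOut[i] = std::abs(ubIn[i]);
            ++computed;
            inFree.Set();   // the next data movement may now overwrite ubIn
            outReady.Set();
        }
        // Copy-in unit (MTE2): runs only after the previous computation released ubIn.
        if (loaded < tiles && inFree.TryWait()) {
            std::copy_n(gmIn.begin() + loaded * tileLen, tileSize(loaded), ubIn.begin());
            ++loaded;
            inReady.Set();
        }
    }
    return gmOut;
}
```

The four flags correspond one-to-one to the four rules: inReady and outReady enforce the movement and computation sequence, while inFree and outFree enforce the rules for the cyclic case.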

The synchronization control process is shown in the following figure.

In the preceding figure, ID1, ID2, ID3, ID4, ID5, and ID6 represent event IDs. Each event ID corresponds to the movement status of a piece of stored data, ensuring the correctness and consistency of data operations.

Note the following:
  • You are advised to obtain the event ID through the AllocEventID or FetchEventID API to ensure its validity.
  • The number of event IDs is limited. After using an event ID, call ReleaseEventID immediately to release it. This prevents event ID exhaustion and ensures normal system operation.
  • SetFlag and WaitFlag must be used in pairs, and their parameters must be identical (both the template parameter and the event ID). If they do not match, the computation on the current core may be abnormal, or the execution of the operator on the next core may be affected, causing a timeout.

    For example, SetFlag<HardEvent::S_MTE3>(1) and SetFlag<HardEvent::MTE3_MTE1>(1) do not set the same event, even though both pass the ID 1, because their template parameters are different. Two calls refer to the same event only when both the template parameter and the event ID are the same.

  • Do not set the same event ID consecutively. Otherwise, the event status may be disordered or not correctly processed.
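
The pairing and allocation rules in these notes can be modeled in a few lines. The sketch below is not the Ascend C API: EventModel and EventIdPool are hypothetical classes, the pool capacity is an arbitrary parameter, and the model returns false where the real hardware would hang or misbehave. It treats an event as the pair of a HardEvent value and a numeric ID, which is the behavior the S_MTE3 / MTE3_MTE1 example describes.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

enum class HardEvent { S_MTE3, MTE3_MTE1 };  // the two values named in the text

// Hypothetical model of event flags keyed by (HardEvent, id).
class EventModel {
public:
    // Raises the event. Returns false if it is already raised (set twice in a row).
    bool Set(HardEvent ev, int id) { return pending_.insert(std::make_pair(ev, id)).second; }
    // Consumes the event. Returns false if no matching Set exists; the real
    // WaitFlag would block in that case instead of returning.
    bool Wait(HardEvent ev, int id) { return pending_.erase(std::make_pair(ev, id)) == 1; }

private:
    std::set<std::pair<HardEvent, int>> pending_;
};

// Hypothetical model of a limited pool of event IDs. capacity must not be negative.
class EventIdPool {
public:
    explicit EventIdPool(int capacity) : used_(capacity, false) {}
    // Returns a free ID, or -1 if the pool is exhausted.
    int Alloc() {
        for (int i = 0; i < static_cast<int>(used_.size()); ++i) {
            if (!used_[i]) {
                used_[i] = true;
                return i;
            }
        }
        return -1;
    }
    // Returns an ID to the pool so that later allocations can reuse it.
    void Release(int id) {
        if (id >= 0 && id < static_cast<int>(used_.size())) used_[id] = false;
    }

private:
    std::vector<bool> used_;
};
```

In this model, setting S_MTE3 with ID 1 and MTE3_MTE1 with ID 1 raises two independent events, a wait with a different ID never matches, and an ID that is not released stays unavailable to later allocations.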

Inter-core synchronization

This hardware architecture does not support inter-core synchronization.