Abstract Hardware Architecture
The AI Core is the compute core of the Ascend AI Processor, and each Ascend AI Processor contains multiple AI Cores. This section describes the abstract parallel compute architecture of the AI Core, which hides the differences between hardware platforms. When you program with Ascend C, this abstraction hides hardware details and significantly lowers the development barrier. For more details about the hardware architecture and its principles, see Hardware Implementation.
An AI Core consists of core components such as the compute units, storage units, and direct memory access (DMA) units.
- Compute units include three types of basic computing resources: Cube Unit, Vector Unit, and Scalar Unit.
- Storage units include local memory and global memory.
  - The internal storage units of an AI Core are called local memory. Their data corresponds to the LocalTensor type.
  - The external storage units that an AI Core can access are called global memory. Their data corresponds to the GlobalTensor type.
- Direct memory access (DMA) units move data between the global memory and local memory and between local memories at different levels.

| Category | Component | Component Functions |
|---|---|---|
| Compute Units | Scalar | Performs scalar computations such as address computation and loop control, and issues instructions of vector computation, matrix computation, data transfer, and synchronization to corresponding units for execution. |
| Compute Units | Vector | Performs vector operations. |
| Compute Units | Cube | Performs matrix operations. |
| Storage Units | Local Memory | Internal storage of the AI Core. |
| DMA Units | DMA | Moves data between the global memory and local memory and between local memories at different levels. |

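
The division of labor between storage units, DMA units, and compute units can be sketched in plain C++. This is an illustrative model only, not Ascend C code: `GlobalBuffer`, `LocalBuffer`, `DmaCopyIn`, `DmaCopyOut`, `VectorAdd`, and `AddTiled` are hypothetical names introduced here, and the tile length stands in for the limited capacity of local memory. The point it shows is that compute functions touch local memory only, while every transfer to or from global memory goes through a DMA step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the two storage levels (not the Ascend C types).
struct GlobalBuffer { std::vector<float> data; };  // external memory reachable by the AI Core
struct LocalBuffer  { std::vector<float> data; };  // internal memory of one AI Core

// DMA role: copy `count` elements from global memory into local memory.
void DmaCopyIn(const GlobalBuffer& src, std::size_t offset, std::size_t count, LocalBuffer& dst) {
    assert(offset + count <= src.data.size());
    dst.data.assign(src.data.begin() + offset, src.data.begin() + offset + count);
}

// DMA role: copy a local buffer back to global memory at `offset`.
void DmaCopyOut(const LocalBuffer& src, GlobalBuffer& dst, std::size_t offset) {
    assert(offset + src.data.size() <= dst.data.size());
    std::copy(src.data.begin(), src.data.end(), dst.data.begin() + offset);
}

// Vector Unit role: element-wise add that reads and writes local memory only.
void VectorAdd(const LocalBuffer& a, const LocalBuffer& b, LocalBuffer& out) {
    assert(a.data.size() == b.data.size());
    out.data.resize(a.data.size());
    for (std::size_t i = 0; i < a.data.size(); ++i) {
        out.data[i] = a.data[i] + b.data[i];
    }
}

// Processes the inputs tile by tile, as a core with limited local memory would.
GlobalBuffer AddTiled(const GlobalBuffer& x, const GlobalBuffer& y, std::size_t tileLen) {
    assert(x.data.size() == y.data.size() && tileLen > 0);
    GlobalBuffer z{std::vector<float>(x.data.size())};
    LocalBuffer xLocal, yLocal, zLocal;
    for (std::size_t off = 0; off < x.data.size(); off += tileLen) {
        const std::size_t count = std::min(tileLen, x.data.size() - off);  // last tile may be short
        DmaCopyIn(x, off, count, xLocal);
        DmaCopyIn(y, off, count, yLocal);
        VectorAdd(xLocal, yLocal, zLocal);
        DmaCopyOut(zLocal, z, off);
    }
    return z;
}
```

In this model the steps run one after another on a single thread. On the hardware described in this section they are issued to separate units and can overlap.
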
With this hardware abstraction in mind, note the following three processes: the asynchronous instruction stream, the synchronization signal stream, and the computing data stream.
- Asynchronous parallel computing process of the AI Core: The Scalar Unit reads the instruction sequence and issues the vector computation, matrix computation, and data transfer instructions to the instruction queues of other units. The Vector Unit, Cube Unit, and DMA units asynchronously execute the received instructions in parallel. For details, see the blue instruction streams in Figure 1.
- Different instructions may depend on each other. To ensure that instructions in different instruction queues execute in the correct order, the Scalar Unit also issues synchronization instructions to the other units. For details about the synchronization process between units, see the green synchronization signal streams in Figure 1.
- The internal data processing of the AI Core is as follows: The DMA copy-in unit moves data from the global memory to the local memory. The Vector/Cube Units complete data computation and write the computation result back to the local memory. Then, the DMA copy-out unit moves the processed data from the local memory back to the global memory. For details, see the red data streams in Figure 1.
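
The three processes above can be imitated on a CPU with standard C++ threads. The sketch below is a simplified model under stated assumptions, not the hardware mechanism or the Ascend C API: `SyncFlag`, `Unit`, and `RunPipeline` are hypothetical names, each unit is modeled as a thread that drains its own instruction queue, synchronization signals are modeled as counting flags, and the "computation" is a placeholder that doubles each element. The calling thread plays the Scalar Unit: it only issues instructions and never touches the data itself.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical synchronization signal: one unit sets it, another waits on it.
class SyncFlag {
public:
    void Set() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();
    }
    void Wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
};

// Hypothetical execution unit: owns an instruction queue and drains it on its own thread.
class Unit {
public:
    Unit() : worker_([this] { Run(); }) {}
    ~Unit() {
        Issue(nullptr);  // an empty instruction acts as the stop marker
        worker_.join();
    }
    void Issue(std::function<void()> instr) {
        { std::lock_guard<std::mutex> lock(m_); queue_.push(std::move(instr)); }
        cv_.notify_one();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> instr;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                instr = std::move(queue_.front());
                queue_.pop();
            }
            if (!instr) return;
            instr();  // instructions within one queue execute in issue order
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    std::thread worker_;  // declared last so the queue exists before the thread starts
};

// Scalar Unit role: issue instructions in program order; the units execute them asynchronously.
std::vector<float> RunPipeline(const std::vector<float>& globalIn, std::size_t tileLen) {
    assert(tileLen > 0);
    const std::size_t numTiles = (globalIn.size() + tileLen - 1) / tileLen;
    std::vector<float> globalOut(globalIn.size());
    std::vector<std::vector<float>> local(numTiles);  // one local buffer per tile, so no buffer is reused
    SyncFlag copyInDone, computeDone;
    {
        Unit dmaIn, vectorUnit, dmaOut;
        for (std::size_t t = 0; t < numTiles; ++t) {
            const std::size_t off = t * tileLen;
            const std::size_t count = std::min(tileLen, globalIn.size() - off);
            dmaIn.Issue([&, t, off, count] {  // data stream: global -> local
                local[t].assign(globalIn.begin() + off, globalIn.begin() + off + count);
                copyInDone.Set();
            });
            vectorUnit.Issue([&, t] {         // compute waits for its input tile
                copyInDone.Wait();
                for (float& v : local[t]) v *= 2.0f;
                computeDone.Set();
            });
            dmaOut.Issue([&, t, off] {        // data stream: local -> global
                computeDone.Wait();
                std::copy(local[t].begin(), local[t].end(), globalOut.begin() + off);
            });
        }
    }  // destructors drain the queues and join the unit threads
    return globalOut;
}
```

Because each queue runs in issue order and each flag counts its signals, the copy-in of a later tile can overlap with the computation of an earlier one, while the computation of any tile still starts only after that tile has arrived in local memory. A call such as `RunPipeline(input, 2)` returns the input with every element doubled.
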
