Heterogeneous Computing

Ascend C programming involves developing code for two different platforms: the host and the device. This chapter briefly describes the differences between the host and the device to help you understand the heterogeneous system. It also describes the operator data flow, to help you further understand how to properly arrange the execution of operator code in the heterogeneous architecture.

Differences Between the Host CPU and Device NPU

  • Different hardware resources

    The CPU is designed for general-purpose computing tasks, but it is inefficient at large-scale parallel computing tasks (such as matrix multiplication and batch processing). By contrast, the NPU is designed to accelerate machine learning and deep learning workloads and excels at massively parallel computation, for which it contains dedicated hardware units. Specifically, the Cube unit handles matrix computation, with a single core completing a 16 x 16 x 16 fp16 matrix multiplication within one clock cycle, and the Vector unit handles vector computation, with a single core processing 128 fp16 additions within one clock cycle. (A scalar reference for the Cube figure follows this list.)

  • Different physical memory spaces

    The host and the device have physically separate memories, yet data sometimes needs to be exchanged between them.

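To put the Cube figure in perspective, a 16 x 16 x 16 fp16 matrix multiplication amounts to 16 x 16 x 16 = 4,096 multiply-accumulate operations. The plain C++ reference below (using float, since standard C++ has no built-in fp16 type) spells out the work a single Cube core completes in one clock cycle; it is an illustration, not NPU code.

    // Scalar reference for one Cube-unit operation: C += A * B for
    // 16 x 16 matrices. The triple loop performs 16 * 16 * 16 = 4096
    // multiply-accumulates, the work a single AI Core Cube unit
    // finishes in one clock cycle on fp16 data.
    constexpr int kDim = 16;

    void CubeReference(const float a[kDim][kDim], const float b[kDim][kDim],
                       float c[kDim][kDim]) {
        for (int i = 0; i < kDim; ++i) {
            for (int j = 0; j < kDim; ++j) {
                for (int k = 0; k < kDim; ++k) {
                    c[i][j] += a[i][k] * b[k][j];  // one multiply-accumulate
                }
            }
        }
    }
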
How to Properly Arrange Operator Code

The host and device can be viewed as a collaborative heterogeneous system in which each processing unit is assigned the work it is best suited for: non-computing-intensive tasks (usually scalar tasks) are recommended for the host, while computing-intensive tasks are recommended for the device. The Single Instruction/Multiple Data (SIMD) capability of the device NPU can be used to efficiently implement matrix and vector operations on batch data, as the contrast sketched below illustrates.

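To make the division of labor concrete, the fragment below contrasts a scalar loop (the CPU style) with the SIMD style used on the device. The scalar version is standard C++; the vector call is quoted from the public Ascend C add_custom sample, and its exact signature should be treated as an assumption that may vary across versions.

    // CPU style: a scalar loop handles one element per iteration.
    void AddScalar(const float* x, const float* y, float* z, int n) {
        for (int i = 0; i < n; ++i) {
            z[i] = x[i] + y[i];
        }
    }

    // Device style (Ascend C, as in the add_custom sample): one API call
    // covers a whole tile, which the Vector unit executes as wide SIMD
    // instructions, 128 fp16 additions per instruction per core:
    //
    //     AscendC::Add(zLocal, xLocal, yLocal, tileLength);
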
Ascend C operator implementation consists of two parts:

  • Tiling implementation on the host

    Because the internal storage of an AI Core on the NPU cannot hold all of an operator's input and output data, the input data is split into parts: the first part is transferred in, computed, and transferred out, then the next part follows, and so on. This process is called tiling, and the algorithm for splitting the data is called the tiling algorithm or tiling strategy. A dedicated computation program, called the tiling implementation or tiling function, determines the tiling parameters (such as the block size transferred each time and the total loop count) based on operator information such as the shape. The AI Core is not good at the scalar computation that the tiling implementation consists of, so this computation is executed independently on the host CPU (see the host-side sketch after this list).

  • Kernel implementation on the device

    Kernel implementation refers to the implementation of the operator kernel function. In the kernel function, the tiling structure transferred from the host is parsed to obtain the tiling information, which is used to control the process of transferring data into and out of the local memory. The operator logic is implemented by calling the compute, data transfer, memory management, and task synchronization APIs. The guiding principle is that computing-intensive tasks must be executed on the NPU. (A device-side sketch appears after the data-flow description below.)

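As a concrete illustration of the host half, the sketch below follows the shape of the tiling function in the public Ascend C add_custom sample. The structure fields, the fixed tile count, and the function name are illustrative assumptions, not a prescribed interface.

    // Host-side tiling sketch (illustrative; names and the fixed tile
    // count are assumptions). The tiling function runs on the host CPU,
    // inspects the operator's shape, and fills the tiling structure that
    // is later sent to the device.
    #include <cstdint>

    struct AddCustomTilingData {
        uint32_t totalLength;  // total number of elements to process
        uint32_t tileNum;      // number of tiles to loop over
    };

    void ComputeTiling(uint32_t totalLength, AddCustomTilingData &tiling) {
        tiling.totalLength = totalLength;
        // Tiling strategy: a fixed tile count keeps the sketch simple; a
        // real strategy would derive it from the AI Core's local-memory
        // capacity so that each block fits on chip.
        tiling.tileNum = 8;
    }

In a real operator, the structure is declared through the tiling registration macros shown in the add_custom sample (BEGIN_TILING_DATA_DEF and TILING_DATA_FIELD_DEF) so that the framework can serialize it for the device; that machinery is omitted here.
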
Operator Data Flow

Data is exchanged between the host and device during operator execution. The following traces the specific data flow for passing the tiling parameters. First, the tiling algorithm on the host computes the tiling parameters based on the operator's input and output information and stores them in the tiling structure. Then, the tiling structure is sent from the host to the device. Finally, the operator on the device obtains and parses the tiling structure and executes the subsequent compute logic based on that information.

Figure 1 Operator tiling data flow
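
As a complement to the figure, the device-side end of this flow can be sketched as follows, modeled on the kernel in the Ascend C add_custom sample. GM_ADDR, __aicore__, and GET_TILING_DATA appear in that sample; the tiling field names match the hypothetical structure sketched earlier, and the per-tile body is elided.

    // Device-side sketch (modeled on the add_custom sample; treat names
    // as assumptions). The kernel parses the tiling structure sent from
    // the host and uses it to drive the transfer-in/compute/transfer-out
    // loop.
    extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y,
                                                     GM_ADDR z, GM_ADDR tiling) {
        GET_TILING_DATA(tilingData, tiling);  // parse the host's tiling struct
        uint32_t tileLength = tilingData.totalLength / tilingData.tileNum;
        for (uint32_t i = 0; i < tilingData.tileNum; ++i) {
            // 1. Transfer tile i (tileLength elements) from global memory
            //    into local memory.
            // 2. Compute on the tile (e.g., AscendC::Add).
            // 3. Transfer the result back to global memory.
        }
    }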