Overview
Ascend C operator implementation consists of two parts:
- Tiling implementation on the host
Because the internal storage of an AI Core in the NPU cannot hold all of an operator's input and output data at once, the input data is split into parts: one part is transferred in, computed, and transferred out, and then the next part follows. This process is called tiling, and the algorithm that splits the data is called the tiling algorithm or tiling strategy. A computation program, called the tiling implementation or tiling function, determines the tiling parameters (such as the size of the block transferred each time and the total number of loop iterations) based on operator information such as the shape. Because the AI Core is not good at the scalar computation that tiling involves, this computation is executed independently on the host CPU.
- Kernel implementation on the device
Kernel implementation refers to the implementation of the operator's kernel function. In the kernel function, the tiling structure transferred from the host is parsed to obtain the tiling information, which controls how data is transferred into and out of local memory. The operator logic is implemented by calling the compute, data-transfer, memory-management, and task-synchronization APIs. The core principle is that compute-intensive tasks are executed on the NPU.
This section describes operator tiling and kernel implementation in three typical scenarios: vector programming, matrix programming, and fusion operator programming. It also covers further programming details and API usage.