Operator Implementation
Figure 1 shows the process of implementing vector operators based on Ascend C.
- Operator analysis: Analyze the operator's mathematical expression, inputs, outputs, and computation logic, and identify the Ascend C APIs to be called.
- Kernel function definition: Define the entry point function of the Ascend C operator.
- Operator class implementation: Implement the operator class according to the vector programming paradigm to complete the internal implementation of the kernel function.
The following uses the ElemWise(Add) operator as an example to describe the steps in detail. For details about the complete code of the operator described in this sample, see add_custom.cpp.
Operator Analysis
Before developing the operator code, analyze the mathematical expression, input, output, and compute logic implementation of the operator, and specify the Ascend C APIs to be called.
- Specify the mathematical expression and computational logic of the operator.
Specify the mathematical expression of the Add operator as follows:
z = x + y
The computation logic is as follows: the vector computation APIs provided by Ascend C operate on LocalTensors. The input data must first be moved into on-chip storage; the compute API is then called to add the two inputs, and the final result is moved back to external storage. The following figure shows the computation logic of the Ascend C Add operator.
Figure 2 Operator computation logic
- Specify the input and output.
- The Add operator has two inputs, x and y, and outputs the result z.
- The supported input data type is half (float16), and so is the output data type.
- The supported input shape is (8, 2048), and so is the output shape.
- Both the input and output use the ND format.
- Define the kernel function name and parameters.
- You can customize the kernel function name. In this example, the kernel function is named add_custom.
- Based on the analysis of the operator input and output, the kernel function has three parameters: x, y, and z. x and y indicate the memory address of the input in the global memory, and z indicates the memory address of the output in the global memory.
- Specify the APIs required for operator implementation.
- Data is copied between local memory and global memory by calling DataCopy. For details about the data copy API, see the Ascend C API reference.
- This sample involves only the addition operation in vector computation. For details, see Vector Computation in the Ascend C API reference. Based on the preliminary analysis, the two-operand Add API can be used to implement x + y.
- The tensor data structure used in computation is managed by the queue, involving calling APIs such as EnQue and DeQue.
Based on the preceding analysis, the design specifications of the Ascend C Add operator are as follows.
| OpType | Add | | | |
|---|---|---|---|---|
| Operator input and output | name | shape | data type | format |
| | x (input) | (8, 2048) | half | ND |
| | y (input) | (8, 2048) | half | ND |
| | z (output) | (8, 2048) | half | ND |
| Kernel function name | add_custom | | | |
| Main APIs | DataCopy: data movement API | | | |
| | Add: two-operand vector instruction API | | | |
| | EnQue, DeQue, and others: queue management APIs | | | |
| Operator implementation file | add_custom.cpp | | | |
Kernel Function Definition
Define kernel functions based on the rules described in Kernel Function.
- Function Prototype Definition
In this sample, the function name is add_custom (the kernel function name can be customized). Based on the analysis of the operator input and output in Operator Analysis, the kernel function has three parameters: x, y, and z, where x and y indicate the input memory and z indicates the output memory. According to the rules described in Kernel Function, the function prototype is defined as follows. The __global__ function type qualifier identifies a kernel function that can be called by <<<>>>. The __aicore__ function type qualifier identifies a kernel function that is executed on the AI Core of the device. For convenience, the input parameters are declared with the GM_ADDR macro; for details about the definition of the GM_ADDR macro, see Kernel Function.
```cpp
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
}
```
- Call the Init and Process functions of the operator class. The Init function of the operator class completes memory initialization, and the Process function completes the core logic of the operator implementation. For details, see Operator Class Implementation.
```cpp
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    KernelAdd op;
    op.Init(x, y, z);
    op.Process();
}
```
- Encapsulate the call of the kernel function into the add_custom_do function for the main program to call. #ifndef ASCENDC_CPU_DEBUG indicates that the encapsulated function is used only when the operator is compiled and run on the NPU side; when the operator is compiled and run on the CPU side, the add_custom function can be called directly. According to Defining and Calling Kernel Functions, when the kernel function is called, in addition to the x, y, and z parameters, the execution configuration must also be passed in: blockDim (the number of cores that execute the kernel function), l2ctrl (a reserved parameter, set to nullptr), and stream (the stream that maintains the execution order of asynchronous operations in the application).
```cpp
#ifndef ASCENDC_CPU_DEBUG
// call of kernel function
void add_custom_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y, uint8_t* z)
{
    add_custom<<<blockDim, l2ctrl, stream>>>(x, y, z);
}
#endif
```
Operator Class Implementation
The kernel function calls the Init and Process functions of the operator class. This section describes how to implement the operator class based on the programming paradigm.
The following describes how to design the implementation process of the Add operator based on the vector programming paradigm. For details about the vector programming paradigm, see Vector Programming Paradigm. For details about the implementation process of the Add operator, see Figure 3.
- The implementation process of the Add operator consists of three basic tasks: CopyIn, Compute, and CopyOut. The CopyIn task moves the input tensors xGm and yGm from the global memory to the local memory and stores them in xLocal and yLocal, respectively. The Compute task performs the addition operation on xLocal and yLocal, and stores the computation result in zLocal. The CopyOut task moves the output data from zLocal to the output tensor zGm in the global memory.
- CopyIn and Compute tasks communicate and synchronize with each other through the VECIN queues inQueueX and inQueueY. Compute and CopyOut tasks communicate and synchronize with each other through the VECOUT queue outQueueZ.
- The memory used for interaction between tasks and the memory used by temporary variables are managed by the pipe memory management object.
The workflow is mainly implemented by the operator class, including the initialization function Init and core processing function Process. The Process function calls the three basic tasks. Some private members used in operator implementation are also included, such as the GlobalTensor and VECIN and VECOUT queues. Members of the KernelAdd operator class are as follows:
```cpp
class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    // Initialization function, which is used to initialize the memory
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z) {}
    // Core processing function, which implements the operator logic and calls the private member
    // functions CopyIn, Compute, and CopyOut to complete the three-stage pipeline of the vector operator
    __aicore__ inline void Process() {}

private:
    // Copy-in function, which is called by the core Process function to complete the CopyIn stage
    __aicore__ inline void CopyIn(int32_t progress) {}
    // Compute function, which is called by the core Process function to complete the Compute stage
    __aicore__ inline void Compute(int32_t progress) {}
    // Copy-out function, which is called by the core Process function to complete the CopyOut stage
    __aicore__ inline void CopyOut(int32_t progress) {}

private:
    AscendC::TPipe pipe;  // Pipe memory management object
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;  // Queues for the input data (QuePosition VECIN)
    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueZ;          // Queue for the output data (QuePosition VECOUT)
    AscendC::GlobalTensor<half> xGm;  // Global memory address objects: xGm and yGm are inputs, zGm is the output
    AscendC::GlobalTensor<half> yGm;
    AscendC::GlobalTensor<half> zGm;
};
```
The initialization function performs the following operations:
- Sets the global memory addresses of the input and output GlobalTensors.
In this sample, multi-core parallel computing is used: the data is split and distributed to multiple cores, and each Ascend C kernel function instance runs on one core and processes only its share of the data. In the initialization function, obtain the memory offset address, within the global memory, of the input and output data to be processed by this core, and set that offset address on the corresponding GlobalTensor.
For example, the memory offset address of input x in the global memory is set as follows:

```cpp
xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
```
The allocation scheme in this sample is as follows: the total data length TOTAL_LENGTH is 8 x 2048 elements, distributed evenly across eight cores, so each core processes BLOCK_LENGTH (2048) elements. x + BLOCK_LENGTH * GetBlockIdx() is the memory offset address of x in the global memory for the current core. After the offset address is obtained, call the SetGlobalBuffer API of the GlobalTensor class to set the start address and length of this core's global memory. For details, see Figure 4.
- Allocates memory to the input and output queues through the pipe memory management object.
For example, allocating memory to the queue of input x can be implemented by using the following code snippet:
```cpp
pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
```
For the data processed on a single core, data tiling is applied. In this sample, the per-core data is tiled into eight blocks (this block count is for reference only and does not imply optimal performance). Each block is further split into two sub-blocks to enable double buffering for parallel processing between pipeline stages.
In this way, the data on a single core (2048 elements) is tiled into 16 chunks, each of TILE_LENGTH (128) elements. The preceding code indicates that the pipe allocates BUFFER_NUM (2) memory blocks of TILE_LENGTH x sizeof(half) bytes each to inQueueX; each memory block can hold TILE_LENGTH (128) half-type elements. Figure 5 shows the data tiling.
The initialization function code is as follows:
```cpp
constexpr int32_t TOTAL_LENGTH = 8 * 2048;                            // total length of data
constexpr int32_t USE_CORE_NUM = 8;                                   // number of cores used
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // length computed by each core
constexpr int32_t TILE_NUM = 8;                                       // split data into 8 tiles for each core
constexpr int32_t BUFFER_NUM = 2;                                     // tensor num for each queue
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // separate to 2 parts, due to double buffer

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    // get start index for current core, core parallel
    xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    yGm.SetGlobalBuffer((__gm__ half*)y + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    zGm.SetGlobalBuffer((__gm__ half*)z + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    // pipe alloc memory to queue, the unit is bytes
    pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
}
```
Based on the vector programming paradigm, the implementation of kernel functions is divided into three basic tasks: CopyIn, Compute, and CopyOut. The three functions are called in the Process function as follows:
```cpp
__aicore__ inline void Process()
{
    // loop count needs to be doubled, due to double buffer
    constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;
    // tiling strategy, pipeline parallel
    for (int32_t i = 0; i < loopCount; i++) {
        CopyIn(i);
        Compute(i);
        CopyOut(i);
    }
}
```
Based on the algorithm analysis of the programming paradigm, the entire computation is divided into three stages. You need to compile the code of each stage separately. For details about the process, see Figure 3.
- Stage 1: CopyIn function implementation.
- Copy data from the GlobalTensor to the LocalTensor by calling DataCopy.
- Call EnQue to place the LocalTensor in the VECIN queue.
```cpp
__aicore__ inline void CopyIn(int32_t progress)
{
    // alloc tensors from queue memory
    AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
    AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
    // copy the progress-th tile from the global tensors to the local tensors
    AscendC::DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
    AscendC::DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);
    // enque the input tensors into the VECIN queues
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}
```
- Stage 2: Compute function implementation.
- Call DeQue to obtain the LocalTensor from the VECIN queue.
- Call the Ascend C API Add to perform vector computation.
- Call EnQue to place the computation result of LocalTensor to the VECOUT queue.
- Call FreeTensor to release the LocalTensor that is no longer used.
```cpp
__aicore__ inline void Compute(int32_t progress)
{
    // deque the input tensors from the VECIN queues
    AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
    AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
    AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
    // call the Add instruction for computation
    AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
    // enque the output tensor into the VECOUT queue
    outQueueZ.EnQue<half>(zLocal);
    // free the input tensors for reuse
    inQueueX.FreeTensor(xLocal);
    inQueueY.FreeTensor(yLocal);
}
```
- Stage 3: CopyOut function implementation.
- Call DeQue to obtain the LocalTensor from the VECOUT queue.
- Call DataCopy to copy the LocalTensor to the GlobalTensor.
- Call FreeTensor to reclaim the LocalTensors that are no longer used.
```cpp
__aicore__ inline void CopyOut(int32_t progress)
{
    // deque the output tensor from the VECOUT queue
    AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
    // copy the progress-th tile from the local tensor to the global tensor
    AscendC::DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
    // free the output tensor for reuse
    outQueueZ.FreeTensor(zLocal);
}
```
Running and Verification
After the kernel function (operator kernel program) is developed, you can compile a calling program on the host to launch the operator from a host application and verify that it runs correctly. There are two verification methods, on the CPU side and on the NPU side:
- The verification on the CPU side is implemented by calling the ICPU_RUN_KF CPU debugging macro and other APIs provided by the CPU debugging library.
- The verification on the NPU side is implemented by calling the kernel launch API or the kernel launch symbol <<<>>>, and the runtime API provided by AscendCL.
For details, see Kernel Launch.