Operator Implementation
Figure 1 shows the process of implementing vector operators based on Ascend C.
- Operator analysis: Analyze the operator's mathematical expression, inputs, outputs, and computation logic, and identify the Ascend C APIs to be called.
- Kernel function definition: Define the entry point function of the Ascend C operator.
- Operator class implementation: Implement the operator class according to the vector programming paradigm to complete the internal implementation of the kernel function.
The following uses the ElemWise(Add) operator as an example to describe the steps in detail. For details about the complete code of the operator described in this sample, see add_custom.cpp.
Operator Analysis
Before developing the operator code, analyze the mathematical expression, input, output, and compute logic implementation of the operator, and specify the Ascend C APIs to be called.
- Specify the mathematical expression and computational logic of the operator.
Specify the mathematical expression of the Add operator as follows:
z = x + y
The computation logic is as follows: the vector computation APIs provided by Ascend C operate on LocalTensors. The input data must first be moved into on-chip storage; the compute API is then called to add the two inputs, and the final result is moved back to external storage. The following figure shows the computation logic of the Ascend C Add operator.
Figure 2 Operator computation logic
- Specify the input and output.
- The Add operator has two inputs, x and y, and outputs the result z.
- The supported input data type is half (float16), and so is the output data type.
- The supported input shape is (8, 2048), and so is the output shape.
- Both the input and output use the ND format.
- Define the kernel function name and parameters.
- You can customize the kernel function name. In this example, the kernel function is named add_custom.
- Based on the analysis of the operator input and output, the kernel function has three parameters: x, y, and z. x and y indicate the memory address of the input in the global memory, and z indicates the memory address of the output in the global memory.
- Specify the APIs required for operator implementation.
- Data is copied between local memory and global memory by calling DataCopy. For details about the data copy API, see the Ascend C API reference.
- This sample involves only the addition operation in vector computation. For details, see Vector Computation in the Ascend C API reference. Based on the preliminary analysis, the two-operand Add API can be used to implement x + y.
- The tensor data structure used in computation is managed by the queue, involving calling APIs such as EnQue and DeQue.
Based on the preceding analysis, the design specifications of the Ascend C Add operator are as follows.
| OpType | Add | | | |
|---|---|---|---|---|
| Operator input and output | name | shape | data type | format |
| | x (input) | (8, 2048) | half | ND |
| | y (input) | (8, 2048) | half | ND |
| | z (output) | (8, 2048) | half | ND |
| Kernel function name | add_custom | | | |
| Main APIs | DataCopy: data movement API | | | |
| | Add: two-operand vector instruction API | | | |
| | EnQue, DeQue, and others: queue management APIs | | | |
| Operator implementation file | add_custom.cpp | | | |
Kernel Function Definition
Define kernel functions based on the rules described in Kernel Function.
- Function Prototype Definition
In this sample, the function name is add_custom (the kernel function name can be customized). Based on the analysis of the operator input and output in Operator Analysis, the kernel function has three parameters: x, y, and z, where x and y indicate the input memory and z indicates the output memory. According to the rules described in Kernel Function, the function prototype is defined as follows. The __global__ function type qualifier identifies a kernel function that can be called by <<<>>>. The __aicore__ function type qualifier identifies a kernel function that is executed on the AI Core of the device. For convenience, the input parameters are declared with the GM_ADDR macro; for details about the definition of the GM_ADDR macro, see Kernel Function.
```cpp
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
}
```
- Call the Init and Process functions of the operator class. The Init function of the operator class completes memory initialization, and the Process function completes the core logic of the operator implementation. For details, see Operator Class Implementation.
```cpp
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    KernelAdd op;
    op.Init(x, y, z);
    op.Process();
}
```
- Encapsulate the call of the kernel function into the add_custom_do function for the main program to call. #ifndef ASCENDC_CPU_DEBUG indicates that the encapsulated function is used only when the operator is compiled and run on the NPU side; when the operator is compiled and run on the CPU side, the add_custom function can be called directly. According to Defining and Calling Kernel Functions, when the kernel function is called, in addition to the x, y, and z parameters, the execution configuration must also be passed in: blockDim (the number of cores that execute the kernel function), l2ctrl (a reserved parameter, set to nullptr), and stream (the stream that maintains the execution order of asynchronous operations in the application).
```cpp
#ifndef ASCENDC_CPU_DEBUG
// call of kernel function
void add_custom_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y, uint8_t* z)
{
    add_custom<<<blockDim, l2ctrl, stream>>>(x, y, z);
}
#endif
```
Operator Class Implementation
The kernel function calls the Init and Process functions of the operator class. This section describes how to implement the operator class based on the programming paradigm.
The following describes how to design the implementation process of the Add operator based on the vector programming paradigm. For details about the vector programming paradigm, see Vector Programming Paradigm. For details about the implementation process of the Add operator, see Figure 3.
- The implementation process of the Add operator consists of three basic tasks: CopyIn, Compute, and CopyOut. The CopyIn task moves the input tensors xGm and yGm from the global memory to the local memory and stores them in xLocal and yLocal, respectively. The Compute task performs the addition operation on xLocal and yLocal, and stores the computation result in zLocal. The CopyOut task moves the output data from zLocal to the output tensor zGm in the global memory.
- CopyIn and Compute tasks communicate and synchronize with each other through the VECIN queues inQueueX and inQueueY. Compute and CopyOut tasks communicate and synchronize with each other through the VECOUT queue outQueueZ.
- The memory used for interaction between tasks and the memory used by temporary variables are managed by the pipe memory management object.
The workflow is mainly implemented by the operator class, including the initialization function Init and core processing function Process. The Process function calls the three basic tasks. Some private members used in operator implementation are also included, such as the GlobalTensor and VECIN and VECOUT queues. Members of the KernelAdd operator class are as follows:
```cpp
class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    // Initialization function, which is used to initialize the memory
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z) {}
    // Core processing function, which implements the operator logic and calls the private member
    // functions CopyIn, Compute, and CopyOut to complete the three-stage pipeline of the vector operator
    __aicore__ inline void Process() {}

private:
    // Copy-in function, which is called by the core Process function to complete the CopyIn stage
    __aicore__ inline void CopyIn(int32_t progress) {}
    // Compute function, which is called by the core Process function to complete the Compute stage
    __aicore__ inline void Compute(int32_t progress) {}
    // Copy-out function, which is called by the core Process function to complete the CopyOut stage
    __aicore__ inline void CopyOut(int32_t progress) {}

private:
    AscendC::TPipe pipe;  // Pipe memory management object
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;  // Queues for the input data (QuePosition VECIN)
    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueZ;          // Queue for the output data (QuePosition VECOUT)
    AscendC::GlobalTensor<half> xGm;  // Global memory address objects: xGm and yGm are inputs, zGm is the output
    AscendC::GlobalTensor<half> yGm;
    AscendC::GlobalTensor<half> zGm;
};
```
The initialization function performs the following operations:
- Sets the global memory addresses of the input and output GlobalTensors.
In this sample, multi-core parallel computing is used: the data is split and distributed to multiple cores, and each Ascend C kernel function instance runs on one core and processes only its share of the data. In the initialization function, obtain the memory offset address, within the global memory, of the input and output data to be processed by this core, and set that offset address on the corresponding GlobalTensor.
For example, the memory offset address of input x in the global memory is set as follows:

```cpp
xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
```
The allocation scheme in this sample is as follows: the total data length TOTAL_LENGTH is 8 x 2048 elements, distributed evenly across eight cores, so each core processes BLOCK_LENGTH (2048) elements. x + BLOCK_LENGTH * GetBlockIdx() is the memory offset address of x in the global memory for the current core. After the offset address is obtained, call the SetGlobalBuffer API of the GlobalTensor class to set the start address and length of this core's global memory. For details, see Figure 4.
- Allocates memory to the input and output queues through the pipe memory management object.
For example, allocating memory to the queue of input x can be implemented by using the following code snippet:
```cpp
pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
```
For the data processed on a single core, data tiling is applied. In this sample, the per-core data is tiled into eight blocks (this block count is for reference only and does not imply optimal performance). Each block is further split into two sub-blocks to enable double buffering for parallel processing between pipeline stages.
In this way, the data on a single core (2048 elements) is tiled into 16 chunks, each of TILE_LENGTH (128) elements. The preceding code indicates that the pipe allocates BUFFER_NUM (2) memory blocks of TILE_LENGTH x sizeof(half) bytes each to inQueueX; each memory block can hold TILE_LENGTH (128) half-type elements. Figure 5 shows the data tiling.
The initialization function code is as follows:
```cpp
constexpr int32_t TOTAL_LENGTH = 8 * 2048;                            // total length of data
constexpr int32_t USE_CORE_NUM = 8;                                   // number of cores used
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // length computed by each core
constexpr int32_t TILE_NUM = 8;                                       // split data into 8 tiles for each core
constexpr int32_t BUFFER_NUM = 2;                                     // tensor num for each queue
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // separate to 2 parts, due to double buffer

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    // get start index for current core, core parallel
    xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    yGm.SetGlobalBuffer((__gm__ half*)y + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    zGm.SetGlobalBuffer((__gm__ half*)z + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    // pipe alloc memory to queue, the unit is bytes
    pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
}
```
Based on the vector programming paradigm, the implementation of kernel functions is divided into three basic tasks: CopyIn, Compute, and CopyOut. The three functions are called in the Process function as follows:
```cpp
__aicore__ inline void Process()
{
    // loop count needs to be doubled, due to double buffer
    constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;
    // tiling strategy, pipeline parallel
    for (int32_t i = 0; i < loopCount; i++) {
        CopyIn(i);
        Compute(i);
        CopyOut(i);
    }
}
```
Based on the algorithm analysis of the programming paradigm, the entire computation is divided into three stages. You need to compile the code of each stage separately. For details about the process, see Figure 3.
- Stage 1: CopyIn function implementation.
- Copy data from the GlobalTensor to the LocalTensor by calling DataCopy.
- Call EnQue to place the LocalTensor in the VECIN queue.
```cpp
__aicore__ inline void CopyIn(int32_t progress)
{
    // alloc tensors from queue memory
    AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
    AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
    // copy the progress-th tile from the global tensors to the local tensors
    AscendC::DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
    AscendC::DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);
    // enque the input tensors into the VECIN queues
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}
```
- Stage 2: Compute function implementation.
- Call DeQue to obtain the LocalTensor from the VECIN queue.
- Call the Ascend C API Add to perform vector computation.
- Call EnQue to place the computation result of LocalTensor to the VECOUT queue.
- Call FreeTensor to release the LocalTensor that is no longer used.
```cpp
__aicore__ inline void Compute(int32_t progress)
{
    // deque the input tensors from the VECIN queues
    AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
    AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
    AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
    // call the Add instruction for computation
    AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
    // enque the output tensor into the VECOUT queue
    outQueueZ.EnQue<half>(zLocal);
    // free the input tensors for reuse
    inQueueX.FreeTensor(xLocal);
    inQueueY.FreeTensor(yLocal);
}
```
- Stage 3: CopyOut function implementation.
- Call DeQue to obtain the LocalTensor from the VECOUT queue.
- Call DataCopy to copy the LocalTensor to the GlobalTensor.
- Call FreeTensor to reclaim the LocalTensors that are no longer used.
```cpp
__aicore__ inline void CopyOut(int32_t progress)
{
    // deque the output tensor from the VECOUT queue
    AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
    // copy the progress-th tile from the local tensor to the global tensor
    AscendC::DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
    // free the output tensor for reuse
    outQueueZ.FreeTensor(zLocal);
}
```
Running and Verification
After the kernel function (operator kernel program) is developed, you can compile a calling program on the host to launch the operator from a host application and verify that it runs correctly. There are two verification methods, on the CPU side and on the NPU side:
- The verification on the CPU side is implemented by calling the ICPU_RUN_KF CPU debugging macro and other APIs provided by the CPU debugging library.
- The verification on the NPU side is implemented by calling the kernel launch API or the kernel launch symbol <<<>>>, and the runtime API provided by AscendCL.
For details, see Kernel Launch.