Operator Development Based on Kernel Launch Project

This document guides you through the following tasks, so you can experience the basic process of Ascend C operator development.

  1. Use Ascend C to develop the kernel function of the Add operator.
  2. Use the ICPU_RUN_KF CPU debugging macro to implement the runtime verification of the operator kernel function on the CPU.
  3. Use the <<<>>> kernel launch symbol to implement the runtime verification of the operator kernel function on the NPU.

Before development, set up the environment and analyze operators. The Ascend C operator development process is illustrated in the following figure.

Figure 1 Ascend C operator development process
  • Click vector operator sample to obtain the sample code.
  • To use this guide, you only need basic knowledge of C or C++. If you already have a basic understanding of the Ascend C programming model, this practice will deepen your grasp of the theory. If you are new to the Ascend C programming model, run the samples in this guide first, and then continue your study by following the instructions at the end of the guide.

Environment Setup

  • CANN software installation

    Before developing an operator, set up the development environment and the operating environment. For details about both environments and how to install them, see the CANN Software Installation Guide.

  • Environment variable configuration

    After the CANN software is installed, log in to the environment as the CANN running user and run the source ${install_path}/set_env.sh command to set the environment variables before building and running your application. ${install_path} indicates the CANN installation path, for example, /usr/local/Ascend/ascend-toolkit.
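
    For example, with the installation path above, the command is as follows (adjust the path to your actual environment):

    source /usr/local/Ascend/ascend-toolkit/set_env.sh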

Operator Analysis

Analyze the mathematical expression, inputs, outputs, and computation logic of the operator, and specify the Ascend C APIs to be called.

  1. Specify the mathematical expression and computation logic of an operator.

    Specify the mathematical expression of the Add operator as follows:

    z = x + y

    Computation logic: Ascend C provides LocalTensors as the operands of the vector compute APIs. The input data is first copied into the local memory of the AI Core, the compute API then adds the two inputs to produce the result, and the result is finally copied back to the global memory.

  2. Specify the input and output.
    • The Add operator has two inputs, x and y, and one output, z.
    • In this sample, the supported input data type is half (float16), and so is the output data type.
    • The supported input shape is (8, 2048), and so is the output shape.
    • The supported input format is ND, and so is the output format.
  3. Define the kernel function name and parameters.
    • In this sample, the kernel function is named add_custom.
    • Based on the analysis of the operator input and output, the kernel function has three parameters: x, y, and z. x and y indicate the memory addresses of the inputs in the global memory, and z indicates the memory address of the output in the global memory.
  4. Specify the APIs required for operator implementation.
    • Data needs to be copied between the local memory and the global memory. For details about the data copy APIs, see the Ascend C API reference. This sample calls DataCopy to copy data.
    • This sample involves only the addition operation in vector computation. For details, see Vector Computation (the vector compute APIs) in the Ascend C API reference. Based on the preliminary analysis, the two-operand Add API can be used to implement x + y.
    • Call AllocTensor and FreeTensor to allocate and free tensor data used in the computation.
    • Use queues for communication and synchronization between parallel pipelines, involving APIs such as EnQue and DeQue.

Based on the preceding analysis, the design specifications of the Ascend C Add operator are as follows.

Table 1 Ascend C Add operator design specifications

OpType: AddCustom

Operator input:

    name    shape       data type    format
    x       (8, 2048)   half         ND
    y       (8, 2048)   half         ND

Operator output:

    name    shape       data type    format
    z       (8, 2048)   half         ND

Kernel function name: add_custom

Main APIs:

    DataCopy: data movement API
    Add: two-operand vector instruction API
    AllocTensor and FreeTensor: memory management APIs
    EnQue and DeQue: queue management APIs

Operator implementation file: add_custom.cpp

Kernel Function Development

After the environment is set up and the operator is preliminarily analyzed, you can start to develop the Ascend C kernel function. Obtain the sample code directory from vector operator sample first. The following sample code is implemented in the add_custom.cpp file.

In this sample, multi-core parallel computing is used: the data is tiled and distributed to multiple cores for processing. The Ascend C kernel function runs on a single core and processes only part of the data. The allocation scheme is as follows: the total data length TOTAL_LENGTH is 8 × 2048 elements, evenly distributed across eight cores, so the size of the data processed by each core (BLOCK_LENGTH) is 2048 elements. The following kernel function focuses only on how each core processes its BLOCK_LENGTH of data.

  1. First, define the kernel function based on the rules described in Defining and Calling Kernel Functions and call the operator class functions Init and Process in the kernel function.
    extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
    {
        KernelAdd op;
        op.Init(x, y, z);
        op.Process();
    }
    
    • The __global__ function type qualifier identifies a kernel function that can be called with <<<>>>. The __aicore__ function type qualifier specifies that the kernel function is executed on the AI Core of the device. The __gm__ variable type qualifier must be added to each pointer argument to indicate that the pointer points to an address in the global memory. For a unified representation, the GM_ADDR macro is used to decorate the arguments. The GM_ADDR macro is defined as follows:
      #define GM_ADDR __gm__ uint8_t*
      
    • The Init function of the operator class completes memory initialization, and the Process function completes the core logic of operator implementation.
  2. Then, implement the operator class based on the vector programming paradigm. In this sample, the KernelAdd operator class is defined, with the following members:
    class KernelAdd {
    public:
        __aicore__ inline KernelAdd(){}
        // Initialization function, which is used to initialize memory.
        __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z){}
        // Core processing function, which implements the operator logic and calls the private member functions CopyIn, Compute, and CopyOut to complete the three-level pipeline operation of the vector operator
        __aicore__ inline void Process(){}
    
    private:
        // Move-in function, which completes the processing in the CopyIn phase. This function is called by the core Process function.
        __aicore__ inline void CopyIn(int32_t progress){}
        // Compute function, which completes the processing in the Compute phase. This function is called by the core Process function.
        __aicore__ inline void Compute(int32_t progress){}
        // Move-out function, which completes the processing in the CopyOut phase. This function is called by the core Process function.
        __aicore__ inline void CopyOut(int32_t progress){}
    
    private:
        AscendC::TPipe pipe;  // Pipe memory management object
        AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;  // Queue management object of the input data. The value of QuePosition is VECIN.
        AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueZ;  // Queue management object of the output data. The value of QuePosition is VECOUT.
        AscendC::GlobalTensor<half> xGm;  // Object for managing the input and output global memory addresses. xGm and yGm are input, and zGm is output.
        AscendC::GlobalTensor<half> yGm;
        AscendC::GlobalTensor<half> zGm;
    };
    

    The following figure illustrates the relationship between internal functions.

    Figure 2 Relationship between kernel functions

    Apart from the initialization performed in Init, the Process function schedules the three pipeline tasks: data move-in (CopyIn), computation (Compute), and data move-out (CopyOut). Focus on the implementation of these tasks.

  3. The initialization function Init is used to set the global memory addresses of the input and output global tensors, and allocate memory for the input and output queues by using the pipe memory management object.

    In this sample, the data is divided into eight blocks that are evenly distributed across eight cores, so each core processes BLOCK_LENGTH = 2048 elements. So how is this data tiling implemented?

    The address of the data processed by each core is obtained by adding an offset of GetBlockIdx() * BLOCK_LENGTH (the length of the data processed by each core) to the start address. In this way, data tiling for multi-core parallel computing is implemented.

    Taking input x as an example, x + BLOCK_LENGTH * GetBlockIdx() is the offset address of x in the global memory for the single-core program; for example, the core with GetBlockIdx() == 2 processes the 2048 elements starting at element offset 2 × 2048 = 4096. After the offset address is obtained, call the SetGlobalBuffer API of the GlobalTensor type to set the start address and length of the global memory of the core. The following figure shows the process.

    Figure 3 Multi-core parallel processing

    The above example implements data tiling in the multi-core processing scenario. How is data tiled in the single-core processing scenario?

    The data processed on a single core can be tiled further. In this sample, the data on each core is tiled into eight blocks (TILE_NUM = 8); this block count is for reference only and does not imply optimal performance. Each data block is then split into two sub-blocks (BUFFER_NUM = 2) to enable double buffering for parallel processing between pipeline stages.

    In this way, the 2048 elements on a single core are tiled into 16 chunks, each with TILE_LENGTH = 2048 / 8 / 2 = 128 elements. The pipe allocates BUFFER_NUM (two) memory blocks of TILE_LENGTH * sizeof(half) bytes each to inQueueX, and each memory block can hold TILE_LENGTH (128) half-type elements. See the following figure.

    Figure 4 Single-core data tiling

    The initialization function code is as follows:

    #include "kernel_operator.h"
    
constexpr int32_t TOTAL_LENGTH = 8 * 2048;                            // total length of data
constexpr int32_t USE_CORE_NUM = 8;                                   // number of cores used
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // length computed by each core
constexpr int32_t TILE_NUM = 8;                                       // number of tiles the data of each core is split into
constexpr int32_t BUFFER_NUM = 2;                                     // tensor num for each queue
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // each tile is split into 2 parts for double buffering

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    // Set the start address and length of the global memory for the current core (multi-core parallelism).
    xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    yGm.SetGlobalBuffer((__gm__ half*)y + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    zGm.SetGlobalBuffer((__gm__ half*)z + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    // The pipe allocates memory to the queues; the unit is bytes.
    pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
}
    
  4. Based on the vector programming paradigm, the operator implementation is divided into three basic tasks: CopyIn, Compute, and CopyOut. These tasks are called in the Process function as follows:
__aicore__ inline void Process()
{
    // The loop count is doubled because double buffering is enabled.
    constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;
    // Tiling strategy: pipeline parallelism.
    for (int32_t i = 0; i < loopCount; i++) {
        CopyIn(i);
        Compute(i);
        CopyOut(i);
    }
}
    
    1. Implement the CopyIn function.
      1. Call DataCopy to copy data from GlobalTensor to LocalTensor.
      2. Call EnQue to place the LocalTensor in the VECIN queue.
__aicore__ inline void CopyIn(int32_t progress)
{
    // Allocate tensors from the queue memory.
    AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
    AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
    // Copy the progress-th tile from the global tensors to the local tensors.
    AscendC::DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
    AscendC::DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);
    // Enqueue the input tensors to the VECIN queue.
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}
      
    2. Implement the Compute function.
      1. Call DeQue to obtain the LocalTensors from the VECIN queue.
      2. Call the Ascend C API Add to perform the vector computation.
      3. Call EnQue to place the result LocalTensor in the VECOUT queue.
      4. Call FreeTensor to free the LocalTensors that are no longer used.
__aicore__ inline void Compute(int32_t progress)
{
    // Dequeue the input tensors from the VECIN queue.
    AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
    AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
    AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
    // Call the Add instruction to compute zLocal = xLocal + yLocal.
    AscendC::Add(zLocal, xLocal, yLocal, TILE_LENGTH);
    // Enqueue the output tensor to the VECOUT queue.
    outQueueZ.EnQue<half>(zLocal);
    // Free the input tensors for reuse.
    inQueueX.FreeTensor(xLocal);
    inQueueY.FreeTensor(yLocal);
}
      
    3. Implement the CopyOut function.
      1. Call DeQue to obtain the LocalTensor from the VECOUT queue.
      2. Call DataCopy to copy the LocalTensor to the GlobalTensor.
      3. Call FreeTensor to reclaim the LocalTensor that is no longer used.
__aicore__ inline void CopyOut(int32_t progress)
{
    // Dequeue the output tensor from the VECOUT queue.
    AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
    // Copy the progress-th tile from the local tensor to the global tensor.
    AscendC::DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
    // Free the output tensor for reuse.
    outQueueZ.FreeTensor(zLocal);
}
      

Runtime Verification of the Kernel Function

In CANN, the device-side NPU and the host-side CPU work together. After the kernel function is developed on the device side, you can write a program on the host that calls the kernel function, so that the operator can be executed from a host application.

In addition to the kernel function implementation file add_custom.cpp, the following files are required for calling and verifying kernel functions:

  • Application that calls the operator: main.cpp.
  • Script file for generating input data and truth data: gen_data.py.
  • Script for verifying whether the output data is consistent with the truth value: verify_result.py.
  • Compilation project file for compiling the operator running on the CPU or NPU: CMakeLists.txt.
  • Script for compiling and running the operator: run.sh.

This document describes only how to write the program that calls the operator, which is contained in main.cpp. You can obtain the other files from vector operator sample.
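
For reference, the following is a minimal sketch of how gen_data.py can generate the input data and the truth data with NumPy, based on the operator specifications above (half inputs of shape (8, 2048), truth value z = x + y). The actual sample script may differ in details such as file paths and value ranges.

    #!/usr/bin/python3
    import numpy as np

    def gen_golden_data_simple():
        # Two half-precision inputs with the (8, 2048) shape from the operator analysis.
        input_x = np.random.uniform(-100, 100, [8, 2048]).astype(np.float16)
        input_y = np.random.uniform(-100, 100, [8, 2048]).astype(np.float16)
        # Truth data: elementwise z = x + y.
        golden = (input_x + input_y).astype(np.float16)

        input_x.tofile("./input/input_x.bin")
        input_y.tofile("./input/input_y.bin")
        golden.tofile("./output/golden.bin")

    if __name__ == "__main__":
        gen_golden_data_simple()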

  1. Write the program framework on the host.
    The built-in ASCENDC_CPU_DEBUG macro is the flag that distinguishes the CPU logic from the NPU logic. Within the same main function, the CPU and NPU code paths are selected based on whether the ASCENDC_CPU_DEBUG macro is defined.
    #include "data_utils.h"
    #ifndef ASCENDC_CPU_DEBUG
    #include "acl/acl.h"
    extern void add_custom_do(uint32_t coreDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y, uint8_t* z);
    #else
    #include "tikicpulib.h"
    extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z);
    #endif
    
    int32_t main(int32_t argc, char* argv[])
    {
    size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);   // uint16_t represents half
    size_t outputByteSize = 8 * 2048 * sizeof(uint16_t);  // uint16_t represents half
        uint32_t blockDim = 8;
    
    #ifdef ASCENDC_CPU_DEBUG
        // Program for calling CPU debugging APIs
        
    #else
        // Program for calling the operator on the NPU
    
    #endif
        return 0;
    }
    
  2. Write the program for calling the CPU debugging APIs.
    To implement the runtime verification of the operator kernel function on the CPU, perform the following steps.
    Figure 5 Runtime verification on the CPU
        // Call GmAlloc to allocate shared memory and initialize data.
        uint8_t* x = (uint8_t*)AscendC::GmAlloc(inputByteSize);
        uint8_t* y = (uint8_t*)AscendC::GmAlloc(inputByteSize);
        uint8_t* z = (uint8_t*)AscendC::GmAlloc(outputByteSize);
    
        ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);
        ReadFile("./input/input_y.bin", inputByteSize, y, inputByteSize);
        // Call the ICPU_RUN_KF debugging macro to call the kernel function on the CPU.
        AscendC::SetKernelMode(KernelMode::AIV_MODE);
        ICPU_RUN_KF(add_custom, blockDim, x, y, z); // use this macro for cpu debug
        // Write output data.
        WriteFile("./output/output_z.bin", z, outputByteSize);
        // Call GmFree to release allocated resources.
        AscendC::GmFree((void *)x);
        AscendC::GmFree((void *)y);
        AscendC::GmFree((void *)z);
    
  3. Write the program for running the operator on the NPU.
    To implement the runtime verification of the operator kernel function on the NPU, perform the following steps.
    Figure 6 Runtime verification on the NPU
        // Initialize AscendCL.
        CHECK_ACL(aclInit(nullptr));
        // Allocate runtime resources.
        int32_t deviceId = 0;
        CHECK_ACL(aclrtSetDevice(deviceId));
        aclrtStream stream = nullptr;
        CHECK_ACL(aclrtCreateStream(&stream));
        // Allocate the host buffer.
        uint8_t *xHost, *yHost, *zHost;
        uint8_t *xDevice, *yDevice, *zDevice;
        CHECK_ACL(aclrtMallocHost((void**)(&xHost), inputByteSize));
        CHECK_ACL(aclrtMallocHost((void**)(&yHost), inputByteSize));
        CHECK_ACL(aclrtMallocHost((void**)(&zHost), outputByteSize));
        // Allocate the device buffer.
        CHECK_ACL(aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMalloc((void**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        // Initialize the host buffer.
        ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
        ReadFile("./input/input_y.bin", inputByteSize, yHost, inputByteSize);
        CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
        CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, yHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
    // Use the <<<>>> kernel launch symbol to call the kernel function. The <<<>>> call is encapsulated in add_custom_do.
        add_custom_do(blockDim, nullptr, stream, xDevice, yDevice, zDevice);
        CHECK_ACL(aclrtSynchronizeStream(stream));
        // Copy the computation result from the device to the host.
        CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST));
        WriteFile("./output/output_z.bin", zHost, outputByteSize);
        // Release allocated resources.
        CHECK_ACL(aclrtFree(xDevice));
        CHECK_ACL(aclrtFree(yDevice));
        CHECK_ACL(aclrtFree(zDevice));
        CHECK_ACL(aclrtFreeHost(xHost));
        CHECK_ACL(aclrtFreeHost(yHost));
        CHECK_ACL(aclrtFreeHost(zHost));
        // Deinitialize AscendCL.
        CHECK_ACL(aclrtDestroyStream(stream));
        CHECK_ACL(aclrtResetDevice(deviceId));
        CHECK_ACL(aclFinalize());
    
  4. Use the script to build and run the application in one-click mode.
    The script is executed as follows. <soc_version> indicates the model of the AI processor on which the operator runs, and <run_mode> indicates whether the operator runs in CPU mode or NPU mode.
    bash run.sh -r <run_mode> -v <soc_version>  
    1. Before running the script, set the ASCEND_INSTALL_PATH environment variable to the CANN installation path. The following is an example; change it to your actual installation path.
      export ASCEND_INSTALL_PATH=$HOME/Ascend/ascend-toolkit/latest
    2. Verify the running on the CPU and NPU.
      • Run the following command if the CPU is used. Replace <soc_version> in the command with the actual AI processor model.
        bash run.sh -r cpu -v <soc_version>

        The command output is as follows. The NumPy APIs are used to compute the absolute and relative errors between the output data and the truth data. If the deviation is within the tolerable range, the precision meets the requirement and "test pass" is displayed.

      • Run the following command if the NPU is used. Replace <soc_version> in the command with the actual AI processor model.
        bash run.sh -r npu -v <soc_version>

        The command output is as follows. The NumPy APIs are used to compute the absolute and relative errors between the output data and the truth data. If the deviation is within the tolerable range, the precision meets the requirement and "test pass" is displayed; a sketch of this check is shown at the end of this step.

      The AI processor model <soc_version> can be obtained in the following ways:

      • Run the npu-smi info command on the server where the Ascend AI processor is installed to obtain the Chip Name. <soc_version> is "Ascend" followed by the Chip Name value. For example, if Chip Name is xxxyy, the actual value is Ascendxxxyy.
      The following models are supported:
      • Atlas Training Series Product
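
      As mentioned above, verify_result.py compares the operator output with the truth data. The following is a minimal sketch of such a check; the tolerance values here are illustrative assumptions, and the actual sample script defines its own thresholds.

        #!/usr/bin/python3
        import sys
        import numpy as np

        # Illustrative tolerances for float16 results (assumption, not the sample's actual thresholds).
        RELATIVE_TOL = 1e-3
        ABSOLUTE_TOL = 1e-5

        def verify_result(output_path, golden_path):
            output = np.fromfile(output_path, dtype=np.float16)
            golden = np.fromfile(golden_path, dtype=np.float16)
            # Pass if |output - golden| <= atol + rtol * |golden| holds elementwise.
            if np.allclose(output, golden, rtol=RELATIVE_TOL, atol=ABSOLUTE_TOL):
                print("test pass")
                return True
            print("[ERROR] result error")
            return False

        if __name__ == "__main__":
            # Example usage: python3 verify_result.py output/output_z.bin output/golden.bin
            verify_result(sys.argv[1], sys.argv[2])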

Follow-Up Guide

If you are not familiar with concepts such as multi-core parallelism and pipeline programming used in this guide, see Programming Model to learn the basic concepts and then revisit this guide. If you already understand these concepts and have run the sample successfully, see Vector Programming for more details about Ascend C vector programming.