Customizing the Add Operator

This document guides you through the following tasks, so you can experience the basic process of Ascend C operator development.

  1. Analyze the operator to determine the mathematical expression and computation logic.
  2. Develop the kernel function of the Add operator.
  3. Run the kernel function for verification.

Before development, you need to set up the environment. The Ascend C operator development process is illustrated in the following figure.

Figure 1 Ascend C operator development process
  • Click the LINK to obtain the sample code.
  • To use this guide, you only need to have a basic knowledge of C or C++. You can deepen your theoretical understanding of the Ascend C programming model during practical operations based on the knowledge. If you have no idea about the Ascend C programming model, you can try to run the sample in the guide first, and further progress your study by referring to the instructions at the end of the guide.

Environment Setup

  • CANN software installation

    Before developing an operator, set up the development environment and operating environment. For details about the development environment and operating environment and how to install them, see CANN Software Installation Guide.

  • Environment variable configuration

    After the CANN package is installed, log in to the environment as the CANN operating user and run the source $INSTALL_DIR/set_env.sh command to set environment variables. Replace ${INSTALL_DIR} with the CANN component directory. For example, if the installation is performed by the root user, the default file storage path is /usr/local/Ascend/cann.

Operator Analysis

Analyze the mathematical expression, number of inputs and outputs, shape range, and computation logic implementation of an operator, and specify the Ascend C APIs to be called. The following uses the Add operator as an example to describe the analysis process.

  1. Specify the mathematical expression and computation logic of an operator.

    The Add operator is used as an example.

    The calculation logic is as follows: Data is moved from the global memory to the local memory, then the Ascend C compute API is used to add two input parameters to get a result, and the result is finally moved to the global memory.

  2. Specify the input and output.
    • The Add operator has two inputs, x and y, and outputs the result z.
    • The supported input data type is float, and so is the output data type.
    • The supported input shape is (8, 2048), and so is the output shape.
    • The supported input format is ND.
  3. Define the kernel function name and parameters.
    • In this sample, the kernel function is named add_custom.
    • Based on the operator input and output analysis, it is determined that the kernel function has three parameters: x, y, and z. x and y are the input parameters, and z is the output parameter.
  4. Specify the APIs required for operator implementation.
    • Move data between the local memory and global memory. For details about the data movement API, see the Ascend C API reference. Call DataCopy to move data.
    • This sample only involves the addition operation in vector computation. For details, see the vector compute API Vector Computation in the Ascend C API reference. Based on the preliminary analysis, the Add API Add can be used to implement x+y.
    • Call AllocTensor and FreeTensor to allocate and free tensor data used in the computation.
    • Use queues for synchronization between parallel pipelines, involving APIs such as EnQue and DeQue.

Based on the preceding analysis, the design specifications of the Ascend C Add operator are as follows.

Table 1 Design specifications of the Ascend C Add operator

OpType

AddCustom

Operator input

Name

Shape

Data Type

Format

x

(8, 2048)

float

ND

y

(8, 2048)

float

ND

Operator output

z

(8, 2048)

float

ND

Kernel function name

add_custom

Main APIs

DataCopy: data movement API

Add: vector basic arithmetic API

AllocTensor and FreeTensor: memory management APIs

EnQue and DeQue: queue management APIs

Operator implementation file

add_custom.asc

Kernel Function Development

After the environment is set up and the operator is preliminarily analyzed, you can start to develop the Ascend C kernel function. Obtain the sample code from LINK first. The following sample code is implemented in the add_custom.asc file.

In this sample, multi-core parallel computation is used. Data is tiled and allocated to multiple cores for processing. The Ascend C kernel function is a processing function on a core. It only processes part of the data. The allocation scheme is as follows: Assume that there eight cores are used. The total data length is defined as 8 × 2048. The data is evenly allocated across eight cores, each processing a data block of 2048 elements. For data processed on a single core, the data can also be tiled to implement parallel processing of data in pipelines.

  1. Design a structure AddCustomTilingData based on the allocation scheme to store parameters related to parallel data tiling. AddCustomTilingData defines two parameters: totalLength and tileNum. totalLength indicates that the total size of the data to be processed is 8 × 2048 elements, and tileNum indicates the number of data blocks to be computed by each core.
    1
    2
    3
    4
    5
    struct AddCustomTilingData
    {
        uint32_t totalLength;
        uint32_t tileNum;
    };
    
  2. Define the kernel function based on the rules described in Defining and Calling Kernel Functions and call the operator class functions Init and Process in the kernel function.
    1
    2
    3
    4
    5
    6
    7
    __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
    {
        KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);    // Set the kernel type to Vector core (used for vector computation).
        KernelAdd op;
        op.Init(x, y, z, tiling.totalLength, tiling.tileNum);
        op.Process();
    }
    
    • The __global__ function type qualifier is used to identify a kernel function that can be called by <<<>>>. The __aicore__ function type qualifier is used to specify that the kernel function is executed on the AI Core of the device. The variable type qualifier __gm__ needs to be added to the pointer argument variable, indicating that the pointer variable points to a memory address in the global memory. To unify the expression, use the GM_ADDR macro to modify the argument. The definition of the GM_ADDR macro is as follows:
      1
      #define GM_ADDR __gm__ uint8_t*
      
    • The Init function of the operator class completes memory initialization, and the Process function completes the core logic of operator implementation.
  3. Implement the operator class based on the vector programming paradigm. In this sample, the KernelAdd operator class is defined, with the following members:
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    class KernelAdd {
    public:
        __aicore__ inline KernelAdd(){}
        // Initialization function, which is used to initialize memory.
        __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum){}
        // Core processing function, which implements the operator logic and calls the private member functions CopyIn, Compute, and CopyOut to complete the three-level pipeline operation of the vector operator.
        __aicore__ inline void Process(){}
    
    private:
        // Copy-in function, which moves data from the global memory to the local memory. This function is called by the core Process function.
        __aicore__ inline void CopyIn(int32_t progress){}
        // Compute function, which is used to add two input parameters to get the final result. This function is called by the core Process function.
        __aicore__ inline void Compute(int32_t progress){}
        // Copy-out function, which moves the final result from the local memory to the global memory. This function is called by the core Process function.
        __aicore__ inline void CopyOut(int32_t progress){}
    
    private:
        AscendC::TPipe pipe;  // TPipe memory management object.
        AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;  // Queue management object of the input data. The value of TPosition is VECIN.
        AscendC::TQue<AscendC::TPosition::VECOUT, BUFFER_NUM> outQueueZ;  // Queue management object of the output data. The value of TPosition is VECOUT.
        AscendC::GlobalTensor<float> xGm;  // Object for managing the global memory addresses of the input and output data. xGm and yGm are input, and zGm is output.
        AscendC::GlobalTensor<float> yGm;
        AscendC::GlobalTensor<float> zGm;
        uint32_t blockLength; // Length of data computed by each core
        uint32_t tileNum; // Number of data blocks to be computed by each core
        uint32_t tileLength; // Length of each data block in each core
    };
    

    The following figure illustrates the relationship between internal functions.

    Figure 2 Relationship between kernel functions

    In addition to the initialization of the Init function, the pipeline tasks are moved-in, computed, and moved-out in Process. You can focus on the implementation of these tasks.

  4. The initialization function Init is used to set the global memory addresses of the input and output global tensors, and allocate memory for the input and output queues through TPipe.

    In this sample, data is divided into eight blocks and evenly allocated across eight cores. The size of data processed by each core is 2048. So how do we implement data tiling?

    The address of the data processed on each core needs to be obtained by adding an offset of GetBlockIdx() × blockLength (length of the data processed by each block) to the start address. In this way, data tiling of multi-core parallel compute is implemented.

    Taking input x as an example, x + blockLength × GetBlockIdx() is the memory offset address of x in the global memory of the single-core processing program. After the offset address is obtained, call the SetGlobalBuffer API of the GlobalTensor type to set the start address and length of the global memory of the core. The following figure shows the process.

    Figure 3 Multi-core parallel processing

    The above example implements data tiling in the multi-core processing scenario. How do we implement data tiling in the single-core processing scenario?

    For data processed on a single core, data tiling is applicable. In this sample, data is tiled into eight blocks. This number of blocks is for reference only, which does not mean the optimal performance. Each data block can be tiled into two sub-blocks to enable double buffering for parallel processing between pipelines.

    Therefore, the data (2048 elements) on a single core is partitioned into 16 tiles, each with a tileLength of 128. TPipe allocates two memory blocks whose size is tileLength × sizeof(float) bytes to inQueueX. Each memory block can contain the number of tileLength (128) half-type data. The following figure shows the data tiling.

    Figure 4 Single-core data tiling

    The initialization function code is as follows:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    // Header file required on the kernel
    #include "kernel_operator.h"
    
    constexpr int32_t BUFFER_NUM = 2;                                     // tensor num for each queue
    
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum)
    {
        
         this->blockLength = totalLength / AscendC::GetBlockNum();     // length computed of each core
         this->tileNum = tileNum;                                      // split data into 8 tiles for each core
         this->tileLength = this->blockLength / tileNum / BUFFER_NUM;  // separate to 2 parts, due to double buffer
         // get start index for current core, core parallel
         xGm.SetGlobalBuffer((__gm__ float *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
         yGm.SetGlobalBuffer((__gm__ float *)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
         zGm.SetGlobalBuffer((__gm__ float *)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
         // pipe alloc memory to queue, the unit is Bytes
         pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(float));
         pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength * sizeof(float));
         pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(float));
    }
    
  5. Based on the vector programming paradigm, the implementation of kernel functions is divided into three basic tasks: CopyIn, Compute, and CopyOut. The three functions are called in the Process function as follows:
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    __aicore__ inline void Process()
    {
        // loop count need to be doubled, due to double buffer
        int32_t loopCount = this->tileNum * BUFFER_NUM;
        // tiling strategy, pipeline parallel
        for (int32_t i = 0; i < loopCount; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }
    
    1. Implement the CopyIn function.
      1. Call DataCopy to copy data from GlobalTensor to LocalTensor.
      2. Call EnQue to place LocalTensor in the VECIN queue.
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      __aicore__ inline void CopyIn( int32_t progress)
      {
          // alloc tensor from queue memory
          AscendC::LocalTensor<float> xLocal = inQueueX.AllocTensor<float>();
          AscendC::LocalTensor<float> yLocal = inQueueY.AllocTensor<float>();
          // copy progress_th tile from global tensor to local tensor
          AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
          AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);
          // enque input tensors to VECIN queue
          inQueueX.EnQue(xLocal);
          inQueueY.EnQue(yLocal);
      }
      
    2. Implement the Compute function.
      1. Call DeQue to obtain the LocalTensor from the VECIN queue.
      2. Call the Ascend C API Add to perform vector computation.
      3. Call EnQue to place the computation result of LocalTensor in the VECOUT queue.
      4. Call FreeTensor to release LocalTensor that is no longer used.
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      __aicore__ inline void Compute(int32_t progress)
      {
          // deque input tensors from VECIN queue
          AscendC::LocalTensor<float> xLocal = inQueueX.DeQue<float>();
          AscendC::LocalTensor<float> yLocal = inQueueY.DeQue<float>();
          AscendC::LocalTensor<float> zLocal = outQueueZ.AllocTensor<float>();
          // call Add instr for computation
          AscendC::Add(zLocal, xLocal, yLocal, this->tileLength);
          // enque the output tensor to VECOUT queue
          outQueueZ.EnQue<float>(zLocal);
          // free input tensors for reuse
          inQueueX.FreeTensor(xLocal);
          inQueueY.FreeTensor(yLocal);
      }
      
    3. Implement the CopyOut function.
      1. Call DeQue to obtain the LocalTensor from the VECOUT queue.
      2. Call DataCopy to copy data from the LocalTensor to the GlobalTensor.
      3. Call FreeTensor to reclaim the LocalTensor that is no longer used.
      1
      2
      3
      4
      5
      6
      7
      8
      9
       __aicore__ inline void CopyOut(int32_t progress)
      {
          // deque output tensor from VECOUT queue
          AscendC::LocalTensor<float> zLocal = outQueueZ.DeQue<float>();
          // copy progress_th tile from local tensor to global tensor
          AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);
          // free output tensor for reuse
          outQueueZ.FreeTensor(zLocal);
      }
      

Runtime Verification of the Kernel Function

After the kernel function is developed on the kernel, you can compile the kernel function calling program on the host, implementing the function of calling operators from the application program on the host to execute the computation process.

  1. Compile the program framework on the host.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    // Header file to be included in the application on the host
    #include "acl/acl.h"
    // Header file to be included in the kernel
    #include "kernel_operator.h"
    // Kernel function development
    ...
    
    __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, AddCustomTilingData tiling)
    {
        KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_AIV_ONLY);    
        KernelAdd op;
        op.Init(x, y, z, tiling.totalLength, tiling.tileNum);
        op.Process();
    }
    
    // Call the operator using the kernel launch symbol <<<...>>>.
    std::vector<float> kernel_add(std::vector<float> &x, std::vector<float> &y)
    {
    ...
    }
    
    
    // Compare the computation result.
    uint32_t VerifyResult(std::vector<float> &output, std::vector<float> &golden)
    {
        auto printTensor = [](std::vector<float> &tensor, const char *name) {
            constexpr size_t maxPrintSize = 20;
            std::cout << name << ": ";
            std::copy(tensor.begin(), tensor.begin() + std::min(tensor.size(), maxPrintSize),
                std::ostream_iterator<float>(std::cout, " "));
            if (tensor.size() > maxPrintSize) {
                std::cout << "...";
            }
            std::cout << std::endl;
        };
        printTensor(output, "Output");
        printTensor(golden, "Golden");
        if (std::equal(golden.begin(), golden.end(), output.begin())) {
            std::cout << "[Success] Case accuracy is verification passed." << std::endl;
            return 0;
        } else {
            std::cout << "[Failed] Case accuracy is verification failed!" << std::endl;
            return 1;
        }
        return 0;
    }
    
    // Main program for operator verification
    int32_t main(int32_t argc, char *argv[])
    {
        constexpr uint32_t totalLength = 8 * 2048;
        constexpr float valueX = 1.2f;
        constexpr float valueY = 2.3f;
        std::vector<float> x(totalLength, valueX);
        std::vector<float> y(totalLength, valueY);
    
        std::vector<float> output = kernel_add(x, y);
    
        std::vector<float> golden(totalLength, valueX + valueY);
        return VerifyResult(output, golden);
    }
    
  2. Write the code for calling the operator using the kernel launch symbol <<<...>>>.
    Figure 5 Procedure

    For details about how to use the acl APIs in the following example, see "acl API (C&C++)".

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    std::vector<float> kernel_add(std::vector<float> &x, std::vector<float> &y)
    {
        constexpr uint32_t blockDim = 8;
        uint32_t totalLength = x.size();
        size_t totalByteSize = totalLength * sizeof(float);
        int32_t deviceId = 0;
        aclrtStream stream = nullptr;
        AddCustomTilingData tiling = {/*totalLength:*/totalLength, /*tileNum:*/8};
        uint8_t *xHost = reinterpret_cast<uint8_t *>(x.data());
        uint8_t *yHost = reinterpret_cast<uint8_t *>(y.data());
        uint8_t *zHost = nullptr;
        uint8_t *xDevice = nullptr;
        uint8_t *yDevice = nullptr;
        uint8_t *zDevice = nullptr;
    
        // Initialization
        aclInit(nullptr);
        // Allocate runtime resources.
        aclrtSetDevice(deviceId);
        aclrtCreateStream(&stream);
        // Allocate the host memory.
        aclrtMallocHost((void **)(&zHost), totalByteSize);
        // Allocate the device memory.
        aclrtMalloc((void **)&xDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
        aclrtMalloc((void **)&yDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
        aclrtMalloc((void **)&zDevice, totalByteSize, ACL_MEM_MALLOC_HUGE_FIRST);
        // Copy the input data from the host to the device.
        aclrtMemcpy(xDevice, totalByteSize, xHost, totalByteSize, ACL_MEMCPY_HOST_TO_DEVICE);
        aclrtMemcpy(yDevice, totalByteSize, yHost, totalByteSize, ACL_MEMCPY_HOST_TO_DEVICE);
        // Use the kernel launch symbol <<<...>>> to call the kernel function to complete specified operations.
        add_custom<<<blockDim, nullptr, stream>>>(xDevice, yDevice, zDevice, tiling);
        aclrtSynchronizeStream(stream);
        // Copy the computation result from the device to the host.
        aclrtMemcpy(zHost, totalByteSize, zDevice, totalByteSize, ACL_MEMCPY_DEVICE_TO_HOST);
        std::vector<float> z((float *)zHost, (float *)(zHost + totalByteSize));
        // Release allocated resources.
        aclrtFree(xDevice);
        aclrtFree(yDevice);
        aclrtFree(zDevice);
        aclrtFreeHost(zHost);
        // Perform deinitialization.
        aclrtDestroyStream(stream);
        aclrtResetDevice(deviceId);
        aclFinalize();
        return z;
    }
    
  3. The CMake compilation configuration is as follows:
    cmake_minimum_required(VERSION 3.16)
    # find_package(ASC) is a command used in CMake to search for and configure the Ascend C compilation toolchain.
    find_package(ASC REQUIRED)
    # Specify that the project supports the ASC and CXX languages. ASC indicates that the Ascend C programming language can be compiled using the BiSheng compiler.
    project(kernel_samples LANGUAGES ASC CXX)
    
    add_executable(demo
        add_custom.asc
    )
    
    # Set the NPU architecture through the compilation option.
    target_compile_options(demo PRIVATE   
       $<$<COMPILE_LANGUAGE:ASC>:--npu-arch=dav-2201>
    )
  4. The compilation and running procedure is as follows:
    mkdir -p build && cd build; 
    cmake ..;make -j;
    ./demo
    • This sample supports only the following models:
      • Atlas A3 training products / Atlas A3 inference products
      • Atlas A2 training products / Atlas A2 inference products
    • --npu-arch specifies the NPU architecture version. The architecture version number follows dav-. For details about the architecture version number corresponding to each AI processor model, see Table 1.

Follow-Up Guide

If you are not familiar with concepts such as multi-core parallelism and pipeline programming in this guide, refer to Programming Model to learn the basic concepts and then review this guide. If you understand related concepts and can run the sample through, refer to Vector Programming for more details about Ascend C vector programming.