Basic Vector Operators

The following figure shows the process for implementing basic vector operator kernel functions based on Ascend C.

Figure 1 Implementation of vector operator kernel functions
  • Analyze the operator: Analyze the mathematical expression, input, output, and computation logic implementation of the operator, and specify the Ascend C APIs to be called.
  • Define the kernel function: Define the entrypoint function of the Ascend C operator.
  • Implement the operator class according to the vector programming paradigm: Complete the internal implementation of kernel functions, including three basic tasks—CopyIn, Compute, and CopyOut.

The following uses the Add operator that runs on a single core and completes computation at a time, with the input data type being half and the last dimension of the shape being 32-byte aligned, as an example to detail the preceding steps. For details about the complete code of the operator described in this sample, see the basic Add operator sample.

Operator Analysis

The steps are as follows:

  1. Specify the mathematical expression and computation logic of an operator.

    The Add operator is used as an example.

    z = x + y

    The computation logic is as follows: The operation elements of the vector computation API provided by Ascend C are all LocalTensor. The input data needs to be moved from the external storage (Global Memory) to the on-chip storage (Global Memory), and then the computation API is called to add the two input parameters to obtain the final result. Then, the final result is moved to the external storage. The following figure shows the computation logic of the Ascend C Add operator.

    Figure 2 Operator computation logic
  2. Specify the input and output.
    • The Add operator has two inputs, x and y, and outputs the result z.
    • In this sample, the supported input data type is half (float16), and so is the output data type.
    • The supported input shape is (1, 2048), and so is the output shape.
    • The supported input format is ND.
  3. Define the kernel function name and parameters.
    • You can customize the kernel function name. In this sample, the kernel function is named add_custom.
    • Based on the analysis of the operator input and output, the kernel function has three parameters: x, y, and z. x and y indicate the memory address of the input in the global memory, and z indicates the memory address of the output in the global memory.
  4. Specify the APIs required for operator implementation.
    • Move data between the local memory and global memory. For details about the data movement API, see the Ascend C API reference. Call DataCopy to move data.
    • This sample only involves the addition operation in vector computation. For details, see the vector computation API in the Ascend C API reference. Based on the preliminary analysis, the basic arithmetic Add API Add can be used to implement x+y.
    • The tensor data structure used in computation is managed by the queue, involving calling APIs such as EnQue and DeQue.

Based on the preceding analysis, the design specifications of the Ascend C Add operator are as follows.

Table 1 Design specifications of the Ascend C Add operator

OpType

Add

Operator Input and Output

name

shape

data type

format

x (input)

(1, 2048)

half

ND

y (input)

(1, 2048)

half

ND

z (output)

(1, 2048)

half

ND

Kernel Function Name

add_custom

Main APIs

DataCopy: data movement API

Add: vector basic arithmetic API

EnQue, DeQue, and others: queue management APIs

Operator Implementation File

add_custom.cpp

Kernel Function Definition

Define kernel functions based on the rules described in Kernel Function.

  1. Define the function prototype.

    In this sample, the function name is add_custom (the kernel function name can be customized). Based on the analysis of the operator input and output in Operator Analysis, there are three parameters x, y, and z, where x and y indicate the input memory and z indicates the output memory. According to the rules described in Kernel Function, the function prototype is defined as follows: The __global__ function type qualifier is used to identify a kernel function that can be called by <<<>>>. The __aicore__ function type qualifier is used to identify that the kernel function is executed on the AI Core on the device. For convenience, the GM_ADDR macro is used to modify input parameters. For details about the definition of the GM_ADDR macro, see Kernel Function.

    1
    2
    3
    extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
    {
    }
    
  2. Call the Init and Process functions of the operator class.
    The Init function of the operator class completes memory initialization, and the Process function completes the core logic of operator implementation. For details, see Operator Class Implementation.
    1
    2
    3
    4
    5
    6
    extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
    {
        KernelAdd op;
        op.Init(x, y, z);
        op.Process();
    }
    
  3. Encapsulate the calling of the kernel function to obtain the add_custom_do function for the main program to call. #ifndef ASCENDC_CPU_DEBUG indicates that the encapsulated function is used only when operators on the NPU are built and run. When operators on the CPU are built and run, the add_custom function can be directly called. According to Defining and Calling Kernel Functions, when the kernel function is called, in addition to the x, y, and z parameters, blockDim (number of cores executed by the kernel function), l2ctrl (reserved parameter, set to nullptr), and stream (stream for maintaining the execution sequence of asynchronous operations in the application) also need to be input, to specify the execution configuration of the kernel function.
    1
    2
    3
    4
    5
    6
    7
    #ifndef ASCENDC_CPU_DEBUG
    // call of kernel function
    void add_custom_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y, uint8_t* z)
    {
        add_custom<<<blockDim, l2ctrl, stream>>>(x, y, z);
    }
    #endif
    

Operator Class Implementation

The kernel function calls the Init and Process functions of the operator class. This section describes how to implement the operator class based on the programming paradigm.

The following describes how to design the implementation process of the Add operator based on the vector programming paradigm. For details about the vector programming paradigm, see Vector Programming Paradigm. For details about the implementation process of the Add operator, see Figure 3 Add operator implementation proces.

  • The implementation process of the Add operator consists of three basic tasks: CopyIn, Compute, and CopyOut. The CopyIn task moves the input tensors xGm and yGm from the global memory to the local memory and stores them in xLocal and yLocal, respectively. The Compute task performs the addition operation on xLocal and yLocal, and stores the computation result in zLocal. The CopyOut task moves the output data from zLocal to the output tensor zGm in the global memory.
  • CopyIn and Compute tasks synchronize with each other through the VECIN queues inQueueX and inQueueY. Compute and CopyOut tasks synchronize with each other through the VECOUT queue outQueueZ.
  • The memory used for interaction between tasks and the memory used by temporary variables are managed by TPipe.
Figure 3 Add operator implementation process

The process is mainly implemented by the operator class, including the initialization function Init and core processing function Process. The Process function calls the three basic tasks. Some private members used in operator implementation are also included, such as the GlobalTensor (xGm, yGm, and zGm) and VECIN and VECOUT queues. Members of the KernelAdd operator class are as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    // Initialization function, which is used to initialize memory.
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z){}
    // Core processing function, which implements the operator logic and calls the private member functions CopyIn, Compute, and CopyOut to complete the three-level pipeline operation of the vector operator.
    __aicore__ inline void Process(){}

private:
    // Copy-in function, which is called by the core Process function to complete the processing in the CopyIn phase.
    __aicore__ inline void CopyIn(){}
    // Compute function, which is called by the core Process function to complete the processing in the Compute phase.
    __aicore__ inline void Compute(){}
    // Copy-out function, which is called by the core Process function to complete the processing in the CopyOut phase.
    __aicore__ inline void CopyOut(){}

private:
    AscendC::TPipe pipe;  // Pipe memory management object.
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;  // Queue management object of the input data. The value of TPosition is VECIN.
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueY;  // Queue management object of the input data. The value of TPosition is VECIN.
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueZ;  // Queue management object of the output data. The value of TPosition is VECOUT.
    AscendC::GlobalTensor<half> xGm;  // Object for managing the global memory addresses of the input and output data. xGm and yGm are input, and zGm is output.
    AscendC::GlobalTensor<half> yGm;
    AscendC::GlobalTensor<half> zGm;
};

The initialization function performs the following operations:

  • Set the global memory address of the input and output global tensors.

    The allocation scheme in this sample is as follows: The total data length TOTAL_LENGTH is 1 × 2048. SetGlobalBuffer of the GlobalTensor class is called to set the start address and length of the global memory of the core.

    1
    xGm.SetGlobalBuffer((__gm__ half *)x, TOTAL_LENGTH);
    
  • Allocate memory for the input and output queues through TPipe.

    For example, allocating memory for the queue of input x can be implemented by using the following code snippet:

    1
    pipe.InitBuffer(inQueueX, 1, TOTAL_LENGTH * sizeof(half))
    

The initialization function code is as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
constexpr int32_t TOTAL_LENGTH = 1 * 2048;  // Total data length.
__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    // Set the start address and length of the global memory.
    xGm.SetGlobalBuffer((__gm__ half *)x, TOTAL_LENGTH);
    yGm.SetGlobalBuffer((__gm__ half *)y, TOTAL_LENGTH);
    zGm.SetGlobalBuffer((__gm__ half *)z, TOTAL_LENGTH);

    // The pipe memory management object is used to allocate memory for the input and output queues.
    pipe.InitBuffer(inQueueX, 1, TOTAL_LENGTH * sizeof(half));
    pipe.InitBuffer(inQueueY, 1, TOTAL_LENGTH * sizeof(half));
    pipe.InitBuffer(outQueueZ, 1, TOTAL_LENGTH * sizeof(half));
}

Based on the vector programming paradigm, the implementation of kernel functions is divided into three basic tasks: CopyIn, Compute, and CopyOut. The three functions are called in the Process function as follows:

1
2
3
4
5
6
__aicore__ inline void Process()
{
    CopyIn();
    Compute();
    CopyOut();
}

Based on the algorithm analysis of the programming paradigm, the entire computation is divided into three stages. You need to compile the code of each stage separately. For details about the process, see Figure 3.

  1. Implement the CopyIn function (stage 1).
    1. Call DataCopy to copy data from GlobalTensor to LocalTensor.
    2. Call EnQue to place LocalTensor in the VECIN queue.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    __aicore__ inline void CopyIn()
    {
        // Allocate memory for LocalTensor from the queue.
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        // Copy data from GlobalTensor to LocalTensor.
        AscendC::DataCopy(xLocal, xGm, TOTAL_LENGTH);
        AscendC::DataCopy(yLocal, yGm, TOTAL_LENGTH);
        // Place LocalTensor in the VECIN queue.
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
    
  2. Implement the Compute function (stage 2).
    1. Call DeQue to obtain LocalTensor from VECIN.
    2. Call the Ascend C API Add to perform vector computation.
    3. Call EnQue to place the computation result of LocalTensor in the VECOUT queue.
    4. Call FreeTensor to release LocalTensor that is no longer used.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    __aicore__ inline void Compute()
    {
        // Obtain the input from the VECIN queue.
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        // Call the Add operator to perform computation.
        AscendC::Add(zLocal, xLocal, yLocal, TOTAL_LENGTH);
        // Place the computation result of LocalTensor in the VECOUT queue.
        outQueueZ.EnQue<half>(zLocal);
        // Release LocalTensor.
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    
  3. Implement the CopyOut function (stage 3).
    1. Call DeQue to obtain LocalTensor from the VECOUT queue.
    2. Call DataCopy to copy data from LocalTensor to GlobalTensor.
    3. Call FreeTensor to reclaim LocalTensor that is no longer used.
    1
    2
    3
    4
    5
    6
    7
    8
    9
     __aicore__ inline void CopyOut()
    {
        // Obtain the computation result from the VECOUT queue.
        AscendC::LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        // Copy the computation result from LocalTensor to GlobalTensor.
        AscendC::DataCopy(zGm, zLocal, TOTAL_LENGTH);
        // Release LocalTensor.
        outQueueZ.FreeTensor(zLocal);
    }