Kernel Function

The kernel function is an entry for implementing an Ascend C operator on the device. Ascend C allows you to use the syntax extension of a C/C++ function to write the running code on the device. You can perform data access and compute operations in the kernel function to implement all functions of the operator. In contrast to a common C++ function that is executed only once, when a kernel function is called, the same kernel function code is executed on multiple cores in parallel, with the same input parameters.

When defining a kernel function, use the function type qualifiers __global__ and __aicore__. Add the variable type qualifier __gm__ to the pointer input parameter variable, indicating that the pointer variable points to a memory address in the global memory. Use the kernel launch symbol <<<...>>> to call the execution kernel function and specify the number of cores for executing the kernel function.

The following is a kernel function example (code snippet) of the Add operator:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
// Implement the kernel function.
__global__ __aicore__ void add_custom(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z)
{
    // Initialize the operator class. The operator class provides methods such as operator initialization and core processing.
    KernelAdd op;
    // Initialization function, which is used to obtain the input and output addresses to be processed by the kernel function and initialize the memory.
    op.Init(x, y, z);
    // Core processing function, which is used to complete core logic such as data movement and computation of operators.
    op.Process();
}

// Call the kernel function.
void add_custom_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y, uint8_t* z)
{
    add_custom<<<blockDim, l2ctrl, stream>>>(x, y, z);
}

Defining and Calling Kernel Functions

Comply with the following rules when defining kernel functions:

  • Use function type qualifiers.

    In addition to defining kernel functions in C/C++ function declaration mode, you need to add additional function type qualifiers, including __global__ and __aicore__, to kernel functions.

    The __global__ function type qualifier is used to identify a kernel function that can be called by <<<...>>>. The __aicore__ function type qualifier is used to specify that the kernel function is executed on the AI Core of the device.

    1
    __global__ __aicore__ void kernel_name(argument list);
    

    Functions used in programming can be classified into three types: kernel functions (executed on the device), host functions, and device functions (except kernel functions). The following figure uses operator development based on the kernel launch project as an example to describe the calling relationships between the three types of functions.

    • Host functions can call host functions of the same type, similar to function calling in general C/C++ programming. Kernel functions can also be called by <<<...>>>.
    • Device functions (except kernel functions) can call device functions of the same type.
    • Kernel functions can call device functions (except kernel functions).
    Figure 1 Calling relationships between kernel functions, host functions, and device functions
  • Use variable type qualifiers.

    The variable type qualifier __gm__ needs to be added to the pointer input parameter variable, indicating that the pointer variable points to a memory address in the Global Memory.

  • Other rules or suggestions
    1. Rule: The kernel function must have the void return type.
    2. Rule: The input parameter must be a pointer or C/C++ built-in data type (primitive data types), for example, half* s0, float* s1, and int32_t c.
    3. Suggestion: To unify the expression, you are advised to use the GM_ADDR macro to modify the input parameter. The definition of the GM_ADDR macro is as follows:
      1
      #define GM_ADDR __gm__ uint8_t*
      

      An example of using GM_ADDR to modify the input parameter is as follows:

      1
      extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
      

      The pointer of the uint8_t type is used as an example. In subsequent use, the pointer needs to be converted into the actual pointer type.

The common function calling mode is as follows:

1
function_name(argument list);

The kernel function uses the kernel launch symbol <<<...>>> to specify the execution configuration of the kernel function.

1
kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);

The kernel launch symbol can be called only when performing compilation on the NPU. It cannot be identified by compilation on the CPU.

The execution configuration is determined by the following parameters:
  • blockDim specifies the number of cores on which a kernel function will be executed. A logical ID, that is, block_idx, is allocated to each core that executes the kernel function, and block_idx can be obtained by calling GetBlockIdx during the kernel function implementation.

    blockDim is a concept about logical cores, and its value range is [1, 65535]. To fully utilize hardware resources, set this parameter to the number of physical cores or a multiple of the number of physical cores.

    • In coupling mode and separation mode, the meaning and setting rules of blockDim at runtime are different. The details are as follows:
      • Coupling mode: Because the Vector and Cube Units are integrated, blockDim is used to start multiple AI Core instances, without distinguishing between these units. The number of AI Cores can be obtained by calling GetCoreNumAiv or GetCoreNumAic.
      • Separation mode
        • For operators that contain only Vector Units, blockDim is used to set the number of vector (AIV) instances to be started. For example, if an AI processor has 40 vector cores, set blockDim to 40.
        • For operators that contain only Cube Units, blockDim is used to set the number of cube (AIC) instances to be started. For example, if an AI processor has 20 Cube cores, set blockDim to 20.
        • Operators for Vector/Cube fusion computing are started by groups of AIVs and AICs. blockDim is used to set the number of groups to be started. For example, if an AI processor has 40 Vector cores and 20 Cube cores, a group consists of two Vector cores and one Cube core. Set the number of groups to 20. In this case, 20 groups are started, including 40 Vector cores and 20 Cube cores. Note: In this scenario, the number of blockDim (logical cores) cannot exceed the number of physical cores (a physical core contains two Vector cores and one Cube core).
        • The number of AIC and AIV cores can be obtained by calling GetCoreNumAic and GetCoreNumAiv, respectively.
    • If the device resource limit feature is used, the value of blockDim set for the operator cannot exceed the number of cores returned by an API (like GetCoreNum, GetCoreNumAic, or GetCoreNumAiv) from PlatformAscendC. For example, if aclrtSetStreamResLimit is used to set the number of stream-level vector cores to 8, the return value of GetCoreNumAiv is 8. The value of blockDim set for the vector operator cannot exceed 8. Otherwise, resources of other streams are preempted, causing the resource limit to become invalid.
  • l2ctrl is a reserved parameter and can be ignored. This parameter is fixed at nullptr.
  • stream is of the aclrtStream type. Streams preserve the order of some asynchronous operations being executed on the device. For details about management APIs for stream creation and other operations, see "Stream Management".

The following kernel function add_custom is used to add two vectors. The following is a calling sample:

1
2
//If blockDim is set to 8, the add_custom kernel function is called on 8 cores. Each core independently executes the kernel function in parallel. The parameter lists of the kernel function are x, y, and z.
add_custom<<<8, nullptr, stream>>>(x, y, z);

The kernel function is called asynchronously. After the kernel function is called, the control right is returned to the host immediately. You can call aclrtSynchronizeStream to force the host program to wait until all kernel functions are executed.

1
aclError aclrtSynchronizeStream(aclrtStream stream);

Defining and Calling the Template Kernel Function

You can use a template to define a kernel function. The following is an example of defining a kernel function, which has two template parameters: a and T. a is a non-type template parameter, while T is a type template parameter.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
template<int a, typename T>
__global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
...
    AscendC::printf("Print Template a: %d\n", a);
...
    xGm.SetGlobalBuffer((__gm__T*)x + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    yGm.SetGlobalBuffer((__gm__T*)y + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    zGm.SetGlobalBuffer((__gm__T*)z + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
...
}

The template kernel function is called as follows: add_custom<20, float>. This part of code calls the kernel function named add_custom and provides specific values for its template parameters.

1
add_custom<20, float><<<blockDim, nullptr, stream>>>(x, y, z);