Kernel Function
The kernel function is the entry point for implementing an Ascend C operator on the device. Ascend C extends C/C++ function syntax with kernel functions so that you can manage the code that runs on the device. You perform data access and compute operations in the kernel function to implement all of the operator's functionality. In contrast to a common C++ function, which is executed once per call, a single call to a kernel function executes the same kernel code on multiple cores in parallel, with the same arguments.
When defining a kernel function, use the function type qualifiers __global__ and __aicore__. Add the variable type qualifier __gm__ to pointer input parameters to indicate that each pointer refers to a memory address in global memory. Use the kernel launch symbol <<<...>>> to call the kernel function and to specify the number of cores that execute it.
The following is an example of the kernel function of the Add operator. For details, see the Add operator examples in Vector Programming.
// Implement the kernel function.
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    // Initialize the operator class. The operator class provides methods such as operator initialization and core processing.
    KernelAdd op;
    // Initialization function, which obtains the input and output addresses to be processed by the kernel function and initializes the memory.
    op.Init(x, y, z);
    // Core processing function, which completes core operator logic such as data movement and computation.
    op.Process();
}

// Call the kernel function.
void add_custom_do(uint32_t blockDim, void* l2ctrl, void* stream,
                   uint8_t* x, uint8_t* y, uint8_t* z)
{
    add_custom<<<blockDim, l2ctrl, stream>>>(x, y, z);
}
Defining and Calling Kernel Functions
Comply with the following rules when defining kernel functions:
- Use function type qualifiers.
In addition to the standard C/C++ function declaration, kernel functions require the function type qualifiers __global__ and __aicore__.
The __global__ function type qualifier is used to identify a kernel function that can be called by <<<>>>. The __aicore__ function type qualifier is used to specify that the kernel function is executed on the AI Core of the device.
__global__ __aicore__ void kernel_name(argument list);
Functions used in programming can be classified into three types: kernel functions (executed on the device), host functions, and device functions (except kernel functions). The following figure shows the calling relationship between the three types of functions.
- Host functions can call other host functions, as in general C/C++ programming. They can also call kernel functions by using <<<>>>.
- Device functions (except kernel functions) can call device functions of the same type.
- Kernel functions can call device functions (except kernel functions).
Figure 1 Calling relationships between kernel functions (executed on the device), host functions, and device functions (except kernel functions)
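The calling relationships above can be sketched in ordinary C++. This is only an illustrative sketch, not real device code: the Ascend C qualifiers are stubbed out as empty macros (an assumption made purely so the code compiles with a host compiler), and the <<<>>> launch is modeled as a plain function call.

```cpp
#include <cassert>

// Assumption: qualifiers stubbed out so the sketch compiles on a host compiler.
#define __global__
#define __aicore__
#define __gm__

// Device function (not a kernel): callable from other device code.
__aicore__ int device_add(int a, int b) { return a + b; }

// Kernel function: device-side entry point; it may call device functions.
__global__ __aicore__ void my_kernel(__gm__ int* out) {
    *out = device_add(2, 3);   // kernel -> device function
}

// Host function: on a real device it would launch the kernel with
// my_kernel<<<blockDim, l2ctrl, stream>>>(out); here a direct call stands in.
int host_launch() {
    int result = 0;
    my_kernel(&result);        // host -> kernel (stand-in for <<<>>>)
    return result;
}
```

The sketch mirrors the three rules: host code triggers the kernel, and the kernel delegates to a device function.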
- Use variable type qualifiers.
The variable type qualifier __gm__ needs to be added to the pointer input parameter variable, indicating that the pointer variable points to a memory address in the global memory.
- Other rules or suggestions
- Rule: The kernel function must have the void return type.
- Rule: The input parameter must be a pointer or C/C++ built-in data type (primitive data types), for example, half* s0, float* s1, and int32_t c.
- Suggestion: To unify the expression, you are advised to use the GM_ADDR macro to modify the input parameter. The definition of the GM_ADDR macro is as follows:
#define GM_ADDR __gm__ uint8_t*

An example of using GM_ADDR to modify the input parameter is as follows:
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
A uint8_t pointer is used here as an example; before use, it must be cast to the actual pointer type.
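A minimal sketch of that conversion, assuming __gm__ is stubbed out as an empty macro so the code compiles with a host compiler (on the device the pointer would refer to global memory; here an ordinary array stands in for it):

```cpp
#include <cassert>
#include <cstdint>

// Assumption: __gm__ stubbed out so the sketch compiles on a host compiler.
#define __gm__
#define GM_ADDR __gm__ uint8_t*

// The uint8_t* parameter is cast to the element type the kernel
// actually operates on before the data is accessed.
float first_element(GM_ADDR x) {
    __gm__ float* xFloat = reinterpret_cast<__gm__ float*>(x);
    return xFloat[0];
}
```

For example, passing a float buffer reinterpreted as uint8_t* and reading it back as float returns the original first element.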
The call statement of the kernel function is an extension of the C/C++ function call statement. This section describes only the basic calling modes. During real-world operator development, you can choose more calling modes. For details, see Kernel Launch Operator Development and Project-based Operator Development.
The common function calling mode is as follows:
function_name(argument list);
The kernel function uses the kernel launch symbol <<<...>>> to specify the execution configuration of the kernel function.
kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);
The kernel launch symbol is recognized only when compiling for the NPU; it is not recognized when compiling for the CPU.
- blockDim specifies the number of cores on which a kernel function will be executed. A logical ID, that is, block_idx, is allocated to each core that executes the kernel function, and block_idx can be obtained by calling GetBlockIdx during the kernel function implementation.
blockDim is a concept about logical cores, and its value range is [1, 65535]. To fully utilize hardware resources, set this parameter to the number of physical cores or a multiple of the number of physical cores. For the coupled architecture and separated architecture, the meaning and setting rules of blockDim during running are different. The details are as follows:
- Coupled architecture: Because the Vector and Cube Units are integrated, blockDim is used to start multiple AI Core instances, without differentiating Vector Units and Cube Units. The number of AI Cores can be obtained by calling GetCoreNumAiv or GetCoreNumAic.
- Separated architecture
- For operators that contain only Vector Units, blockDim is used to set the number of vector (AIV) instances to be started. For example, if an AI processor has 40 Vector cores, set blockDim to 40.
- For operators that contain only Cube Units, blockDim is used to set the number of cube (AIC) instances to be started. For example, if an AI processor has 20 Cube cores, set blockDim to 20.
- Operators for Vector/Cube fusion computing are started in groups of AIVs and AICs, and blockDim sets the number of groups to start. For example, if an AI processor has 40 Vector cores and 20 Cube cores, a group consists of two Vector cores and one Cube core, so set the number of groups to 20. Starting 20 groups then uses all 40 Vector cores and 20 Cube cores. Note: In this scenario, blockDim (the number of logical cores) cannot exceed the number of physical cores (each physical core contains two Vector cores and one Cube core).
- The number of AIC and AIV cores can be obtained by calling GetCoreNumAic and GetCoreNumAiv, respectively.
- l2ctrl is a reserved parameter and can be ignored. This parameter is fixed at nullptr.
- stream is of the aclrtStream type. Streams preserve the order of some asynchronous operations being executed on the device. For details about management APIs for stream creation and other operations, see Stream Management.
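The way each core derives its slice of the data from blockDim and its logical ID can be sketched as plain arithmetic. GetBlockIdx is modeled by an ordinary parameter here, and the sketch assumes the total length divides evenly by blockDim, as in the Add sample:

```cpp
#include <cassert>
#include <cstdint>

// Each of the blockDim cores processes totalLength / blockDim elements,
// starting at an offset determined by its logical core ID (block_idx).
// Assumption (for the sketch): totalLength is a multiple of blockDim.
uint32_t BlockOffset(uint32_t totalLength, uint32_t blockDim, uint32_t blockIdx) {
    uint32_t blockLength = totalLength / blockDim;  // elements per core
    return blockIdx * blockLength;                  // start of this core's slice
}
```

With a total length of 2048 elements and blockDim set to 8, each core handles 256 elements and core 3 starts at offset 768.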
The kernel function add_custom below adds two vectors. A calling sample is as follows:
// If blockDim is set to 8, the add_custom kernel function is called on 8 cores. Each core executes the kernel function independently in parallel. The argument list of the kernel function is x, y, and z.
add_custom<<<8, nullptr, stream>>>(x, y, z);
The kernel function is called asynchronously: after the call, control returns to the host immediately. You can call aclrtSynchronizeStream to force the host program to wait until all kernel functions have finished executing.
aclError aclrtSynchronizeStream(aclrtStream stream);
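The launch-then-synchronize pattern is conceptually similar to standard C++ futures: the launch returns immediately, and a later blocking call waits for completion. The sketch below is an analogy only, not the Ascend C API:

```cpp
#include <future>

// Analogy only: std::async stands in for the asynchronous kernel launch,
// and future::get stands in for aclrtSynchronizeStream.
int launch_and_sync() {
    // "Launch": control returns to the caller immediately.
    std::future<int> f = std::async(std::launch::async, [] {
        return 2 + 3;   // stand-in for the device-side work
    });
    // ... the host could do other work here ...
    // "Synchronize": block until the asynchronous work has finished.
    return f.get();
}
```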
Defining and Calling the Template Kernel Function
You can use a template to define a kernel function. The following is an example of defining a kernel function, which has two template parameters: a and T. a is a non-type template parameter, while T is a type template parameter.
template<int a, typename T>
__global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    ...
    AscendC::printf("Print Template a: %d\n", a);
    ...
    xGm.SetGlobalBuffer((__gm__ T*)x + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    yGm.SetGlobalBuffer((__gm__ T*)y + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    zGm.SetGlobalBuffer((__gm__ T*)z + BLOCK_LENGTH * AscendC::GetBlockIdx(), BLOCK_LENGTH);
    ...
}
The template kernel function is called as follows. The expression add_custom<20, float> refers to the kernel function add_custom with concrete values supplied for its template parameters.
add_custom<20, float><<<blockDim, nullptr, stream>>>(x, y, z);
The template kernel function has the following restrictions:
- Only calling by <<<>>> is supported.
- User-defined data types are not supported.
The following is a negative example. The hello_world kernel function attempts to use the user-defined data type Person, which is not supported.
struct Person {
    int age;
};

template<int a, typename T>
__global__ __aicore__ void hello_world()
{
    AscendC::printf("Hello World!!!\n");
    AscendC::printf("template %d\n", a);
    T x = {30};
    AscendC::printf("template T %d\n", x.age);
}

void hello_world_do(uint32_t blockDim, void *stream)
{
    Person person = {30};
    hello_world<300, Person><<<blockDim, nullptr, stream>>>(person);
}
struct Person { int age; }; template<int a, typename T> __global__ __aicore__ void hello_world() { AscendC::printf("Hello World!!!\n"); AscendC::printf("template %d\n", a); T x = {30}; AscendC::printf("template T %d\n", x.age); } void hello_world_do(uint32_t blockDim, void *stream) { Person person = {30}; hello_world<300, Person><<<blockDim, nullptr, stream>>>(person); }