Kernel Launch
A kernel launch operator project is provided to help you quickly launch operator kernels and to facilitate debugging and optimization. You can develop operators based on the sample code and project framework of the operator project. The operator project provides the following functions:
- Supports printf and DumpTensor.
- Supports applications generated by compiling this project; you can run the msProf command on them to collect and parse profiling data. For details about how to use the profiling tool, see the Performance Tuning Tool User Guide.
Obtain the operator samples through the following links:
- vector operator sample
- sample of vector operators that support tiling
- cube operator sample
- vector+cube fusion operator sample
The following figure shows the operator development process based on the kernel launch project.

The following uses the Add vector operator as an example to describe the operator project in detail.
Environment Setup
- Before using the kernel launch operator project, install the driver, firmware, and CANN software by referring to Environment Setup to set up the development and operating environments.
- This operator project requires CMake 3.16 or later. If your version does not meet the requirement, run the following commands (as an example) to upgrade CMake to 3.16.0:
```shell
wget https://cmake.org/files/v3.16/cmake-3.16.0.tar.gz --no-check-certificate
tar -zxvf cmake-3.16.0.tar.gz
cd cmake-3.16.0
./bootstrap --prefix=/usr
sudo make
sudo make install
```
Project Directory
Click vector operator sample to obtain a complete sample of kernel function development and runtime verification. Structure of the sample directory:
```
AddKernelInvocationNeo
|-- cmake                // CMake build files
|-- scripts
|   ├── gen_data.py      // Script for generating the input data and truth value
|   ├── verify_result.py // Script for checking whether the output data is consistent with the truth value
|-- CMakeLists.txt       // CMake build configuration file
|-- add_custom.cpp       // Kernel implementation of the vector operator
|-- data_utils.h         // Data read and write functions
|-- main.cpp             // Main function for calling the operator on the CPU or NPU
|-- run.sh               // Script for building and running the operator
```
Operator development procedure based on the operator project:
- Complete kernel implementation of the operator.
- Compile the application (main.cpp) that calls the operator.
- Modify the gen_data.py script, which generates the input data and truth value data, as required. Modify the verify_result.py script, which verifies whether the output data is consistent with the truth value data.
- Modify the run.sh script as required, then execute it to build and run the operator and verify the result.
Operator Implementation on the Kernel
Write the Ascend C operator implementation file by referring to Vector Programming and to the kernel implementations of the cube and fusion operators in the project directory.
Operator Calling Program
The following code uses the add_custom operator with a static shape as an example to describe how to write the main.cpp file. When implementing your own application, pay attention to the changes required by different operator kernel functions, such as different kernel function names and input/output parameters. Allocate memory, copy memory, and read/write files appropriately; the calling methods of the related APIs can be reused directly.
- Include the header files as required. The ASCENDC_CPU_DEBUG macro distinguishes the header files to be included for the CPU and NPU. Note that, on the NPU side, you must include the header file aclrtlaunch_{kernel_name}.h (automatically generated by the project framework), which declares the corresponding kernel launch API. kernel_name indicates the name of the operator kernel function.
```cpp
#include "data_utils.h"
#ifndef ASCENDC_CPU_DEBUG
#include "acl/acl.h"
#include "aclrtlaunch_add_custom.h"
#else
#include "tikicpulib.h"
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z);
#endif
```
- Write the application framework. The application uses the ASCENDC_CPU_DEBUG macro to determine whether the code runs on the CPU or NPU.
```cpp
int32_t main(int32_t argc, char* argv[])
{
    uint32_t blockDim = 8;
    size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);
    size_t outputByteSize = 8 * 2048 * sizeof(uint16_t);
#ifdef ASCENDC_CPU_DEBUG
    // Program for calling CPU debugging APIs
#else
    // Program for calling the operator on the NPU
#endif
    return 0;
}
```
- Verify the running on the CPU. To implement runtime verification of the operator kernel function on the CPU, perform the following steps.
Figure 1 Runtime verification on the CPU
```cpp
// Call GmAlloc to allocate shared memory and initialize the data.
uint8_t* x = (uint8_t*)AscendC::GmAlloc(inputByteSize);
uint8_t* y = (uint8_t*)AscendC::GmAlloc(inputByteSize);
uint8_t* z = (uint8_t*)AscendC::GmAlloc(outputByteSize);
ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);
ReadFile("./input/input_y.bin", inputByteSize, y, inputByteSize);
// Set the kernel mode to AIV for vector operators.
AscendC::SetKernelMode(KernelMode::AIV_MODE);
// Call the ICPU_RUN_KF debugging macro to run the kernel function on the CPU.
ICPU_RUN_KF(add_custom, blockDim, x, y, z);
// Write the output data.
WriteFile("./output/output_z.bin", z, outputByteSize);
// Call GmFree to release the allocated resources.
AscendC::GmFree((void *)x);
AscendC::GmFree((void *)y);
AscendC::GmFree((void *)z);
```
- Verify the running on the NPU. To implement runtime verification of the operator kernel function on the NPU, perform the following steps.
Figure 2 Runtime verification on the NPU
```cpp
// Initialize AscendCL.
CHECK_ACL(aclInit(nullptr));
// Allocate the runtime resources.
int32_t deviceId = 0;
CHECK_ACL(aclrtSetDevice(deviceId));
aclrtStream stream = nullptr;
CHECK_ACL(aclrtCreateStream(&stream));
// Allocate the host buffers.
uint8_t *xHost, *yHost, *zHost;
uint8_t *xDevice, *yDevice, *zDevice;
CHECK_ACL(aclrtMallocHost((void**)(&xHost), inputByteSize));
CHECK_ACL(aclrtMallocHost((void**)(&yHost), inputByteSize));
CHECK_ACL(aclrtMallocHost((void**)(&zHost), outputByteSize));
// Allocate the device buffers.
CHECK_ACL(aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
CHECK_ACL(aclrtMalloc((void**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
CHECK_ACL(aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
// Initialize the host buffers.
ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
ReadFile("./input/input_y.bin", inputByteSize, yHost, inputByteSize);
// Copy the input data from the host to the device.
CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, yHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
// Use ACLRT_LAUNCH_KERNEL to call the kernel function to complete the specified operation.
ACLRT_LAUNCH_KERNEL(add_custom)(blockDim, stream, xDevice, yDevice, zDevice);
// Alternatively, use the kernel launch symbol <<<>>> to call the kernel function.
// The <<<>>> call is encapsulated in add_custom_do.
// add_custom_do(blockDim, nullptr, stream, xDevice, yDevice, zDevice);
CHECK_ACL(aclrtSynchronizeStream(stream));
// Copy the computation result from the device to the host.
CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST));
WriteFile("./output/output_z.bin", zHost, outputByteSize);
// Release the allocated resources.
CHECK_ACL(aclrtFree(xDevice));
CHECK_ACL(aclrtFree(yDevice));
CHECK_ACL(aclrtFree(zDevice));
CHECK_ACL(aclrtFreeHost(xHost));
CHECK_ACL(aclrtFreeHost(yHost));
CHECK_ACL(aclrtFreeHost(zHost));
// Deinitialize AscendCL.
CHECK_ACL(aclrtDestroyStream(stream));
CHECK_ACL(aclrtResetDevice(deviceId));
CHECK_ACL(aclFinalize());
```
Compiling the CMake Build Configuration File
This section describes some key environment variables and CMake command parameters in the CMake files. Generally, you do not need to modify them; understanding them, however, helps you grasp the compilation process and customize the CMake configuration.
| Environment Variable | Configuration Description |
|---|---|
| SOC_VERSION | AI processor model. |
| ASCEND_CANN_PACKAGE_PATH | Actual installation path of CANN. |
| CMAKE_BUILD_TYPE | Compilation mode, such as Debug or Release. |
| CMAKE_INSTALL_PREFIX | Prefix of the installation path used when CMake executes the install command. The compilation results (the target specified in ascendc_library and the corresponding header files) are installed in this path. Defaults to the out directory under the current directory. |
| CMAKE_CXX_COMPILER_LAUNCHER | Configures the launcher of the C++ compiler (such as g++) and the BiSheng compiler as ccache, which caches compilation results to accelerate repeated builds. Configure it in CMakeLists.txt as `set(CMAKE_CXX_COMPILER_LAUNCHER <launcher_program>)`, where `<launcher_program>` is the ccache installation path, for example `set(CMAKE_CXX_COMPILER_LAUNCHER /usr/bin/ccache)`. |
| CMake Command | Description |
|---|---|
| add_executable | Adds an executable to the project, built from the specified source files. Usage is the same as for the standard CMake command. |
| ascendc_library | Adds a library to the project, built from the specified kernel function source files. Syntax: `ascendc_library(<target_name> [STATIC \| SHARED] [<source>...])`. `<target_name>` is the name of the library file, which is created from the source files listed in the command. STATIC and SHARED specify the library type: a STATIC library is an archive of object files used when linking other targets; a SHARED library is a dynamic link library loaded at run time. `<source>` indicates the kernel function source files. |
| ascendc_compile_definitions | Adds compilation macros, either custom macros or the compilation macros provided by Ascend C. Syntax: `ascendc_compile_definitions(<target_name> [PRIVATE] [<xxx>...])`. |
| ascendc_compile_options | Adds compilation options for host and device compilation. Syntax: `ascendc_compile_options(ascendc_kernels_${RUN_MODE} PRIVATE [<xxx>...])`. By default, the specified options are passed to the device compiler. To pass options to the host compiler, add -forward-options-to-host-compiler, for example: `ascendc_compile_options(ascendc_kernels_${RUN_MODE} PRIVATE -g -forward-options-to-host-compiler -gdwarf-4)`. In this example, -g is passed to the device compiler and -gdwarf-4 to the host compiler. Note: the host compiler accepts only options supported by both the G++ and Clang compilers. |
| ascendc_include_directories | Adds custom header file search paths. Syntax: `ascendc_include_directories(<target_name> [PRIVATE] [<xxx>...])`. |
The following figure shows the simplified compilation process. The operator kernel function source files are compiled into a kernel-side library file (*.so or *.a). The project framework automatically generates the header file declaring the kernel launch API. When main.cpp (the operator calling program) is compiled, it includes this header file, and the resulting object file is linked against the kernel-side library to produce the final executable.
After the compilation and installation are finished, the compilation results are generated in the CMAKE_INSTALL_PREFIX directory. The final executable file is generated in the directory where CMake executes the command.
```
out
├── lib
│   ├── libkernels1.a
│   ├── libkernels2.so
├── include
│   ├── kernels1
│   │   ├── aclrtlaunch_matmul_custom.h
│   │   ├── aclrtlaunch_add_custom.h
│   ├── kernels2
│   │   ├── aclrtlaunch_xxx.h
│   │   ├── ...
```
You can use the msobjdump tool to parse the library file generated in the lib directory to obtain the kernel information. For details, see msobjdump.
Script for Generating and Verifying the Input Data and Truth Value
The following uses the add_custom operator with a static shape as an example to describe how to generate the input data and truth value. Write the script based on the input and output of the operator.
```python
#!/usr/bin/python3
# -*- coding:utf-8 -*-
# Copyright 2022-2023 Huawei Technologies Co., Ltd
import numpy as np


def gen_golden_data_simple():
    input_x = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
    input_y = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
    golden = (input_x + input_y).astype(np.float16)
    input_x.tofile("./input/input_x.bin")
    input_y.tofile("./input/input_y.bin")
    golden.tofile("./output/golden.bin")


if __name__ == "__main__":
    gen_golden_data_simple()
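A note on the data format (a minimal, illustrative sketch, not part of the sample project): NumPy's tofile writes raw bytes with no dtype or shape header, which is why the generation script and the verification script must agree on the dtype (here np.float16) and element count. The round trip can be checked as follows:

```python
import os
import tempfile

import numpy as np

# tofile() writes raw little-endian bytes with no header, so the reader
# must supply the matching dtype (and reshape if needed) when loading.
data = np.arange(8, dtype=np.float16).reshape(2, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input_x.bin")
    data.tofile(path)
    # 8 elements * 2 bytes per float16 = 16 bytes on disk.
    assert os.path.getsize(path) == data.size * data.itemsize
    loaded = np.fromfile(path, dtype=np.float16).reshape(2, 4)

assert np.array_equal(loaded, data)
print("round trip ok")
```

Reading with a mismatched dtype (for example np.float32) would silently reinterpret the bytes and halve the element count, so keep the dtype consistent across gen_data.py, the operator, and verify_result.py.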
The following is an example of the script for verifying whether the output data is consistent with the truth value. The NumPy API is used to compute the absolute error and relative error between the output data and the truth value. If the deviation is within the tolerable range, the precision meets the requirement and test pass is output.
```python
import os
import sys
import numpy as np

# Tolerable deviation. For fp16, the absolute error and relative error should be less than 1e-3.
loss = 1e-3
minimum = 10e-10


def verify_result(real_result, golden):
    # Read the actual and expected operation results from the binary files.
    real_result = np.fromfile(real_result, dtype=np.float16)
    golden = np.fromfile(golden, dtype=np.float16)
    # Compute the deviation between the actual result and the expected result.
    result = np.abs(real_result - golden)
    # Element-wise maximum of the absolute values, used as the relative-error denominator.
    deno = np.maximum(np.abs(real_result), np.abs(golden))
    # Check the absolute error.
    result_atol = np.less_equal(result, loss)
    # Check the relative error.
    result_rtol = np.less_equal(result / np.add(deno, minimum), loss)
    if not result_rtol.all() and not result_atol.all():
        if np.sum(result_rtol == False) > real_result.size * loss and \
                np.sum(result_atol == False) > real_result.size * loss:
            print("[ERROR] result error")
            return False
    print("test pass")
    return True


if __name__ == '__main__':
    verify_result(sys.argv[1], sys.argv[2])
```
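The acceptance rule can be isolated for clarity. The sketch below (illustrative names, not part of the sample project) reproduces the logic: an element passes if either its absolute or its relative error is within loss, and the comparison as a whole fails only when the number of elements violating both criteria exceeds loss times the element count:

```python
import numpy as np

def within_tolerance(actual, expected, loss=1e-3, minimum=1e-9):
    # Hypothetical helper mirroring verify_result's acceptance rule.
    actual = np.asarray(actual, dtype=np.float32)
    expected = np.asarray(expected, dtype=np.float32)
    abs_err = np.abs(actual - expected)
    # minimum guards against division by zero when both values are 0.
    rel_err = abs_err / (np.maximum(np.abs(actual), np.abs(expected)) + minimum)
    atol_fail = np.count_nonzero(abs_err > loss)      # elements failing the absolute check
    rtol_fail = np.count_nonzero(rel_err > loss)      # elements failing the relative check
    # Fail only if too many elements violate BOTH tolerance criteria.
    return not (atol_fail > actual.size * loss and rtol_fail > actual.size * loss)

golden = np.linspace(1, 100, 2048)
assert within_tolerance(golden, golden)                # identical data passes
assert within_tolerance(golden * (1 + 5e-4), golden)   # small relative error passes
assert not within_tolerance(golden + 1.0, golden)      # large systematic error fails
```

Note that large values may fail the absolute check while passing the relative check (and vice versa for values near zero), which is why the rule accepts either criterion per element.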
Modifying and Executing the Script for One-Click Compilation and Running
You can refer to the one-click compilation and running script provided in the sample to quickly build and run the Ascend C operator on the CPU or NPU.
The one-click compilation and running script provided in the sample does not apply to all operator runtime verification scenarios. Modify the script based on the actual situation.
- Write your own script for generating the input data and truth value based on the algorithm of the Ascend C operator.
After writing the preceding files, you can run the one-click compilation and running script:
```shell
bash run.sh --run-mode=npu --soc-version=<soc_version> --install-path=<install_path> --build-type=Debug --install-prefix=<install-prefix>
# Equivalent short form:
bash run.sh -r npu -v <soc_version> -i <install_path> -b Debug -p <install-prefix>
```
| Parameter | Abbreviation | Description |
|---|---|---|
| --run-mode | -r | Mode in which the operator runs. The value can be cpu or npu. Defaults to npu. |
| --soc-version | -v | Model of the AI processor on which the operator runs. NOTE: You can obtain the AI processor model <soc_version> by querying the target environment, for example, with the npu-smi info command. |
| --install-path | -i | CANN installation path. Change it to the actual path. Defaults to $HOME/Ascend/ascend-toolkit/latest. |
| --build-type | -b | Compilation mode, such as Debug or Release. Defaults to Debug. |
| --install-prefix | -p | Prefix of the installation path used when CMake executes the install command. The compilation results (the target specified in ascendc_library and the corresponding header files) are installed in this path. Defaults to the out directory under the current directory. |
After the script is executed, the message test pass in the output indicates that the operator precision meets the requirement.