Directly Debugging a Kernel Based on a Sample Project
The method of directly debugging a kernel based on a sample project described in this section will not evolve in the future. You are advised to use the command line or compile a CMake file for compilation. For details, see AI Core Operator Compilation.
The following uses the Add vector operator as an example to describe the process of developing a kernel operator.
For more operator sample projects, visit the following link:
- Vector Operator Sample
- sample of vector operators that support tiling
- cube operator sample
- vector+cube fusion operator sample
Environment Setup
- Before using the kernel launch operator project, install the driver firmware and CANN by referring to Environment Setup to set up the development environment and operating environment.
- This operator project requires CMake 3.16 or later. If the version does not meet the requirement, run the following command (an example) to upgrade the CMake to 3.16.0.
wget https://cmake.org/files/v3.16/cmake-3.16.0.tar.gz tar -zxvf cmake-3.16.0.tar.gz cd cmake-3.16.0 ./bootstrap --prefix=/usr sudo make sudo make install
Project Directory
Click Vector Operator Sample to obtain a complete sample of kernel function development and runtime verification. Structure of the sample directory:
AddKernelInvocationNeo |-- cmake // CMake build file |-- scripts | ├── gen_data.py // Script for generating the input data and truth value | ├── verify_result.py // Script for checking whether the output data is consistent with the truth value |-- CMakeLists.txt // CMake build configuration file |-- add_custom.cpp // Kernel implementation of the vector operator |-- data_utils.h // Data read and write functions |-- main.cpp // Main function for calling operators, on the CPU or NPU |-- run.sh // Script for operator build and running
Operator development procedure based on the operator project:
- Complete kernel implementation of the operator.
- Compile the application (main.cpp) that calls the operator.
- Modify the gen_data.py script for generating the input data and true value data based on the site requirements. Modify the verify_result.py script for verifying whether the output data is consistent with the truth value data.
- Modify the run.sh script for compiling and running the operator as required and execute the script to compile and run the operator and verify the result.
Operator Kernel Implementation
Refer to the kernel implementation of the vector operator, matrix operator, and fusion operator in the project directory to complete the Ascend C operator implementation file.
Only one kernel function can be defined in an operator kernel implementation file.
Operator Calling Program
The following code uses the add_custom operator with a static shape as an example. It describes how to compile the main.cpp file. When implementing your own applications, pay attention to the modifications caused by different operator kernel functions, including different operator kernel function names and input and output parameters. Properly allocate memory, copy memory, and read/write files. You can directly reuse the calling methods of related APIs.
- Include header files as required. The ASCENDC_CPU_DEBUG macro is used to distinguish the header files to be included on the CPU/NPU. Note that the header file aclrtlaunch_{kernel_name}.h (automatically generated by the project framework) where the declaration of the corresponding kernel function calling API is located must be included on the NPU. kernel_name indicates the name of the operator kernel function.
1 2 3 4 5 6 7 8
#include "data_utils.h" #ifndef ASCENDC_CPU_DEBUG #include "acl/acl.h" #include "aclrtlaunch_add_custom.h" #else #include "tikicpulib.h" extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z); #endif
- Compile the application framework. The application uses the ASCENDC_CPU_DEBUG macro to determine whether the code logic runs on the CPU or NPU.
1 2 3 4 5 6 7 8 9 10 11 12 13
int32_t main(int32_t argc, char* argv[]) { uint32_t blockDim = 8; size_t inputByteSize = 8 * 2048 * sizeof(uint16_t); size_t outputByteSize = 8 * 2048 * sizeof(uint16_t); #ifdef ASCENDC_CPU_DEBUG // Program for calling CPU debugging APIs #else // Program for calling the operator on the NPU #endif return 0; }
- Verify the running on the CPU. To implement the runtime verification of the operator kernel function on the CPU, perform the following steps.Figure 1 Runtime verification on the CPU
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
// Call GmAlloc to allocate shared memory and initialize data. uint8_t* x = (uint8_t*)AscendC::GmAlloc(inputByteSize); uint8_t* y = (uint8_t*)AscendC::GmAlloc(inputByteSize); uint8_t* z = (uint8_t*)AscendC::GmAlloc(outputByteSize); ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize); ReadFile("./input/input_y.bin", inputByteSize, y, inputByteSize); // Set the kernel mode to AIV for vector operators. AscendC::SetKernelMode(KernelMode::AIV_MODE); // Call the ICPU_RUN_KF debugging macro to call the kernel function on the CPU. ICPU_RUN_KF(add_custom, blockDim, x, y, z); // Write output data. WriteFile("./output/output_z.bin", z, outputByteSize); // Call GmFree to release allocated resources. AscendC::GmFree((void *)x); AscendC::GmFree((void *)y); AscendC::GmFree((void *)z);
- Verify the running on the NPU. To implement the runtime verification of the operator kernel function on the NPU, perform the following steps.Figure 2 Runtime verification on the NPU
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
* Initialization CHECK_ACL(aclInit(nullptr)); // Allocate runtime resources. int32_t deviceId = 0; CHECK_ACL(aclrtSetDevice(deviceId)); aclrtStream stream = nullptr; CHECK_ACL(aclrtCreateStream(&stream)); // Allocate the host buffer. uint8_t *xHost, *yHost, *zHost; uint8_t *xDevice, *yDevice, *zDevice; CHECK_ACL(aclrtMallocHost((void**)(&xHost), inputByteSize)); CHECK_ACL(aclrtMallocHost((void**)(&yHost), inputByteSize)); CHECK_ACL(aclrtMallocHost((void**)(&zHost), outputByteSize)); // Allocate the device buffer. CHECK_ACL(aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); CHECK_ACL(aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST)); // Initialize the host buffer. ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize); ReadFile("./input/input_y.bin", inputByteSize, yHost, inputByteSize); // Copy data from the host to the device. CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, yHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); // Use the kernel launch symbol <<<>>> to call the kernel function to complete the specified operation. <<<>>> call is encapsulated in add_custom_do. add_custom_do(blockDim, nullptr, stream, xDevice, yDevice, zDevice); // Call ACLRT_LAUNCH_KERNEL to use the kernel function to complete the specified operation. // ACLRT_LAUNCH_KERNEL(add_custom)(blockDim, stream, xDevice, yDevice, zDevice); CHECK_ACL(aclrtSynchronizeStream(stream)); // Copy the computation result from the device to the host. CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); WriteFile("./output/output_z.bin", zHost, outputByteSize); // Release allocated resources. CHECK_ACL(aclrtFree(xDevice)); CHECK_ACL(aclrtFree(yDevice)); CHECK_ACL(aclrtFree(zDevice)); CHECK_ACL(aclrtFreeHost(xHost)); CHECK_ACL(aclrtFreeHost(yHost)); CHECK_ACL(aclrtFreeHost(zHost)); //Perform deinitialization. CHECK_ACL(aclrtDestroyStream(stream)); CHECK_ACL(aclrtResetDevice(deviceId)); CHECK_ACL(aclFinalize());
The <<<>>> calling method has been described in section Kernel Function. This section describes only the usage of the ACLRT_LAUNCH_KERNEL API.
1ACLRT_LAUNCH_KERNEL(kernel_name)(blockDim, stream, argument list);
- kernel_name: name of the operator kernel function.
- blockDim: number of cores on which the kernel function will be executed. A logical ID, that is, block_idx, is allocated to each core that executes the kernel function, and block_idx can be obtained by calling GetBlockIdx during the kernel function implementation.
- stream, aclrtStream type. Streams preserve the order of a stack of asynchronous operations being executed on the device. For details about the stream creation and other management APIs, see section "Stream Management" in the Application Development Guide (C&C++).
- argument list: list of arguments, which must be the same as the argument list of the kernel function.
Compiling the CMake Build Configuration File
This section describes some key environment variables and CMake command parameters in the CMake file. Generally, you do not need to modify these parameters, but you can use them to better understand compilation principles and facilitate CMake customization.
|
Environment Variable |
Configuration Description |
|---|---|
|
SOC_VERSION |
AI Processor model.
|
|
ASCEND_CANN_PACKAGE_PATH |
Actual installation path of CANN. |
|
CMAKE_BUILD_TYPE |
Compilation mode options:
|
|
CMAKE_INSTALL_PREFIX |
Specifies the prefix of the installation path when the CMake executes the install command. Compilation results (target specified in ascendc_library and the corresponding header file) are installed in this path. The path is defaulted to the out directory in the current directory. |
|
CMAKE_CXX_COMPILER_LAUNCHER |
Configures the C++ compiler (such as g++) and BiSheng Compiler launcher as ccache, enabling cache compilation, accelerating repeated compilation, and improving build efficiency. Before using this function, you need to install ccache. The following shows how to perform configuration in the corresponding CMakeLists.txt file. set(CMAKE_CXX_COMPILER_LAUNCHER <launcher_program>) <launcher_program> indicates the ccache installation path, for example, /usr/bin/ccache. set(CMAKE_CXX_COMPILER_LAUNCHER /usr/bin/ccache) |
|
CMake Command |
Description |
|---|---|
|
add_executable |
Adds an executable to the project using the specified source file. The method is the same as that of using common command parameters of CMake. |
|
ascendc_library |
Adds a library to a project using the specified kernel function source file. The syntax format is as follows: ascendc_library(<target_name> [STATIC | SHARED]
[<source>...])
<target_name> indicates the name of the library file. The library file is created based on the source files listed in the command. STATIC and SHARED are used to specify the generated library type. The STATIC library is the archive file of the target file and is used when other targets are linked. The SHARED library is dynamically linked (dynamic link library) and loaded at run time. <source> indicates the source file of the kernel function. |
|
ascendc_fatbin_library |
Use the specified kernel source file to compile and generate a kernel binary file, which is used by the kernel loading and execution APIs. The syntax format is as follows: ascendc_fatbin_library(<target_name> [<source>...])
NOTE:
|
|
ascendc_compile_definitions |
Adds a compilation macro. You can add custom compilation macros and compilation macros provided by the Ascend C. The syntax format is as follows: ascendc_compile_definitions(<target_name> [PRIVATE]
[<xxx>...])
Compilation macros provided by the Ascend C:
|
|
ascendc_compile_options |
Adds compilation options. You can add compilation options for the compilation on the host and device. The syntax format is as follows: ascendc_compile_options(<target_name> PRIVATE
[<xxx>...]
)
By default, the specified compilation options are transferred to the compiler on the device for compilation. To transfer compilation options to the host compiler, use -forward-options-to-host-compiler. The following is an example: ascendc_compile_options(<target_name> PRIVATE
-g
-forward-options-to-host-compiler
-gdwarf-4
)
As shown in the preceding code, during compilation, the -g compilation option is transferred to the compiler on the device, and the -gdwarf-4 compilation option is transferred to the compiler on the host. Note: Compilation options on the host support only the compilation options supported by both the G++ and Clang compilers. |
|
ascendc_include_directories |
Adds custom search path of the header file. The syntax format is as follows: ascendc_include_directories(<target_name> [PRIVATE]
[<xxx>...]) |
The following figure shows the simplified compilation process: Compile the operator kernel function source file to generate a library file (*.so or *.a library file) on the kernel. The project framework automatically generates the header file for the declaration of the kernel function calling API. When compiling main.cpp (operator calling program), the preceding header file is used. The target file generated by the compiled program is linked to the library file on the kernel to generate the final executable file.
After the compilation and installation are finished, the compilation results are generated in the CMAKE_INSTALL_PREFIX directory. The final executable file is generated in the directory where CMake executes the command.
out ├── lib │ ├── libkernels1.a │ ├── libkernels2.so ├── include │ ├── kernels1 │ ├── aclrtlaunch_matmul_custom.h │ ├── aclrtlaunch_add_custom.h │ ├── kernels2 │ ├── aclrtlaunch_xxx.h │ ├── ...
You can use the msobjdump tool to parse the library file generated in the lib directory to obtain the kernel information. For details, see msobjdump.
Script for Generating and Verifying the Input Data and Truth Value
The following uses the add_custom operator with a static shape as an example to describe how to generate the input data and truth value. Compile the script based on the input and output of the operator to generate the input data and truth value.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
import numpy as np
def gen_golden_data_simple():
input_x = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
input_y = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
golden = (input_x + input_y).astype(np.float16)
input_x.tofile("./input/input_x.bin")
input_y.tofile("./input/input_y.bin")
golden.tofile("./output/golden.bin")
if __name__ == "__main__":
gen_golden_data_simple()
The following is an example of the script for verifying whether the output data is consistent with the truth value. The NumPy API is used to compute the absolute error and relative error between the output data and the truth value. If the deviation is within the tolerable range, the precision meets the requirement and test pass is output.
import os
import sys
import numpy as np
loss = 1e-3 # Tolerable deviation. For fp16, the absolute error and relative error should be less than 1‰.
minimum = 10e-10
def verify_result(real_result, golden):
real_result = np.fromfile(real_result, dtype=np.float16) # Read the actual operation result from the binary file.
golden = np.fromfile(golden, dtype=np.float16) # Read the expected operation result from the binary file.
result = np.abs(real_result - golden) # Compute the deviation between the actual result and the expected result.
deno = np.maximum(np.abs(real_result), np.abs(golden)) # Obtain the maximum value and form a new array.
result_atol = np.less_equal(result, loss) # Compute the absolute error.
result_rtol = np.less_equal(result / np.add(deno, minimum), loss) # Compute the relative error.
if not result_rtol.all() and not result_atol.all():
if np.sum(result_rtol == False) > real_result.size * loss and np.sum(result_atol == False) > real_result.size * loss:
print("[ERROR] result error")
return False
print("test pass")
return True
if __name__ == '__main__':
verify_result(sys.argv[1],sys.argv[2])
Modifying and Executing the Script for One-Click Compilation and Running
You can refer to the one-click script provided in the sample to quickly compile and run the Ascend C operator on the CPU or NPU. The one-click compilation and running script provide the following functions.
The one-click compilation and running script provided in the sample does not apply to all operator runtime verification scenarios. Modify the script based on the actual situation.
- Compile your script for generating input and truth values based on the algorithm principles of the Ascend C operator.
After compiling the preceding files, you can run script for one-click compilation and running.
bash run.sh --run-mode=npu --soc-version=<soc_version> --install-path=<install_path> --build-type=Debug --install-prefix=<install-prefix> bash run.sh -r npu -v <soc_version> -i <install_path> -b Debug -p <install-prefix>
|
Parameter |
Abbreviation |
Description |
|---|---|---|
|
--run-mode |
-r |
The operator runs in CPU or NPU mode. The value can be cpu or npu. |
|
--soc-version |
-v |
Model of the AI processor where the operator runs.
NOTE:
The AI processor model <soc_version> can be obtained in the following ways:
|
|
--install-path |
-i |
Set this parameter to the CANN installation path. Change it to the actual path. Default path: $HOME/Ascend/ascend-toolkit/latest |
|
--build-type |
-b |
Compilation mode options:
Defaults to Debug. |
|
--install-prefix |
-p |
Specifies the prefix of the installation path when the CMake executes the install command. Compilation results (target specified in ascendc_library and the corresponding header file) are installed in this path. The path is defaulted to the out directory in the current directory. |
After the script execution, test pass in the output information indicates that the operator accuracy meets the requirements.