Directly Debugging a Kernel Based on a Sample Project

The method of directly debugging a kernel based on a sample project described in this section will not evolve in the future. You are advised to use the command line or compile a CMake file for compilation. For details, see AI Core Operator Compilation.

The following uses the Add vector operator as an example to describe the process of developing a kernel operator.

For more operator sample projects, visit the following link:

Environment Setup

  • Before using the kernel launch operator project, install the driver firmware and CANN by referring to Environment Setup to set up the development environment and operating environment.
  • This operator project requires CMake 3.16 or later. If the version does not meet the requirement, run the following command (an example) to upgrade the CMake to 3.16.0.
    wget https://cmake.org/files/v3.16/cmake-3.16.0.tar.gz 
    tar -zxvf cmake-3.16.0.tar.gz 
    cd cmake-3.16.0
     ./bootstrap --prefix=/usr 
    sudo make 
    sudo make install

Project Directory

Click Vector Operator Sample to obtain a complete sample of kernel function development and runtime verification. Structure of the sample directory:

AddKernelInvocationNeo
|-- cmake                                                 // CMake build file
|-- scripts
|  ├── gen_data.py                                     // Script for generating the input data and truth value
|  ├── verify_result.py                                // Script for checking whether the output data is consistent with the truth value
|-- CMakeLists.txt                                        // CMake build configuration file
|-- add_custom.cpp                                        // Kernel implementation of the vector operator
|-- data_utils.h                                          // Data read and write functions
|-- main.cpp                                              // Main function for calling operators, on the CPU or NPU
|-- run.sh                                                // Script for operator build and running

Operator development procedure based on the operator project:

  • Complete kernel implementation of the operator.
  • Compile the application (main.cpp) that calls the operator.
  • Compile the CMake build configuration file: CMakeLists.txt.

  • Modify the gen_data.py script for generating the input data and true value data based on the site requirements. Modify the verify_result.py script for verifying whether the output data is consistent with the truth value data.
  • Modify the run.sh script for compiling and running the operator as required and execute the script to compile and run the operator and verify the result.

Operator Kernel Implementation

Refer to the kernel implementation of the vector operator, matrix operator, and fusion operator in the project directory to complete the Ascend C operator implementation file.

Only one kernel function can be defined in an operator kernel implementation file.

Operator Calling Program

The following code uses the add_custom operator with a static shape as an example. It describes how to compile the main.cpp file. When implementing your own applications, pay attention to the modifications caused by different operator kernel functions, including different operator kernel function names and input and output parameters. Properly allocate memory, copy memory, and read/write files. You can directly reuse the calling methods of related APIs.

  1. Include header files as required. The ASCENDC_CPU_DEBUG macro is used to distinguish the header files to be included on the CPU/NPU. Note that the header file aclrtlaunch_{kernel_name}.h (automatically generated by the project framework) where the declaration of the corresponding kernel function calling API is located must be included on the NPU. kernel_name indicates the name of the operator kernel function.
    1
    2
    3
    4
    5
    6
    7
    8
    #include "data_utils.h"
    #ifndef ASCENDC_CPU_DEBUG
    #include "acl/acl.h"
    #include "aclrtlaunch_add_custom.h"
    #else
    #include "tikicpulib.h"
    extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z);
    #endif
    
  2. Compile the application framework. The application uses the ASCENDC_CPU_DEBUG macro to determine whether the code logic runs on the CPU or NPU.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    int32_t main(int32_t argc, char* argv[])
    {
        uint32_t blockDim = 8;
        size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);
        size_t outputByteSize = 8 * 2048 * sizeof(uint16_t);
    
    #ifdef ASCENDC_CPU_DEBUG
        // Program for calling CPU debugging APIs
    #else
        // Program for calling the operator on the NPU
    #endif
        return 0;
    }
    
  3. Verify the running on the CPU. To implement the runtime verification of the operator kernel function on the CPU, perform the following steps.
    Figure 1 Runtime verification on the CPU
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
        // Call GmAlloc to allocate shared memory and initialize data.
        uint8_t* x = (uint8_t*)AscendC::GmAlloc(inputByteSize);
        uint8_t* y = (uint8_t*)AscendC::GmAlloc(inputByteSize);
        uint8_t* z = (uint8_t*)AscendC::GmAlloc(outputByteSize);
    
        ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);
        ReadFile("./input/input_y.bin", inputByteSize, y, inputByteSize);
        // Set the kernel mode to AIV for vector operators.
        AscendC::SetKernelMode(KernelMode::AIV_MODE);
        // Call the ICPU_RUN_KF debugging macro to call the kernel function on the CPU.
        ICPU_RUN_KF(add_custom, blockDim, x, y, z);
        // Write output data.
        WriteFile("./output/output_z.bin", z, outputByteSize);
        // Call GmFree to release allocated resources.
        AscendC::GmFree((void *)x);
        AscendC::GmFree((void *)y);
        AscendC::GmFree((void *)z);
    
  4. Verify the running on the NPU. To implement the runtime verification of the operator kernel function on the NPU, perform the following steps.
    Figure 2 Runtime verification on the NPU
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    * Initialization
        CHECK_ACL(aclInit(nullptr));
        // Allocate runtime resources.
        int32_t deviceId = 0;
        CHECK_ACL(aclrtSetDevice(deviceId));
        aclrtStream stream = nullptr;
        CHECK_ACL(aclrtCreateStream(&stream));
        // Allocate the host buffer.
        uint8_t *xHost, *yHost, *zHost;
        uint8_t *xDevice, *yDevice, *zDevice;
    
        CHECK_ACL(aclrtMallocHost((void**)(&xHost), inputByteSize));
        CHECK_ACL(aclrtMallocHost((void**)(&yHost), inputByteSize));
        CHECK_ACL(aclrtMallocHost((void**)(&zHost), outputByteSize));
        // Allocate the device buffer.
        CHECK_ACL(aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMalloc((void**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        // Initialize the host buffer.
        ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
        ReadFile("./input/input_y.bin", inputByteSize, yHost, inputByteSize);
        // Copy data from the host to the device.
        CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
        CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, yHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
        // Use the kernel launch symbol <<<>>> to call the kernel function to complete the specified operation. <<<>>> call is encapsulated in add_custom_do.
        add_custom_do(blockDim, nullptr, stream, xDevice, yDevice, zDevice);
      // Call ACLRT_LAUNCH_KERNEL to use the kernel function to complete the specified operation.
        // ACLRT_LAUNCH_KERNEL(add_custom)(blockDim, stream, xDevice, yDevice, zDevice);
        CHECK_ACL(aclrtSynchronizeStream(stream));
        // Copy the computation result from the device to the host.
        CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST));
        WriteFile("./output/output_z.bin", zHost, outputByteSize);
        // Release allocated resources.
        CHECK_ACL(aclrtFree(xDevice));
        CHECK_ACL(aclrtFree(yDevice));
        CHECK_ACL(aclrtFree(zDevice));
        CHECK_ACL(aclrtFreeHost(xHost));
        CHECK_ACL(aclrtFreeHost(yHost));
        CHECK_ACL(aclrtFreeHost(zHost));
        //Perform deinitialization.
        CHECK_ACL(aclrtDestroyStream(stream));
        CHECK_ACL(aclrtResetDevice(deviceId));
        CHECK_ACL(aclFinalize());
    

    The <<<>>> calling method has been described in section Kernel Function. This section describes only the usage of the ACLRT_LAUNCH_KERNEL API.

    1
    ACLRT_LAUNCH_KERNEL(kernel_name)(blockDim, stream, argument list);
    
    • kernel_name: name of the operator kernel function.
    • blockDim: number of cores on which the kernel function will be executed. A logical ID, that is, block_idx, is allocated to each core that executes the kernel function, and block_idx can be obtained by calling GetBlockIdx during the kernel function implementation.
    • stream, aclrtStream type. Streams preserve the order of a stack of asynchronous operations being executed on the device. For details about the stream creation and other management APIs, see section "Stream Management" in the Application Development Guide (C&C++).
    • argument list: list of arguments, which must be the same as the argument list of the kernel function.

Compiling the CMake Build Configuration File

This section describes some key environment variables and CMake command parameters in the CMake file. Generally, you do not need to modify these parameters, but you can use them to better understand compilation principles and facilitate CMake customization.

Table 1 Description of environment variables

Environment Variable

Configuration Description

SOC_VERSION

AI Processor model.

  • For the following products: Run the npu-smi info command on the server where Ascend AI Processor is installed to obtain the Name information. The actual value is AscendName. For example, if Name is xxxyy, the actual value is Ascendxxxyy.

    Atlas A2 training products / Atlas A2 inference products

    Atlas 200I/500 A2 inference products

    Atlas inference products

    Atlas training products

  • For the following products: Run the npu-smi info -t board -i id -c chip_id command on the server where Ascend AI Processor is installed to obtain the Chip Name and NPU Name information. The actual value is Chip Name_NPU Name. For example, if the value of Chip Name is Ascendxxx and the value of NPU Name is 1234, the actual value is Ascendxxx_1234. Note that:
    • id: device ID, which is the NPU ID obtained by running the npu-smi info -l command.
    • chip_id: chip ID, which is obtained by running the npu-smi info -m command.

    Atlas A3 training products / Atlas A3 inference products

ASCEND_CANN_PACKAGE_PATH

Actual installation path of CANN.

CMAKE_BUILD_TYPE

Compilation mode options:

  • Release: release version, which does not contain debugging information. The final release version is compiled.
  • Debug: debug version, which contains debugging information for you to develop and debug.

CMAKE_INSTALL_PREFIX

Specifies the prefix of the installation path when the CMake executes the install command. Compilation results (target specified in ascendc_library and the corresponding header file) are installed in this path. The path is defaulted to the out directory in the current directory.

CMAKE_CXX_COMPILER_LAUNCHER

Configures the C++ compiler (such as g++) and BiSheng Compiler launcher as ccache, enabling cache compilation, accelerating repeated compilation, and improving build efficiency. Before using this function, you need to install ccache.

The following shows how to perform configuration in the corresponding CMakeLists.txt file.

set(CMAKE_CXX_COMPILER_LAUNCHER <launcher_program>)

<launcher_program> indicates the ccache installation path, for example, /usr/bin/ccache.

set(CMAKE_CXX_COMPILER_LAUNCHER /usr/bin/ccache)
Table 2 Syntax of the CMake command

CMake Command

Description

add_executable

Adds an executable to the project using the specified source file. The method is the same as that of using common command parameters of CMake.

ascendc_library

Adds a library to a project using the specified kernel function source file. The syntax format is as follows:

ascendc_library(<target_name> [STATIC | SHARED]
            [<source>...]) 

<target_name> indicates the name of the library file. The library file is created based on the source files listed in the command. STATIC and SHARED are used to specify the generated library type. The STATIC library is the archive file of the target file and is used when other targets are linked. The SHARED library is dynamically linked (dynamic link library) and loaded at run time. <source> indicates the source file of the kernel function.

ascendc_fatbin_library

Use the specified kernel source file to compile and generate a kernel binary file, which is used by the kernel loading and execution APIs. The syntax format is as follows:

ascendc_fatbin_library(<target_name> [<source>...]) 
  • <target_name> indicates the name of the library file. The library file is generated by compiling the kernel source file listed in the command and the <target_name>.o file is stored in the ${CMAKE_INSTALL_PREFIX}/fatbin/${target_name}/ directory.
  • <source> indicates the source file of the kernel function.
NOTE:

ascendc_compile_definitions

Adds a compilation macro. You can add custom compilation macros and compilation macros provided by the Ascend C. The syntax format is as follows:

ascendc_compile_definitions(<target_name> [PRIVATE]
            [<xxx>...]) 

Compilation macros provided by the Ascend C:

  • -DASCENDC_DUMP is used to control the dump function. By default, the dump function is enabled. After you call printf/DumpTensor/assert, information is printed. (Note that the host function exists in the kernel file of the launch project. If the printf API is called in the host function, the printf initialization in the kernel will be triggered, affecting the kernel execution performance.) If this parameter is set to 0, the dump function is disabled. The following is an example:
    // Disable the printf printing function of all operators.
    ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
        ASCENDC_DUMP=0
    )
  • -DASCENDC_DEBUG is used to control the debugging switch of the Ascend C API. By default, the switch is disabled. After this compilation macro is added, the switch is enabled. In this case, the assert verification in the API takes effect. If the verification fails, the assert log is displayed. The printing function affects the actual running performance of the operator. Therefore, this function is usually used in the debugging phase. The following is an example:
    ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
        ASCENDC_DEBUG
    )

    Currently, the ASCENDC_DEBUG function supports the following product models:

    Atlas inference products

    Atlas A2 training products / Atlas A2 inference products

  • HAVE_WORKSPACE indicates whether the kernel entry contains the workspace input parameter. By default, the kernel entry does not contain the workspace input parameter. After this compilation macro is added, the kernel entry contains the workspace input parameter. In this case, the framework obtains the last parameter (not configured with HAVE_TILING) or the second-to-last parameter (configured with HAVE_TILING) of the kernel input parameter, and automatically sets the system workspace on the kernel side. The workspace obtained by the developer from the kernel input parameter is the user workspace that is offset by the system workspace. When you use high-level APIs that require the system workspace, such as Matmul Kernel APIs, you are advised to enable this function. The argument layout and system workspace setting logic are the same as those of Project-based Operator Development, reducing the modification cost caused by switching between different development modes. Note that developers on the host still need to allocate workspace space. The size of the system workspace can be obtained through the GetLibApiWorkSpaceSize API of PlatformAscendCManager. The following is an example of setting HAVE_WORKSPACE:
    ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
        HAVE_WORKSPACE
    )
  • HAVE_TILING indicates whether the kernel entry contains the tiling input parameter. This compilation macro takes effect only after HAVE_WORKSPACE is configured. By default, the compilation macro is not included and the switch is disabled. After the compilation macro is added, the compilation macro is included. In this case, the framework considers the last parameter of the kernel argument as tiling and the last but one parameter as workspace. The framework does not process the tiling argument. It only determines the location of the workspace parameter based on the argument. This compilation macro can be used to ensure that the argument is the same as that of Project-based Operator Development, reducing the modification cost caused by switching between different development modes of operator implementation. A configuration example is as follows:
    ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
        HAVE_WORKSPACE
        HAVE_TILING
    )

ascendc_compile_options

Adds compilation options. You can add compilation options for the compilation on the host and device. The syntax format is as follows:

ascendc_compile_options(<target_name> PRIVATE
    [<xxx>...]
)

By default, the specified compilation options are transferred to the compiler on the device for compilation. To transfer compilation options to the host compiler, use -forward-options-to-host-compiler. The following is an example:

ascendc_compile_options(<target_name> PRIVATE
    -g
    -forward-options-to-host-compiler
    -gdwarf-4
)

As shown in the preceding code, during compilation, the -g compilation option is transferred to the compiler on the device, and the -gdwarf-4 compilation option is transferred to the compiler on the host.

Note: Compilation options on the host support only the compilation options supported by both the G++ and Clang compilers.

ascendc_include_directories

Adds custom search path of the header file. The syntax format is as follows:

ascendc_include_directories(<target_name> [PRIVATE]
            [<xxx>...]) 

The following figure shows the simplified compilation process: Compile the operator kernel function source file to generate a library file (*.so or *.a library file) on the kernel. The project framework automatically generates the header file for the declaration of the kernel function calling API. When compiling main.cpp (operator calling program), the preceding header file is used. The target file generated by the compiled program is linked to the library file on the kernel to generate the final executable file.

Figure 3 Simplified compilation process

After the compilation and installation are finished, the compilation results are generated in the CMAKE_INSTALL_PREFIX directory. The final executable file is generated in the directory where CMake executes the command.

out
├── lib 
│   ├── libkernels1.a
│   ├── libkernels2.so
├── include
│   ├── kernels1
│           ├── aclrtlaunch_matmul_custom.h
│           ├── aclrtlaunch_add_custom.h
│   ├── kernels2
│           ├── aclrtlaunch_xxx.h
│           ├── ...

You can use the msobjdump tool to parse the library file generated in the lib directory to obtain the kernel information. For details, see msobjdump.

Script for Generating and Verifying the Input Data and Truth Value

The following uses the add_custom operator with a static shape as an example to describe how to generate the input data and truth value. Compile the script based on the input and output of the operator to generate the input data and truth value.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
import numpy as np

def gen_golden_data_simple():
    input_x = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
    input_y = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
    golden = (input_x + input_y).astype(np.float16)

    input_x.tofile("./input/input_x.bin")
    input_y.tofile("./input/input_y.bin")
    golden.tofile("./output/golden.bin")

if __name__ == "__main__":
    gen_golden_data_simple()

The following is an example of the script for verifying whether the output data is consistent with the truth value. The NumPy API is used to compute the absolute error and relative error between the output data and the truth value. If the deviation is within the tolerable range, the precision meets the requirement and test pass is output.

import os
import sys
import numpy as np

loss = 1e-3 # Tolerable deviation. For fp16, the absolute error and relative error should be less than 1‰.
minimum = 10e-10

def verify_result(real_result, golden):
    real_result = np.fromfile(real_result, dtype=np.float16) # Read the actual operation result from the binary file.
    golden = np.fromfile(golden, dtype=np.float16) # Read the expected operation result from the binary file.
    result = np.abs(real_result - golden) # Compute the deviation between the actual result and the expected result.
    deno = np.maximum(np.abs(real_result), np.abs(golden)) # Obtain the maximum value and form a new array.
    result_atol = np.less_equal(result, loss) # Compute the absolute error.
    result_rtol = np.less_equal(result / np.add(deno, minimum), loss) # Compute the relative error.
    if not result_rtol.all() and not result_atol.all():
        if np.sum(result_rtol == False) > real_result.size * loss and np.sum(result_atol == False) > real_result.size * loss:
            print("[ERROR] result error")
            return False
    print("test pass")
    return True

if __name__ == '__main__':
    verify_result(sys.argv[1],sys.argv[2])

Modifying and Executing the Script for One-Click Compilation and Running

You can refer to the one-click script provided in the sample to quickly compile and run the Ascend C operator on the CPU or NPU. The one-click compilation and running script provide the following functions.

Figure 4 Process of operator compiling and running in one-click mode

The one-click compilation and running script provided in the sample does not apply to all operator runtime verification scenarios. Modify the script based on the actual situation.

  • Compile your script for generating input and truth values based on the algorithm principles of the Ascend C operator.

After compiling the preceding files, you can run script for one-click compilation and running.

The following table describes script parameters and how to execute the script:
bash run.sh --run-mode=npu  --soc-version=<soc_version> --install-path=<install_path>  --build-type=Debug  --install-prefix=<install-prefix>

bash run.sh -r npu  -v <soc_version> -i <install_path> -b Debug -p <install-prefix>
Table 3 Script parameters

Parameter

Abbreviation

Description

--run-mode

-r

The operator runs in CPU or NPU mode.

The value can be cpu or npu.

--soc-version

-v

Model of the AI processor where the operator runs.

NOTE:

The AI processor model <soc_version> can be obtained in the following ways:

  • For the following products: Run the npu-smi info command on the server where Ascend AI Processor is installed to obtain the Name information. The actual value is AscendName. For example, if Name is xxxyy, the actual value is Ascendxxxyy.

    Atlas A2 training products / Atlas A2 inference products

    Atlas 200I/500 A2 inference products

    Atlas inference products

    Atlas training products

  • For the following products: Run the npu-smi info -t board -i id -c chip_id command on the server where Ascend AI Processor is installed to obtain the Chip Name and NPU Name information. The actual value is Chip Name_NPU Name. For example, if the value of Chip Name is Ascendxxx and the value of NPU Name is 1234, the actual value is Ascendxxx_1234. Note that:
    • id: device ID, which is the NPU ID obtained by running the npu-smi info -l command.
    • chip_id: chip ID, which is obtained by running the npu-smi info -m command.

    Atlas A3 training products / Atlas A3 inference products

--install-path

-i

Set this parameter to the CANN installation path. Change it to the actual path.

Default path: $HOME/Ascend/ascend-toolkit/latest

--build-type

-b

Compilation mode options:

  • Release: release version, which does not contain debugging information. The final release version is compiled.
  • Debug: debug version, which contains debugging information for you to develop and debug.

Defaults to Debug.

--install-prefix

-p

Specifies the prefix of the installation path when the CMake executes the install command. Compilation results (target specified in ascendc_library and the corresponding header file) are installed in this path. The path is defaulted to the out directory in the current directory.

After the script execution, test pass in the output information indicates that the operator accuracy meets the requirements.