Kernel Launch

A kernel launch operator project is provided to help you quickly complete the kernel launch of operators and to facilitate debugging and optimization. You can develop operators based on the sample code and project framework of this operator project. The functions provided by the project are described below.

Obtain the operator sample through the following links:

The following figure shows the operator development process based on the kernel launch project.

The following uses the Add vector operator as an example to describe the operator project in detail.

Environment Setup

  • Before using the kernel launch operator project, install the driver, firmware, and CANN by referring to Environment Setup to prepare the development environment and the operating environment.
  • This operator project requires CMake 3.16 or later. If your CMake version does not meet the requirement, run the following commands (an example) to upgrade CMake to 3.16.0:
    wget https://cmake.org/files/v3.16/cmake-3.16.0.tar.gz --no-check-certificate
    tar -zxvf cmake-3.16.0.tar.gz
    cd cmake-3.16.0
    ./bootstrap --prefix=/usr
    make
    sudo make install

Project Directory

Click vector operator sample to obtain a complete sample of kernel function development and runtime verification. Structure of the sample directory:

AddKernelInvocationNeo
├── cmake                                              // CMake build files
├── scripts
│   ├── gen_data.py                                    // Script for generating the input data and truth value
│   └── verify_result.py                               // Script for checking whether the output data is consistent with the truth value
├── CMakeLists.txt                                     // CMake build configuration file
├── add_custom.cpp                                     // Kernel implementation of the vector operator
├── data_utils.h                                       // Data read and write functions
├── main.cpp                                           // Main function that calls the operator on the CPU or NPU
└── run.sh                                             // Script for building and running the operator

Operator development procedure based on the operator project:

  • Write the kernel implementation of the operator.
  • Write the application (main.cpp) that calls the operator.
  • Write the CMake build configuration file CMakeLists.txt.
  • Modify the gen_data.py script, which generates the input data and truth value data, as required. Modify the verify_result.py script, which verifies whether the output data is consistent with the truth value data.
  • Modify the run.sh script for compiling and running the operator as required, then execute it to compile and run the operator and verify the result.

Operator Kernel Implementation

Write the Ascend C operator implementation file by referring to Vector Programming and to the kernel implementations of the cube and fusion operators in the project directory.

Operator Calling Program

The following code uses the add_custom operator with a static shape as an example to describe how to write the main.cpp file. When implementing your own application, pay attention to the modifications required by different operator kernel functions, including the kernel function name and the input and output parameters. Allocate memory, copy memory, and read and write files properly; the calling methods of the related APIs can be reused directly.

  1. Include header files as required. The ASCENDC_CPU_DEBUG macro distinguishes the header files to include on the CPU and on the NPU. Note that on the NPU you must include the header file aclrtlaunch_{kernel_name}.h (automatically generated by the project framework), which declares the corresponding kernel function calling API; kernel_name indicates the name of the operator kernel function.
    #include "data_utils.h"
    #ifndef ASCENDC_CPU_DEBUG
    #include "acl/acl.h"
    #include "aclrtlaunch_add_custom.h"
    #else
    #include "tikicpulib.h"
    extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z);
    #endif
    
  2. Write the application framework. The application uses the ASCENDC_CPU_DEBUG macro to determine whether the code logic runs on the CPU or the NPU.
    int32_t main(int32_t argc, char* argv[])
    {
        uint32_t blockDim = 8;
        size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);
        size_t outputByteSize = 8 * 2048 * sizeof(uint16_t);
    
    #ifdef ASCENDC_CPU_DEBUG
        // Program for calling CPU debugging APIs
    #else
        // Program for calling the operator on the NPU
    #endif
        return 0;
    }
    
  3. Verify running on the CPU. Perform the following steps to verify the operator kernel function at run time on the CPU.
    Figure 1 Runtime verification on the CPU
        // Call GmAlloc to allocate shared memory and initialize data.
        uint8_t* x = (uint8_t*)AscendC::GmAlloc(inputByteSize);
        uint8_t* y = (uint8_t*)AscendC::GmAlloc(inputByteSize);
        uint8_t* z = (uint8_t*)AscendC::GmAlloc(outputByteSize);
    
        ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);
        ReadFile("./input/input_y.bin", inputByteSize, y, inputByteSize);
        // Set the kernel mode to AIV for vector operators.
        AscendC::SetKernelMode(KernelMode::AIV_MODE);
        // Call the ICPU_RUN_KF debugging macro to call the kernel function on the CPU.
        ICPU_RUN_KF(add_custom, blockDim, x, y, z);
        // Write output data.
        WriteFile("./output/output_z.bin", z, outputByteSize);
        // Call GmFree to release allocated resources.
        AscendC::GmFree((void *)x);
        AscendC::GmFree((void *)y);
        AscendC::GmFree((void *)z);
    
  4. Verify running on the NPU. Perform the following steps to verify the operator kernel function at run time on the NPU.
    Figure 2 Runtime verification on the NPU
        // Initialize AscendCL.
        CHECK_ACL(aclInit(nullptr));
        // Allocate runtime resources.
        int32_t deviceId = 0;
        CHECK_ACL(aclrtSetDevice(deviceId));
        aclrtStream stream = nullptr;
        CHECK_ACL(aclrtCreateStream(&stream));
        // Allocate the host buffer.
        uint8_t *xHost, *yHost, *zHost;
        uint8_t *xDevice, *yDevice, *zDevice;
    
        CHECK_ACL(aclrtMallocHost((void**)(&xHost), inputByteSize));
        CHECK_ACL(aclrtMallocHost((void**)(&yHost), inputByteSize));
        CHECK_ACL(aclrtMallocHost((void**)(&zHost), outputByteSize));
        // Allocate the device buffer.
        CHECK_ACL(aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMalloc((void**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        // Initialize the host buffer.
        ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
        ReadFile("./input/input_y.bin", inputByteSize, yHost, inputByteSize);
        // Copy data from the host to the device.
        CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
        CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, yHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
        // Use ACLRT_LAUNCH_KERNEL to call the kernel function to complete the specified operation.
        ACLRT_LAUNCH_KERNEL(add_custom)(blockDim, stream, xDevice, yDevice, zDevice);
        // Alternatively, use the kernel launch symbol <<<...>>> to call the kernel function; the <<<...>>> call is encapsulated in add_custom_do.
        // add_custom_do(blockDim, nullptr, stream, xDevice, yDevice, zDevice);
        CHECK_ACL(aclrtSynchronizeStream(stream));
        // Copy the computation result from the device to the host.
        CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST));
        WriteFile("./output/output_z.bin", zHost, outputByteSize);
        // Release allocated resources.
        CHECK_ACL(aclrtFree(xDevice));
        CHECK_ACL(aclrtFree(yDevice));
        CHECK_ACL(aclrtFree(zDevice));
        CHECK_ACL(aclrtFreeHost(xHost));
        CHECK_ACL(aclrtFreeHost(yHost));
        CHECK_ACL(aclrtFreeHost(zHost));
        // Deinitialize AscendCL.
        CHECK_ACL(aclrtDestroyStream(stream));
        CHECK_ACL(aclrtResetDevice(deviceId));
        CHECK_ACL(aclFinalize());
    

Writing the CMake Build Configuration File

This section describes some key environment variables and CMake command parameters in the CMake file. Generally, you do not need to modify these parameters, but you can use them to better understand compilation principles and facilitate CMake customization.

Table 1 Description of environment variables

  • SOC_VERSION: AI Processor model. Run the npu-smi info command on the server where the Ascend AI Processor is installed to obtain the Chip Name information. The actual value is Ascend<Chip Name>. For example, if Chip Name is xxxyy, the actual value is Ascendxxxyy.
  • ASCEND_CANN_PACKAGE_PATH: Actual installation path of CANN.
  • CMAKE_BUILD_TYPE: Compilation mode. Release builds the final release version, which contains no debugging information; Debug builds a debug version, which contains debugging information for development and debugging.
  • CMAKE_INSTALL_PREFIX: Prefix of the installation path used when CMake executes the install command. The compilation results (the target specified in ascendc_library and the corresponding header files) are installed in this path. Defaults to the out directory under the current directory.
  • CMAKE_CXX_COMPILER_LAUNCHER: Configures ccache as the launcher of the C++ compiler (such as g++) and the BiSheng compiler, enabling compilation caching to accelerate repeated builds. Configure it in CMakeLists.txt as follows:

    set(CMAKE_CXX_COMPILER_LAUNCHER <launcher_program>)

    <launcher_program> indicates the ccache installation path, for example:

    set(CMAKE_CXX_COMPILER_LAUNCHER /usr/bin/ccache)
Table 2 Syntax of the CMake commands

add_executable

Adds an executable to the project using the specified source files. Its usage is the same as that of the standard CMake add_executable command.

ascendc_library

Adds a library to a project using the specified kernel function source file. The syntax format is as follows:

ascendc_library(<target_name> [STATIC | SHARED]
            [<source>...]) 

<target_name> indicates the name of the library file, which is created from the source files listed in the command. STATIC and SHARED specify the library type: a STATIC library is an archive of object files used when other targets link against it; a SHARED library is a dynamic link library loaded at run time. <source> indicates the kernel function source files.
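As a sketch of how these pieces fit together, the following CMakeLists.txt fragment builds a static kernel library from the sample's kernel source and links the calling program against it (the target names are illustrative, not taken from the sample):

```cmake
# Build a static kernel library from the sample kernel source file.
# The target names used here are illustrative.
ascendc_library(ascendc_kernels STATIC
    add_custom.cpp
)

# Build the operator calling program and link it against the kernel library.
add_executable(ascendc_execute main.cpp)
target_link_libraries(ascendc_execute PRIVATE ascendc_kernels)
```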

ascendc_compile_definitions

Adds a compilation macro. You can add custom compilation macros and the compilation macros provided by Ascend C. The syntax format is as follows:

ascendc_compile_definitions(<target_name> [PRIVATE]
            [<xxx>...]) 

Compilation macros provided by Ascend C:

  • -DASCENDC_DUMP controls the dump function, which is enabled by default: information is printed when you call printf/DumpTensor/assert. (Note that host functions exist in the kernel file of the launch project; if the printf API is called in a host function, printf initialization in the kernel is triggered, affecting kernel execution performance.) Set this macro to 0 to disable the dump function. The following is an example:
    // Disable the printf printing function of all operators.
    ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
        -DASCENDC_DUMP=0
    )
  • -DASCENDC_DEBUG controls the debugging switch of the Ascend C API. The switch is disabled by default and is enabled when this compilation macro is added. When enabled, the assert verification in the API takes effect; if verification fails, an assert log is displayed. Because printing affects the actual running performance of the operator, this function is usually used only in the debugging phase. The following is an example:
    ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
        -DASCENDC_DEBUG
    )

    The -DASCENDC_DEBUG function supports the following product models:

  • -DHAVE_WORKSPACE indicates whether the kernel entry includes the workspace argument. The workspace is not included by default and is included after this compilation macro is added. When included, the framework takes the last kernel argument (when -DHAVE_TILING is not configured) or the second-to-last argument (when -DHAVE_TILING is configured) and automatically sets the system workspace on the kernel. The workspace obtained from the argument on the kernel is the user workspace, offset past the system workspace. When you use high-level APIs that require the system workspace, such as Matmul, you are advised to enable this function. The argument layout and the system workspace setting logic are the same as those of Project-based Operator Development, reducing the modification cost of switching between development modes. Note that developers still need to allocate the workspace on the host; the size of the system workspace can be obtained through the GetLibApiWorkSpaceSize API of PlatformAscendCManager. A configuration example of -DHAVE_WORKSPACE is as follows:
    ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
        -DHAVE_WORKSPACE
    )
  • -DHAVE_TILING indicates whether the kernel entry contains the tiling argument. This compilation macro takes effect only after -DHAVE_WORKSPACE is configured. It is not included by default; after it is added, the framework treats the last kernel argument as tiling and the second-to-last argument as workspace. The framework does not process the tiling argument itself; it only uses it to determine the position of the workspace argument. This compilation macro keeps the argument layout the same as that of Project-based Operator Development, reducing the modification cost of switching between development modes. A configuration example is as follows:
    ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
        -DHAVE_WORKSPACE
        -DHAVE_TILING
    )
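For illustration, with both -DHAVE_WORKSPACE and -DHAVE_TILING configured, the kernel entry declaration might take the following hypothetical form, extending the add_custom signature shown earlier (the workspace and tiling parameter names are assumptions, not taken from the sample):

```
// Hypothetical kernel entry when -DHAVE_WORKSPACE and -DHAVE_TILING are configured:
// the second-to-last argument is treated as workspace, the last as tiling.
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z,
                                                 GM_ADDR workspace, GM_ADDR tiling);
```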

ascendc_compile_options

Adds compilation options for host and device compilation. The syntax format is as follows:

ascendc_compile_options(ascendc_kernels_${RUN_MODE} PRIVATE
    [<xxx>...]
)

By default, the specified compilation options are passed to the compiler on the device for compilation. To pass compilation options to the host compiler, use -forward-options-to-host-compiler. The following is an example:

ascendc_compile_options(ascendc_kernels_${RUN_MODE} PRIVATE
    -g
    -forward-options-to-host-compiler
    -gdwarf-4
)

As shown in the preceding code, during compilation, the -g compilation option is passed to the compiler on the device, and the -gdwarf-4 compilation option is passed to the compiler on the host.

Note: Only compilation options supported by both the g++ and clang compilers can be passed to the host compiler.

ascendc_include_directories

Adds custom header file search paths. The syntax format is as follows:

ascendc_include_directories(<target_name> [PRIVATE]
            [<xxx>...]) 
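For example, a fragment that adds a project-local include directory to the kernel target might look like this (the target name and directory are illustrative):

```cmake
# Add a custom header file search path for the kernel target.
# The include directory shown here is illustrative.
ascendc_include_directories(ascendc_kernels_${RUN_MODE} PRIVATE
    ${CMAKE_CURRENT_SOURCE_DIR}/include
)
```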

The following figure shows the simplified compilation process. The operator kernel function source files are compiled into a kernel library file (a *.so or *.a file), and the project framework automatically generates the header file that declares the kernel function calling API. This header file is used when compiling main.cpp (the operator calling program); the resulting object file is then linked with the kernel library to generate the final executable file.

Figure 3 Simplified compilation process

After the compilation and installation are finished, the compilation results are generated in the CMAKE_INSTALL_PREFIX directory. The final executable file is generated in the directory where CMake executes the command.

out
├── lib
│   ├── libkernels1.a
│   └── libkernels2.so
└── include
    ├── kernels1
    │   ├── aclrtlaunch_matmul_custom.h
    │   └── aclrtlaunch_add_custom.h
    └── kernels2
        ├── aclrtlaunch_xxx.h
        └── ...

You can use the msobjdump tool to parse the library file generated in the lib directory to obtain the kernel information. For details, see msobjdump.

Script for Generating and Verifying the Input Data and Truth Value

The following uses the add_custom operator with a static shape as an example to describe how to generate the input data and truth value. Write the script based on the input and output of the operator to generate the input data and the truth value.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
# Copyright 2022-2023 Huawei Technologies Co., Ltd
import numpy as np

def gen_golden_data_simple():
    input_x = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
    input_y = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
    golden = (input_x + input_y).astype(np.float16)

    input_x.tofile("./input/input_x.bin")
    input_y.tofile("./input/input_y.bin")
    golden.tofile("./output/golden.bin")

if __name__ == "__main__":
    gen_golden_data_simple()

The following is an example of the script that verifies whether the output data is consistent with the truth value. NumPy APIs are used to compute the absolute and relative errors between the output data and the truth value. If the deviation is within the tolerable range, the precision meets the requirement and test pass is displayed.

import os
import sys
import numpy as np

loss = 1e-3 # Tolerable deviation. For fp16, the absolute error and relative error should be less than 1‰.
minimum = 10e-10

def verify_result(real_result, golden):
    real_result = np.fromfile(real_result, dtype=np.float16) # Read the actual operation result from the binary file.
    golden = np.fromfile(golden, dtype=np.float16) # Read the expected operation result from the binary file.
    result = np.abs(real_result - golden) # Compute the deviation between the actual result and the expected result.
    deno = np.maximum(np.abs(real_result), np.abs(golden)) # Elementwise maximum of the absolute values, used as the denominator of the relative error.
    result_atol = np.less_equal(result, loss) # Check whether the absolute error is within tolerance.
    result_rtol = np.less_equal(result / np.add(deno, minimum), loss) # Check whether the relative error is within tolerance.
    if not result_rtol.all() and not result_atol.all():
        if np.sum(result_rtol == False) > real_result.size * loss and np.sum(result_atol == False) > real_result.size * loss:
            print("[ERROR] result error")
            return False
    print("test pass")
    return True

if __name__ == '__main__':
    verify_result(sys.argv[1], sys.argv[2])
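The tolerance logic above can be exercised without binary files. The following self-contained sketch (the array shapes and the bias value are illustrative) applies the same combined absolute/relative error check to synthetic data:

```python
import numpy as np

LOSS = 1e-3       # tolerable deviation: absolute and relative error below 1 per mille
MINIMUM = 10e-10  # guard against division by zero in the relative error

def within_tolerance(real_result, golden):
    """Return True if real_result matches golden within the combined tolerance."""
    diff = np.abs(real_result - golden)
    deno = np.maximum(np.abs(real_result), np.abs(golden))
    atol_ok = np.less_equal(diff, LOSS)                     # absolute error within tolerance
    rtol_ok = np.less_equal(diff / (deno + MINIMUM), LOSS)  # relative error within tolerance
    if not rtol_ok.all() and not atol_ok.all():
        # Fail only if more than LOSS of the elements violate both tolerances.
        if (np.sum(~rtol_ok) > real_result.size * LOSS and
                np.sum(~atol_ok) > real_result.size * LOSS):
            return False
    return True

x = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
y = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
golden = (x + y).astype(np.float16)

print(within_tolerance(golden, golden))                    # identical data passes: True
print(within_tolerance(golden + np.float16(5.0), golden))  # a constant bias of 5 fails: False
```

Note that the check passes when either the absolute or the relative criterion holds for nearly all elements, matching the dual-criterion logic of verify_result.py.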

Modifying and Executing the Script for One-Click Compilation and Running

You can refer to the one-click script provided in the sample to quickly compile and run the Ascend C operator on the CPU or NPU. The one-click compilation and running script provides the following functions.

Figure 4 Process of operator compiling and running in one-click mode

The one-click compilation and running script provided in the sample does not apply to all operator runtime verification scenarios. Modify the script based on the actual situation.

  • Write your own script for generating the input data and truth values based on the algorithm principles of the Ascend C operator.

After writing the preceding files, you can run the script for one-click compilation and running.

Execute the script as follows; Table 3 describes the script parameters:
bash run.sh --run-mode=npu --soc-version=<soc_version> --install-path=<install_path> --build-type=Debug --install-prefix=<install-prefix>

Abbreviated form:
bash run.sh -r npu -v <soc_version> -i <install_path> -b Debug -p <install-prefix>
Table 3 Script parameters

  • --run-mode (-r): Whether the operator runs on the CPU or the NPU. The value can be cpu or npu. The default value is npu.
  • --soc-version (-v): Model of the AI processor where the operator runs. To obtain <soc_version>, run the npu-smi info command on the server where the Ascend AI Processor is installed to obtain the Chip Name information; the actual value is Ascend<Chip Name>. For example, if Chip Name is xxxyy, the actual value is Ascendxxxyy. The following models are supported:
      • Atlas Training Series Product
  • --install-path (-i): CANN installation path. Change it to the actual path. Default: $HOME/Ascend/ascend-toolkit/latest
  • --build-type (-b): Compilation mode. Release builds the final release version, which contains no debugging information; Debug builds a debug version, which contains debugging information for development and debugging. The default value is Debug.
  • --install-prefix (-p): Prefix of the installation path used when CMake executes the install command. The compilation results (the target specified in ascendc_library and the corresponding header files) are installed in this path. Defaults to the out directory under the current directory.

If test pass appears in the output after the script executes, the operator precision meets the requirement.