Kernel Launch Based on a Sample Project

This section describes how to perform kernel launch based on a sample project, which will not be further evolved. You are advised to use the command line or compile a CMake file for operator build. For details, see AI Core Operator Compilation.

The following uses the Add vector operator as an example to describe the process of operator development based on kernel launch.

For more sample operator projects, visit the following links:

Environment Setup

Before using the kernel launch operator project, install the driver firmware and CANN by referring to Environment Setup to set up the development environment and operating environment.
This operator project requires CMake 3.16 or later. If the version does not meet the requirement, run the following command (an example) to upgrade the CMake to 3.16.0.
```
wget https://cmake.org/files/v3.16/cmake-3.16.0.tar.gz 
tar -zxvf cmake-3.16.0.tar.gz 
cd cmake-3.16.0
 ./bootstrap --prefix=/usr 
sudo make 
sudo make install
```

Project Directory

Click Vector Operator Sample to obtain a complete sample of kernel function development and runtime verification. Structure of the sample directory:

AddKernelInvocationNeo
|-- cmake                                                 // CMake build file
|-- scripts
|  ├── gen_data.py                                     // Script for generating the input data and truth value
|  ├── verify_result.py                                // Script for checking whether the output data is consistent with the truth value
|-- CMakeLists.txt                                        // CMake build configuration file
|-- add_custom.cpp                                        // Kernel implementation of the vector operator
|-- data_utils.h                                          // Data read and write functions
|-- main.cpp                                              // Main function for calling operators, on the CPU or NPU
|-- run.sh                                                // Script for operator build and running

Operator development procedure based on the operator project:

Complete kernel implementation of the operator.
Compile the application (main.cpp) that calls the operator.
Compile the CMake build configuration file: CMakeLists.txt.
Modify the gen_data.py script for generating the input data and true value data based on the site requirements. Modify the verify_result.py script for verifying whether the output data is consistent with the truth value data.
Modify the run.sh script for compiling and running the operator as required and execute the script to compile and run the operator and verify the result.

Operator Implementation on the Kernel

Compile the Ascend C operator implementation file by referring to the kernel implementations of the vector, cube, and fused operators in the project directory.

An operator kernel implementation file can define only one kernel function.

Operator Calling Program

The following code uses the add_custom operator with a static shape as an example. It describes how to compile the main.cpp file. When implementing your own applications, pay attention to the modifications caused by different operator kernel functions, including different operator kernel function names and input and output parameters. Properly allocate memory, copy memory, and read/write files. You can directly reuse the calling methods of related APIs.

Include header files as required. The ASCENDC_CPU_DEBUG macro is used to distinguish the header files to be included on the CPU/NPU. Note that the header file aclrtlaunch_{kernel_name}.h (automatically generated by the project framework) where the declaration of the corresponding kernel function calling API is located must be included on the NPU. kernel_name indicates the name of the operator kernel function.

        
             #include "data_utils.h"
#ifndef ASCENDC_CPU_DEBUG
#include "acl/acl.h"
#include "aclrtlaunch_add_custom.h"
#else
#include "tikicpulib.h"
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z);
#endif

Compile the application framework. The application uses the ASCENDC_CPU_DEBUG macro to determine whether the code logic runs on the CPU or NPU.

        
             int32_t main(int32_t argc, char* argv[])
{
    uint32_t blockDim = 8;
    size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);
    size_t outputByteSize = 8 * 2048 * sizeof(uint16_t);

#ifdef ASCENDC_CPU_DEBUG
    // Program for calling CPU debugging APIs
#else
    // Program for calling the operator on the NPU
#endif
    return 0;
}

Verify the running on the CPU. To implement the runtime verification of the operator kernel function on the CPU, perform the following steps.

Figure 1 Runtime verification on the CPU

        
                 // Call GmAlloc to allocate shared memory and initialize data.
    uint8_t* x = (uint8_t*)AscendC::GmAlloc(inputByteSize);
    uint8_t* y = (uint8_t*)AscendC::GmAlloc(inputByteSize);
    uint8_t* z = (uint8_t*)AscendC::GmAlloc(outputByteSize);

    ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);
    ReadFile("./input/input_y.bin", inputByteSize, y, inputByteSize);
    // Set the kernel mode to AIV for vector operators.
    AscendC::SetKernelMode(KernelMode::AIV_MODE);
    // Call the ICPU_RUN_KF debugging macro to call the kernel function on the CPU.
    ICPU_RUN_KF(add_custom, blockDim, x, y, z);
    // Write output data.
    WriteFile("./output/output_z.bin", z, outputByteSize);
    // Call GmFree to release allocated resources.
    AscendC::GmFree((void *)x);
    AscendC::GmFree((void *)y);
    AscendC::GmFree((void *)z);

Verify the running on the NPU. To implement the runtime verification of the operator kernel function on the NPU, perform the following steps.

Figure 2 Runtime verification on the NPU

         
          
            
            
                  // Initialization
    CHECK_ACL(aclInit(nullptr));
    // Allocate runtime resources.
    int32_t deviceId = 0;
    CHECK_ACL(aclrtSetDevice(deviceId));
    aclrtStream stream = nullptr;
    CHECK_ACL(aclrtCreateStream(&stream));
    // Allocate the host buffer.
    uint8_t *xHost, *yHost, *zHost;
    uint8_t *xDevice, *yDevice, *zDevice;

    CHECK_ACL(aclrtMallocHost((void**)(&xHost), inputByteSize));
    CHECK_ACL(aclrtMallocHost((void**)(&yHost), inputByteSize));
    CHECK_ACL(aclrtMallocHost((void**)(&zHost), outputByteSize));
    // Allocate the device buffer.
    CHECK_ACL(aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
    CHECK_ACL(aclrtMalloc((void**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
    CHECK_ACL(aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
    // Initialize the host buffer.
    ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
    ReadFile("./input/input_y.bin", inputByteSize, yHost, inputByteSize);
    // Copy data from the host to the device.
    CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
    CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, yHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
    // Use the kernel launch symbol <<<>>> to call the kernel function to complete the specified operation. <<<>>> call is encapsulated in add_custom_do.
    add_custom_do(blockDim, nullptr, stream, xDevice, yDevice, zDevice);
    // Call ACLRT_LAUNCH_KERNEL to use the kernel function to complete the specified operation.
    // ACLRT_LAUNCH_KERNEL(add_custom)(blockDim, stream, xDevice, yDevice, zDevice);
    CHECK_ACL(aclrtSynchronizeStream(stream));
    // Copy the computation result from the device to the host.
    CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST));
    WriteFile("./output/output_z.bin", zHost, outputByteSize);
    // Release allocated resources.
    CHECK_ACL(aclrtFree(xDevice));
    CHECK_ACL(aclrtFree(yDevice));
    CHECK_ACL(aclrtFree(zDevice));
    CHECK_ACL(aclrtFreeHost(xHost));
    CHECK_ACL(aclrtFreeHost(yHost));
    CHECK_ACL(aclrtFreeHost(zHost));
    // Perform deinitialization.
    CHECK_ACL(aclrtDestroyStream(stream));
    CHECK_ACL(aclrtResetDevice(deviceId));
    CHECK_ACL(aclFinalize());

             

           

         
        

The <<<>>> calling method has been described in Kernel Function. This section describes only the usage of ACLRT_LAUNCH_KERNEL.

          
               ACLRT_LAUNCH_KERNEL(kernel_name)(blockDim, stream, argument list);

kernel_name: name of the operator kernel function.
blockDim: number of cores on which the kernel function will be executed. A logical ID, that is, block_idx, is allocated to each core that executes the kernel function, and block_idx can be obtained by calling GetBlockIdx during the kernel function implementation.
stream, aclrtStream type. Streams preserve the order of a stack of asynchronous operations being executed on the device. For details about management APIs for stream creation and other operations, see "Stream Management".
argument list: list of arguments, which must be the same as the argument list of the kernel function.

Compiling the CMake Build Configuration File

This section describes some key environment variables and CMake command parameters in the CMake file. Generally, you do not need to modify these parameters, but you can use them to better understand compilation principles and facilitate CMake customization.

**Table 1** Description of environment variables
Environment Variable	Configuration Description
SOC_VERSION	AI processor model. For the following products: Run the npu-smi info command on the server where Ascend AI Processor is installed to obtain the Name information. The actual value is AscendName. For example, if Name is xxxyy, the actual value is Ascendxxxyy. Atlas A2 training products / Atlas A2 inference products Atlas 200I/500 A2 inference products Atlas inference products Atlas training products For the following products: Run the npu-smi info -t board -i id -c chip_id command on the server where Ascend AI Processor is installed to obtain the Chip Name and NPU Name information. The actual value is Chip Name_NPU Name. For example, if the value of Chip Name is Ascendxxx and the value of NPU Name is 1234, the actual value is Ascendxxx_1234. Note that: id: device ID, which is the NPU ID obtained by running the npu-smi info -l command. chip_id: chip ID, which is obtained by running the npu-smi info -m command. Atlas A3 training products / Atlas A3 inference products
ASCEND_CANN_PACKAGE_PATH	Actual installation path of CANN.
CMAKE_BUILD_TYPE	Compilation mode options: Release: release version, which does not contain debugging information. The final release version is compiled. Debug: debug version, which contains debugging information for you to develop and debug.
CMAKE_INSTALL_PREFIX	Specifies the prefix of the installation path when the CMake executes the install command. Compilation results (target specified in ascendc_library and the corresponding header file) are installed in this path. The path is defaulted to the out directory in the current directory.
CMAKE_CXX_COMPILER_LAUNCHER	Configures the C++ compiler (such as g++) and BiSheng Compiler launcher as ccache. After configuration, you can enable cache compilation to accelerate repeated compilation and improve the compilation efficiency. Before using this function, you need to install ccache. The following shows how to perform configuration in the corresponding CMakeLists.txt file: set(CMAKE_CXX_COMPILER_LAUNCHER <launcher_program>) <launcher_program> indicates the ccache installation path, for example, /usr/bin/ccache. set(CMAKE_CXX_COMPILER_LAUNCHER /usr/bin/ccache)

**Table 2** Syntax of the CMake command
CMake Command	Description
add_executable	Adds an executable to the project using the specified source file. The method is the same as that of using common command parameters of CMake.
ascendc_library	Adds a library to a project using the specified kernel function source file. The syntax format is as follows: ascendc_library(<target_name> [STATIC \| SHARED] [<source>...]) <target_name> indicates the name of the library file. The library file is created based on the source files listed in the command. STATIC and SHARED are used to specify the generated library type. The STATIC library is the archive file of the target file and is used when other targets are linked. The SHARED library is dynamically linked (dynamic link library) and loaded at run time. <source> indicates the source file of the kernel function.
ascendc_fatbin_library	Compiles a kernel binary file using the specified kernel function source file, which is used by the kernel loading and execution APIs. The syntax format is as follows: ascendc_fatbin_library(<target_name> [<source>...]) <target_name> indicates the name of the library file. The library file is compiled from the kernel function source files listed in the command, generating a <target_name>.o file. This file is stored in the directory ${CMAKE_INSTALL_PREFIX}/fatbin/${target_name}/. <source> indicates the source file of the kernel function. NOTE: The process of calling the kernel loading and execution APIs is different from the process of calling <<<...>>> and other APIs described before. For details, see "Kernel Loading and Execution" in the Application Development Guide (C&C++). This compilation option does not support printf, DumpTensor, DumpAccChkPoint, and assert.
ascendc_compile_definitions	Adds a compilation macro. You can add custom compilation macros and compilation macros provided by the Ascend C. The syntax format is as follows: ascendc_compile_definitions(<target_name> [PRIVATE] [<xxx>...]) Compilation macros provided by the Ascend C: ASCENDC_DUMP is used to control the dump function. By default, the dump function is enabled. After you call printf/DumpTensor/assert, information is printed. (Note that the host function exists in the kernel file of the launch project. If printf is called in the host function, the printf initialization in the kernel will be triggered, affecting the kernel execution performance.) If this macro is defined as 0, the dump function is disabled. The following is an example: // Disable the printf printing function of all operators. ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE ASCENDC_DUMP=0 ) ASCENDC_DEBUG is used to control the debugging function of Ascend C APIs. By default, the debugging function is disabled. After this compilation macro is added, the debugging function is enabled. In this case, the assert verification inside the API takes effect. If the verification fails, assert logs are printed. The printing function affects the actual running performance of the operator. Therefore, this function is usually used in the debugging phase. The following is an example: ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE ASCENDC_DEBUG ) Currently, the ASCENDC_DEBUG function is supported by the following product models: Atlas inference products Atlas A2 training products / Atlas A2 inference products HAVE_WORKSPACE is used to indicate whether the kernel entrypoint includes a workspace argument. By default, the workspace argument is not included. After this compilation macro is added, the workspace argument is included. In this case, the framework obtains the last parameter (when HAVE_TILING is not configured) or the second-to-last parameter (when HAVE_TILING is configured) of the kernel argument list, and automatically sets the system workspace on the kernel. The workspace obtained from the argument on the kernel is the user workspace after the system workspace is offset. When you use high-level APIs that require the system workspace, such as Matmul Kernel APIs, you are advised to enable this function. The argument layout and system workspace setting logic are the same as those of Project-based Operator Development, reducing the modification cost caused by switching between different development modes. Note that developers on the host still need to allocate workspace space. The size of the system workspace can be obtained through the GetLibApiWorkSpaceSize API of PlatformAscendCManager. A configuration example of HAVE_WORKSPACE is as follows: ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE HAVE_WORKSPACE ) HAVE_TILING indicates whether the kernel entrypoint includes a tiling argument. This compilation macro takes effect only after HAVE_WORKSPACE is configured. By default, the tiling argument is not included and the macro is disabled. After the compilation macro is added, the tiling argument is included. In this case, the framework considers the last parameter in the kernel argument list as tiling argument and the second-to-last parameter as the workspace argument. The framework does not process the tiling argument. It only determines the location of the workspace parameter based on the argument. This compilation macro can be used to ensure that the argument is the same as that of Project-based Operator Development, reducing the modification cost caused by switching between different development modes of operator implementation. A configuration example is as follows: ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE HAVE_WORKSPACE HAVE_TILING )
ascendc_compile_options	Adds compilation options. You can add compilation options for the compilation on the host and device. The syntax format is as follows: ascendc_compile_options(<target_name> PRIVATE [<xxx>...] ) By default, the specified compilation options are transferred to the compiler on the device for compilation. To transfer compilation options to the host compiler, use -forward-options-to-host-compiler. The following is an example: ascendc_compile_options(<target_name> PRIVATE -g -forward-options-to-host-compiler -gdwarf-4 ) As shown in the preceding code, during compilation, the -g compilation option is transferred to the compiler on the device, and the -gdwarf-4 compilation option is transferred to the compiler on the host. Note: Compilation options on the host support only the compilation options supported by both the G++ and Clang compilers.
ascendc_include_directories	Adds custom search path of the header file. The syntax format is as follows: ascendc_include_directories(<target_name> [PRIVATE] [<xxx>...])

The following figure shows the simplified compilation process: Compile the operator kernel function source file to generate a library file (*.so or *.a library file) on the kernel. The project framework automatically generates the header file for the declaration of the kernel function calling API. When compiling main.cpp (operator calling program), the preceding header file is used. The target file generated by the compiled program is linked to the library file on the kernel to generate the final executable file.

Figure 3 Simplified compilation process

After the compilation and installation are finished, the compilation results are generated in the CMAKE_INSTALL_PREFIX directory. The final executable file is generated in the directory where CMake executes the command.

out
├── lib 
│   ├── libkernels1.a
│   ├── libkernels2.so
├── include
│   ├── kernels1
│           ├── aclrtlaunch_matmul_custom.h
│           ├── aclrtlaunch_add_custom.h
│   ├── kernels2
│           ├── aclrtlaunch_xxx.h
│           ├── ...

You can use the msobjdump tool to parse the library file generated in the lib directory to obtain the kernel information. For details, see msobjdump.

Script for Generating and Verifying the Input Data and Truth Value

The following uses the add_custom operator with a static shape as an example to describe how to generate the input data and truth value. Compile the script based on the input and output of the operator to generate the input data and truth value.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
import numpy as np

def gen_golden_data_simple():
    input_x = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
    input_y = np.random.uniform(1, 100, [8, 2048]).astype(np.float16)
    golden = (input_x + input_y).astype(np.float16)

    input_x.tofile("./input/input_x.bin")
    input_y.tofile("./input/input_y.bin")
    golden.tofile("./output/golden.bin")

if __name__ == "__main__":
    gen_golden_data_simple()

The following is an example of the script for verifying whether the output data is consistent with the truth value. The NumPy API is used to compute the absolute error and relative error between the output data and the truth value. If the deviation is within the tolerable range, the precision meets the requirement and test pass is output.

import os
import sys
import numpy as np

loss = 1e-3 # Tolerable deviation. For fp16, the absolute error and relative error should be less than 1‰.
minimum = 10e-10

def verify_result(real_result, golden):
    real_result = np.fromfile(real_result, dtype=np.float16) # Read the actual operation result from the binary file.
    golden = np.fromfile(golden, dtype=np.float16) # Read the expected operation result from the binary file.
    result = np.abs(real_result - golden) # Compute the deviation between the actual result and the expected result.
    deno = np.maximum(np.abs(real_result), np.abs(golden)) # Obtain the maximum value and form a new array.
    result_atol = np.less_equal(result, loss) # Compute the absolute error.
    result_rtol = np.less_equal(result / np.add(deno, minimum), loss) # Compute the relative error.
    if not result_rtol.all() and not result_atol.all():
        if np.sum(result_rtol == False) > real_result.size * loss and np.sum(result_atol == False) > real_result.size * loss:
            print("[ERROR] result error")
            return False
    print("test pass")
    return True

if __name__ == '__main__':
    verify_result(sys.argv[1],sys.argv[2])

Modifying and Executing the Script for One-Click Compilation and Running

You can refer to the one-click script provided in the sample to quickly compile and run the Ascend C operator on the CPU or NPU. The one-click compilation and running script provide the following functions.

Figure 4 Process of operator compiling and running in one-click mode

The one-click compilation and running script provided in the sample does not apply to all operator runtime verification scenarios. Modify the script based on the actual situation.

Compile your script for generating input and truth values based on the algorithm principles of the Ascend C operator.

After compiling the preceding files, you can run script for one-click compilation and running.

The following table describes script parameters and how to execute the script:

bash run.sh --run-mode=npu  --soc-version=<soc_version> --install-path=<install_path>  --build-type=Debug  --install-prefix=<install-prefix>

bash run.sh -r npu  -v <soc_version> -i <install_path> -b Debug -p <install-prefix>

**Table 3** Script parameters
Parameter	Abbreviation	Description
--run-mode	-r	The operator runs in CPU or NPU mode. The value can be cpu or npu.
--soc-version	-v	Model of the AI processor where the operator runs. NOTE: The AI processor model <soc_version> can be obtained in the following ways: For the following products: Run the npu-smi info command on the server where Ascend AI Processor is installed to obtain the Name information. The actual value is AscendName. For example, if Name is xxxyy, the actual value is Ascendxxxyy. Atlas A2 training products / Atlas A2 inference products Atlas 200I/500 A2 inference products Atlas inference products Atlas training products For the following products: Run the npu-smi info -t board -i id -c chip_id command on the server where Ascend AI Processor is installed to obtain the Chip Name and NPU Name information. The actual value is Chip Name_NPU Name. For example, if the value of Chip Name is Ascendxxx and the value of NPU Name is 1234, the actual value is Ascendxxx_1234. Note that: id: device ID, which is the NPU ID obtained by running the npu-smi info -l command. chip_id: chip ID, which is obtained by running the npu-smi info -m command. Atlas A3 training products / Atlas A3 inference products
--install-path	-i	Set this parameter to the CANN installation path. Change it to the actual path. Default path: $HOME/Ascend/ascend-toolkit/latest
--build-type	-b	Compilation mode options: Release: release version, which does not contain debugging information. The final release version is compiled. Debug: debug version, which contains debugging information for you to develop and debug. Defaults to Debug.
--install-prefix	-p	Specifies the prefix of the installation path when the CMake executes the install command. Compilation results (target specified in ascendc_library and the corresponding header file) are installed in this path. The path is defaulted to the out directory in the current directory.

After the script execution, test pass in the output information indicates that the operator precision meets the requirements.

Parent topic: Appendix