Overview

After implementing the operator kernel and the host-side tiling, you can launch the operator kernel directly by using AscendCL runtime APIs. This simple, direct method enables quick verification of operator functions, and tiling development is not restricted by the CANN framework.

The following figure shows the development process of a kernel launch operator.

  1. Set up the environment.

    For details about how to install the CANN software, see Environment Setup.

  2. Implement operators.

    For details about the operator implementation on the kernel and tiling implementation on the host, see Operator Implementation.

  3. Call operators.
    Develop the kernel launch project: write the CMake configuration files, organize the code files as in the following example project, and then build and run the project.
    |-- cmake                                                 // CMake build file
    |-- CMakeLists.txt                                        // CMake build configuration file
    |-- my_add.cpp                                            // Operator implementation on the kernel
    |-- main.cpp                                              // Kernel launch project

Kernel launch can be performed on the CPU or NPU.

  • CPU: Kernel launch is implemented by the ICPU_RUN_KF CPU debugging macro and other APIs provided by the CPU debugging library.
  • NPU: Kernel launch is implemented by the kernel launch API or the kernel launch symbol <<<>>> and the runtime API provided by AscendCL.

The following figure shows the principle of kernel launch on the CPU and NPU.

Figure 1 Kernel launch on the CPU and NPU

You can perform basic runtime verifications following the process described in Kernel Launch. APIs used in the process are as follows.

Operator programs written against the NPU kernel launch APIs are compiled by the BiSheng compiler and run to verify operator behavior on the NPU. Operator programs written against the CPU kernel launch API (the ICPU_RUN_KF macro) are compiled by the standard GCC compiler and run to verify operator behavior on the CPU.

A program running on the CPU can be debugged step by step with GDB to verify precisely whether the execution process meets expectations. For details about CPU debugging, see Debugging on the CPU.

Kernel Launch Method

In the current version, obtaining user workspace is not supported.

Use ACLRT_LAUNCH_KERNEL to launch the kernel:

ACLRT_LAUNCH_KERNEL(kernel_name)(blockDim, stream, argument list);
  • kernel_name: name of the operator kernel function.
  • blockDim: number of cores on which the kernel function will be executed. A logical ID, that is, block_idx, is allocated to each core that executes the kernel function, and block_idx can be obtained by calling GetBlockIdx during the kernel function implementation.
  • stream: of the aclrtStream type. Streams preserve the order of asynchronous operations executed on the device. For details about management APIs for stream creation and other operations, see Stream Management.
  • argument list: list of arguments, which must be the same as the argument list of the kernel function.

The following example calls a kernel function named add_custom, which adds two vectors:

// blockDim is set to 8, so the add_custom kernel function is launched on eight cores; each core executes the kernel function independently in parallel. The arguments of the kernel function are x, y, and z.
ACLRT_LAUNCH_KERNEL(add_custom)(8, stream, x, y, z);

Kernel Launch Symbol

The kernel function can use the kernel launch symbol <<<...>>> to specify the execution configuration of the kernel function.

kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);
  • blockDim specifies the number of cores on which a kernel function will be executed. A logical ID, that is, block_idx, is allocated to each core that executes the kernel function, and block_idx can be obtained by calling GetBlockIdx during the kernel function implementation.

    blockDim is a concept about logical cores, and its value range is [1, 65535]. To fully utilize hardware resources, set this parameter to the number of physical cores or a multiple of the number of physical cores. For the coupled architecture and separated architecture, the meaning and setting rules of blockDim during running are different. The details are as follows:

    • Coupled architecture: Because the Vector and Cube Units are integrated, blockDim is used to start multiple AI Core instances, without differentiating Vector Units and Cube Units. The number of AI Cores can be obtained by calling GetCoreNumAiv or GetCoreNumAic.
    • Separated architecture
      • For operators that contain only Vector Units, blockDim is used to set the number of vector (AIV) instances to be started. For example, if an AI processor has 40 Vector cores, set blockDim to 40.
      • For operators that contain only Cube Units, blockDim is used to set the number of cube (AIC) instances to be started. For example, if an AI processor has 20 Cube cores, set blockDim to 20.
      • Operators that fuse Vector and Cube computing are started in groups of AIVs and AICs, and blockDim sets the number of groups to start. For example, if an AI processor has 40 Vector cores and 20 Cube cores, a group consists of two Vector cores and one Cube core, so set blockDim to 20; 20 groups are then started, covering all 40 Vector cores and 20 Cube cores. Note: In this scenario, blockDim (the number of logical groups) cannot exceed the number of physical cores (each physical core contains two Vector cores and one Cube core).
      • The number of AIC and AIV cores can be obtained by calling GetCoreNumAic and GetCoreNumAiv, respectively.
  • l2ctrl is a reserved parameter and can be ignored. This parameter is fixed at nullptr.
  • stream is of the aclrtStream type. Streams preserve the order of asynchronous operations executed on the device. For details about management APIs for stream creation and other operations, see Stream Management.

The following example calls a kernel function named add_custom, which adds two vectors:

// blockDim is set to 8, so the add_custom kernel function is launched on eight cores; each core executes the kernel function independently in parallel. The arguments of the kernel function are x, y, and z.
add_custom<<<8, nullptr, stream>>>(x, y, z);

For details about calling a kernel function by using the kernel launch symbol, see Defining and Calling Kernel Functions.

The kernel function is called asynchronously: after the call, control returns to the host immediately. Call aclrtSynchronizeStream to block the host program until all kernel functions in the stream have completed.

aclError aclrtSynchronizeStream(aclrtStream stream);

For details about how to use aclrtSynchronizeStream, see "AscendCL API > Synchronization > aclrtSynchronizeStream" in CANN AscendCL Application Software Development Guide (C&C++).