CPU Twin Debugging

This section describes how to debug in the CPU domain: verifying the kernel function on the CPU, debugging with GDB, and printing values with printf.

  • Currently, CPU domain debugging is supported based on the sample project for directly debugging a kernel.

    In the heterogeneous compilation scenario, CPU twin debugging is not yet supported when you compile from the command line or with a CMake file. Support will be added gradually in later versions.

  • During CPU debugging, you can configure log-related environment variables to record how the program runs and which exceptions occur, which helps you debug functions.

    For details about the restrictions and description of environment variables, see Logs.

Verifying the Kernel Function on the CPU

On a non-Ascend device, you can use the CPU simulation environment to develop and test operators, and then move to an Ascend device for accelerated computation once the operators are ready. Compilation and Running describes how the operator kernel program is compiled and run in the NPU domain. In the CPU domain, by contrast, the operator kernel program is compiled with the standard GCC compiler and linked to the CPU debugging library, and the resulting executable file is run to verify the operator in the CPU domain. Because the program runs on the CPU, you can step through it with GDB, a general-purpose debugger, to check precisely whether execution proceeds as expected.

Figure 1 Comparison of the kernel function running logic between the CPU and NPU domains

Based on the sample project for directly debugging a kernel, the code of the CPU and NPU domains can be unified when the kernel function is called using the ACLRT_LAUNCH_KERNEL API. This method supports only the following models:

  • Atlas A3 training products / Atlas A3 inference products
  • Atlas A2 training products / Atlas A2 inference products
  • Atlas inference products

If the kernel function is called using <<<>>>, you must use separate debugging APIs for operations such as memory allocation, and isolate the CPU-domain and NPU-domain code with macros.

The following code uses the add_custom operator as an example to show how to write the application that calls the operator when the kernel function is verified on the CPU (the kernel function is called using ACLRT_LAUNCH_KERNEL). When you implement your own application, adapt the parts that depend on the operator kernel function, such as the kernel function name and the input and output parameters, and allocate memory, copy memory, and read and write files accordingly. The calls to the related APIs can be reused directly.

  1. Include header files as needed.
    #include "data_utils.h"
    #include "acl/acl.h"
    #include "aclrtlaunch_add_custom.h"
    
  2. Write the application framework.
    int32_t main(int32_t argc, char* argv[])
    {
        uint32_t blockDim = 8;
        size_t inputByteSize = 8 * 2048 * sizeof(uint16_t);
        size_t outputByteSize = 8 * 2048 * sizeof(uint16_t);
        // Program for calling the operator
        return 0;
    }
    
  3. Verify the running.
        // Initialization
        CHECK_ACL(aclInit(nullptr));
        // Allocate runtime resources.
        int32_t deviceId = 0;
        CHECK_ACL(aclrtSetDevice(deviceId));
        aclrtStream stream = nullptr;
        CHECK_ACL(aclrtCreateStream(&stream));
        // Allocate the host memory.
        uint8_t *xHost, *yHost, *zHost;
        uint8_t *xDevice, *yDevice, *zDevice;
    
        CHECK_ACL(aclrtMallocHost((void**)(&xHost), inputByteSize));
        CHECK_ACL(aclrtMallocHost((void**)(&yHost), inputByteSize));
        CHECK_ACL(aclrtMallocHost((void**)(&zHost), outputByteSize));
        // Allocate the device memory.
        CHECK_ACL(aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMalloc((void**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        CHECK_ACL(aclrtMalloc((void**)&zDevice, outputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));
        // Initialize the host memory.
        ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize);
        ReadFile("./input/input_y.bin", inputByteSize, yHost, inputByteSize);
        // Copy data from the host to the device.
        CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
        CHECK_ACL(aclrtMemcpy(yDevice, inputByteSize, yHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE));
        // Call ACLRT_LAUNCH_KERNEL to use the kernel function to complete the specified operation.
        ACLRT_LAUNCH_KERNEL(add_custom)(blockDim, stream, xDevice, yDevice, zDevice);
        CHECK_ACL(aclrtSynchronizeStream(stream));
        // Copy the computation result from the device to the host.
        CHECK_ACL(aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST));
        WriteFile("./output/output_z.bin", zHost, outputByteSize);
        // Release allocated resources.
        CHECK_ACL(aclrtFree(xDevice));
        CHECK_ACL(aclrtFree(yDevice));
        CHECK_ACL(aclrtFree(zDevice));
        CHECK_ACL(aclrtFreeHost(xHost));
        CHECK_ACL(aclrtFreeHost(yHost));
        CHECK_ACL(aclrtFreeHost(zHost));
        // Perform deinitialization.
        CHECK_ACL(aclrtDestroyStream(stream));
        CHECK_ACL(aclrtResetDevice(deviceId));
        CHECK_ACL(aclFinalize());
    

To allow the same code to be used in the CPU and NPU domains, only a subset of the acl APIs has been adapted. When you use CPU domain debugging, you can call only the following acl APIs:

  • APIs that have actual functionality and can be called in the CPU domain
    • aclDataTypeSize, aclFloat16ToFloat, and aclFloatToFloat16
    • aclrtMalloc, aclrtFree, aclrtMallocHost, aclrtFreeHost, aclrtMemset, aclrtMemsetAsync, aclrtMemcpy, aclrtMemcpyAsync, aclrtMemcpy2d, aclrtMemcpy2dAsync, aclrtCreateContext, and aclrtDestroyContext
  • APIs that have no actual functionality and are implemented as stubs
    • Profile data collection

      aclprofInit, aclprofSetConfig, aclprofStart, aclprofStop, and aclprofFinalize

    • System configuration

      aclInit, aclFinalize, and aclrtGetVersion

    • Runtime management

      aclrtSetDevice, aclrtResetDevice, aclrtCreateStream, aclrtCreateStreamWithConfig, aclrtDestroyStream, aclrtDestroyStreamForce, aclrtSynchronizeStream, aclrtCreateContext, and aclrtDestroyContext

GDB Debugging

You can use GDB to step through the operator and check its computation precision. CPU debugging now runs in multi-process mode: each core is started as an independent subprocess. GDB therefore needs to be set up for subprocess debugging. For the Atlas inference products and Atlas training products, each core starts one subprocess. For the Atlas A2 training products / Atlas A2 inference products, each core starts three subprocesses: one cube subprocess and two vector subprocesses.

  • Debug a single subprocess.

    Start GDB. In the example, add_custom_cpu is the executable file of the operator in the CPU domain. To generate it, set run-mode to cpu in the one-click compilation and running script, as described in Modifying and Executing the Script for One-Click Compilation and Running.

    After GDB is started, set follow-fork-mode so that GDB follows the subprocess, and then set breakpoints. The program then stops in the subprocess. However, with this method the program stops only in the first subprocess that hits a breakpoint; the other subprocesses and the main process continue to execute until they exit. Operators that involve inter-core synchronization cannot be debugged in this way.
    gdb --args add_custom_cpu  // Start the GDB. add_custom_cpu is the executable file of the operator.
    (gdb) set follow-fork-mode child
  • Debug multiple subprocesses.

    If inter-core synchronization is involved, multiple subprocesses need to be debugged in parallel.

    After GDB is started, turn off detach-on-fork so that GDB keeps both processes under its control after a fork: one process is debugged and the other is suspended. The command is as follows:

    (gdb) set detach-on-fork off
    

    Run the following command to view the current debugging mode:

    (gdb) show detach-on-fork
    

    Set a catchpoint on fork so that GDB stops each time a subprocess is about to be started. The command is as follows:

    (gdb) catch fork
    

    After you execute r (run), you can view the current process information.

    (gdb) info inferiors
      Num  Description
    * 1    process 19613
    

    When fork is caught for the first time, the program stops at the fork call in the main process, and the subprocess has not yet been generated.

    After you execute c (continue), run info inferiors again. You can see that the first subprocess has been started.

    (gdb) info inferiors
      Num  Description 
    * 1    process 19613
      2    process 19626
    

    You can now switch to inferior 2, that is, the first subprocess, and add breakpoints for debugging. The main process is suspended while you do so.

    (gdb) inferior 2
    [Switching to inferior 2 [process 19626] ($HOME/demo)]
    (gdb) info inferiors
      Num  Description
      1    process 19613
    * 2    process 19626
    

    Note that the number following inferior is the inferior number shown in the Num column, not the process ID.

    If synchronization is blocked, you can switch back to the main process to continue generating subprocesses, and then switch to a new subprocess for debugging. After the synchronization conditions are met, switch back to the first subprocess to continue execution.

The following is an example of commands for debugging a single subprocess:

gdb --args add_custom_cpu
set follow-fork-mode child
break add_custom.cpp:45
run
list
backtrace
print i
break add_custom.cpp:56
continue
display xLocal
quit

printf

Add printf(...) calls to the code to print the values you want to observe. The sample code is as follows:
printf("xLocal size: %d\n", xLocal.GetSize()); 
printf("tileLength: %d\n", tileLength);