Checking the Memory of the CANN Software Stack

When a user program may trigger memory exceptions while calling CANN software stack APIs, the msSanitizer tool can check memory usage for device-related APIs and AscendCL-related APIs, helping you locate the exception.

Memory Leak Detection Principles

If the device memory reported by the npu-smi info command keeps increasing, you can use this tool to locate the memory leak. If the leak occurs in AscendCL series API calls, the tool can also locate the specific code line.

As shown in Figure 1, the CANN software stack memory operation APIs consist of two levels: the lower-level device-side APIs provided by the driver and the upper-level AscendCL series APIs.
Figure 1 Memory Check

To locate a memory leak, perform the following steps:

  1. Enable leak detection for device series APIs to determine whether a memory leak occurs on the host. If not, the leak occurs on the device. If so, go to the next step to check whether the leak comes from AscendCL API calls.
  2. Enable leak detection for AscendCL series APIs to determine whether a leak occurs when user code calls AscendCL APIs. If not, the problem is not caused by AscendCL API calls. If so, go to the next step to locate the specific code line.
  3. Switch the program to the header file with the new APIs provided by the msSanitizer detection tool, recompile it, and then run it under the detection tool to locate the file name and line number of the allocation call whose memory is never freed. For details about the new APIs, see msSanitizer External APIs.
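
The decision logic of the three steps above can be sketched as a small shell helper. This is only an illustration: report_has_leak and triage_leak are hypothetical names, not part of msSanitizer, and the sketch assumes that the tool's report can be captured by redirecting its output to a file.

```shell
#!/bin/sh
# Sketch only: maps the outcome of the two leak checks to the next action.
# report_has_leak and triage_leak are hypothetical helpers, not msSanitizer APIs.

# Succeeds when a saved report contains the LeakCheck error banner.
report_has_leak() {
    grep -q 'ERROR: LeakCheck: detected memory leaks' "$1"
}

# $1: report saved from the device-API check (--check-device-heap=yes)
# $2: report saved from the AscendCL check   (--check-cann-heap=yes)
triage_leak() {
    if ! report_has_leak "$1"; then
        echo "no host-side leak: investigate the device side"
    elif ! report_has_leak "$2"; then
        echo "host-side leak outside AscendCL API calls"
    else
        echo "AscendCL leak: rebuild with the msSanitizer acl.h to get file and line"
    fi
}

# Possible usage (assumes the report is written to stdout or stderr):
#   mssanitizer --check-device-heap=yes --leak-check=yes ./add_npu > device.log 2>&1
#   mssanitizer --check-cann-heap=yes --leak-check=yes ./add_npu > cann.log 2>&1
#   triage_leak device.log cann.log
```

Keeping the two reports in separate files makes it possible to re-run the triage without re-running the program under test.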

Troubleshooting Procedure

  1. Configure environment variables by referring to Before You Start.
  2. Check whether memory leak occurs on the host.
    1. Use the msSanitizer tool to start the program to be checked. An example command is as follows:
      mssanitizer --check-device-heap=yes --leak-check=yes ./add_npu

      The path of the user program to be checked (for example, add_npu) can be either an absolute path or a relative path, depending on the actual situation.

    2. If no error information is displayed, the program ran properly and no memory leak occurred on the host. If the following error information is displayed, a memory leak occurred on the host.
      The following command output indicates that one memory allocation on the host was not freed, leaking 32800 bytes of memory.
      ====== ERROR: LeakCheck: detected memory leaks
        
      ======    Direct leak of 32800 byte(s) 
      ======      at 0x124080024000 on GM allocated in <unknown>:0 (serialNo:0)
        
      ====== SUMMARY: 32800 byte(s) leaked in 1 allocation(s)
      
  3. Check whether the leak is caused by AscendCL API calls.
    1. Use the msSanitizer tool to start the program to be checked. An example command is as follows:
      mssanitizer --check-cann-heap=yes --leak-check=yes ./add_npu
    2. If no error information is displayed, the program ran successfully and no memory leak occurred during AscendCL API calls. If the following error information is displayed, a memory leak occurred during AscendCL API calls.
      The following information indicates that one memory allocation made through an AscendCL API call was not freed, leaking 32768 bytes of memory.
      ====== ERROR: LeakCheck: detected memory leaks
      
      ======    Direct leak of 32768 byte(s) 
      ======      at 0x124080024000 on GM allocated in <unknown>:0 (serialNo:0)
      
      ====== SUMMARY: 32768 byte(s) leaked in 1 allocation(s)
      
  4. If a memory leak occurs, use the msSanitizer API header file acl.h and the corresponding dynamic library file provided by the msSanitizer tool to locate the code file and code line where the leak occurs.

    To do so, replace the original acl/acl.h header file in the user code with the msSanitizer API header file acl.h provided by the tool, link the libascend_acl_hook.so file to your application project, and recompile the project. For details, see Importing API Header Files and Linking Dynamic Libraries.

  5. Use the msSanitizer tool to restart the program. The following is a command example:
    mssanitizer --check-cann-heap=yes --leak-check=yes ./add_npu

    The following information indicates that one memory allocation made at line 55 of the application's main.cpp file was not freed. From this location you can find the cause of the memory leak.

    ====== ERROR: LeakCheck: detected memory leaks  
    
    ======    Direct leak of 32768 byte(s) 
    ======     at 0x124080024000 on GM allocated in main.cpp:55 (serialNo:0)
    
    ====== SUMMARY: 32768 byte(s) leaked in 1 allocation(s)
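
When the report is saved to a file, the leaked size and the allocation site can be pulled out with standard text tools. The sketch below assumes the report lines have exactly the shape shown in the examples above. leak_summary and leak_sites are hypothetical helper names, not part of msSanitizer.

```shell
#!/bin/sh
# Sketch only: extracts figures from a saved LeakCheck report.
# Assumes the report lines look like the examples in this section.

# Prints "<bytes> <allocations>" taken from the SUMMARY line.
leak_summary() {
    sed -n 's/^.*SUMMARY: \([0-9][0-9]*\) byte(s) leaked in \([0-9][0-9]*\) allocation(s).*$/\1 \2/p' "$1"
}

# Prints one "file:line" entry per "allocated in" line.
leak_sites() {
    sed -n 's/^.* allocated in \([^ ]*\) (serialNo:[0-9]*).*$/\1/p' "$1"
}

# Possible usage (assumes the report is written to stdout or stderr):
#   mssanitizer --check-cann-heap=yes --leak-check=yes ./add_npu > cann.log 2>&1
#   leak_summary cann.log
#   leak_sites cann.log
```

A site printed as <unknown>:0 suggests that the program was not rebuilt with the msSanitizer acl.h header and libascend_acl_hook.so, so no file or line information is available yet.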
    

Importing API Header Files and Linking Dynamic Libraries

The following uses the kernel launch symbol scenario as an example to describe how to import the msSanitizer API header file acl.h and link the dynamic library file. For other types of custom projects, adjust the build files according to the actual build script.

  1. Obtain the sample project for verifying the code.
    When downloading the code sample, run the following command to specify the branch version:
    git clone https://gitee.com/ascend/samples.git -b v0.2-8.0.0.beta1
  2. In the ${git_clone_path}/samples/operator/ascendc/0_introduction/3_add_kernellaunch/AddKernelInvocationNeo directory, replace the acl/acl.h header file included by the main.cpp file with the acl.h header file provided by msSanitizer.
    #include "data_utils.h"
    #ifndef ASCENDC_CPU_DEBUG
    // #include "acl/acl.h"
    // Replace acl/acl.h with acl.h.
    #include "acl.h"
    extern void add_custom_do(uint32_t blockDim, void *stream, uint8_t *x, uint8_t *y, uint8_t *z);
    #else
    
  3. Edit the CMakeLists.txt file in the ${git_clone_path}/samples/operator/ascendc/0_introduction/3_add_kernellaunch/AddKernelInvocationNeo directory and add the API header file path ${INSTALL_DIR}/tools/mssanitizer/include/acl and the dynamic library path ${INSTALL_DIR}/tools/mssanitizer/lib64/libascend_acl_hook.so. The example below takes the installation directory from the ASCEND_HOME_PATH environment variable.
    add_executable(ascendc_kernels_bbit ${CMAKE_CURRENT_SOURCE_DIR}/main.cpp)
    
    target_compile_options(ascendc_kernels_bbit PRIVATE
        $<BUILD_INTERFACE:$<$<STREQUAL:${RUN_MODE},cpu>:-g>>
        -O2 -std=c++17 -D_GLIBCXX_USE_CXX11_ABI=0 -Wall -Werror
    )
    # Import the path of the API header file during the compilation of the operator executable file.
    target_include_directories(ascendc_kernels_bbit PUBLIC
        $ENV{ASCEND_HOME_PATH}/tools/mssanitizer/include/acl)
    # Import the path of the libascend_acl_hook.so dynamic library when the operator executable file is linked.
    target_link_directories(ascendc_kernels_bbit PRIVATE
        $ENV{ASCEND_HOME_PATH}/tools/mssanitizer/lib64)
    
    target_link_libraries(ascendc_kernels_bbit PRIVATE
        $<BUILD_INTERFACE:$<$<OR:$<STREQUAL:${RUN_MODE},npu>,$<STREQUAL:${RUN_MODE},sim>>:host_intf_pub>>
        $<BUILD_INTERFACE:$<$<STREQUAL:${RUN_MODE},cpu>:ascendcl>>
        ascendc_kernels_${RUN_MODE}
        # Link the operator executable file to the libascend_acl_hook.so dynamic library.
        ascend_acl_hook
    )
    
  4. Set the environment variables and recompile the operator.

    Run the npu-smi info command on the server where the Ascend AI Processor is installed to obtain the Chip Name. The value to use is Ascend followed by the Chip Name. For example, if Chip Name is xxxyy, the value is Ascendxxxyy. When Ascendxxxyy is used as part of a code sample path, use the lowercase form ascendxxxyy.

    export LD_LIBRARY_PATH=${ASCEND_HOME_PATH}/tools/mssanitizer/lib64:$LD_LIBRARY_PATH
    mssanitizer --check-cann-heap=yes --leak-check=yes -- bash run.sh -r npu -v Ascendxxxyy
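
Before rebuilding, it can help to confirm that the hook header and library are where the CMake settings expect them, and to build the chip value from the Chip Name in one place. The following is a minimal sketch under those assumptions. hook_files_present, soc_version, and soc_path_name are hypothetical helper names, and the file locations are taken from the paths listed in step 3.

```shell
#!/bin/sh
# Sketch only: pre-build checks for the msSanitizer hook files.
# All function names here are hypothetical helpers, not msSanitizer APIs.

# Succeeds when acl.h and libascend_acl_hook.so exist under the given
# installation directory (for example, "$ASCEND_HOME_PATH").
hook_files_present() {
    [ -f "$1/tools/mssanitizer/include/acl/acl.h" ] &&
        [ -f "$1/tools/mssanitizer/lib64/libascend_acl_hook.so" ]
}

# Prefixes the Chip Name reported by npu-smi info with "Ascend".
soc_version() {
    printf 'Ascend%s\n' "$1"
}

# Lowercase form of the same value, for use inside sample paths.
soc_path_name() {
    soc_version "$1" | tr '[:upper:]' '[:lower:]'
}

# Possible usage:
#   hook_files_present "$ASCEND_HOME_PATH" || echo "msSanitizer hook files not found"
#   mssanitizer --check-cann-heap=yes --leak-check=yes -- bash run.sh -r npu -v "$(soc_version xxxyy)"
```

Checking for the files first gives a clearer message than a compile or link error raised later by the build.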