Checking Memory Statistics of Each CANN Component

  1. Check and analyze logs.
    Search for the keyword _svm_mem_stats_show in the user-mode plog on the host. The following is an example of the search result:
    run/plog/plog-4176893_20240629161823165.log:460:[INFO] DRV(4176893,python3.7):2024-06-29-16:19:28.436.386 [ascend][curpid: 4176893, 4185695][drv][devmm][_svm_mem_stats_show 148]SVM_MEM Mem stats (Bytes). (module_name=RUNTIME; module_id=7; current_alloced_size=19927040; alloced_peak_size=19927040; alloc_cnt=37; free_cnt=0)
    run/plog/plog-4176893_20240629161823165.log:461:[INFO] DRV(4176893,python3.7):2024-06-29-16:19:28.436.393 [ascend][curpid: 4176893, 4185695][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev6 Mem stats (Bytes). (module_name=HCCL; module_id=3; current_alloced_size=419467264; alloced_peak_size=419500032; alloc_cnt=6; free_cnt=2)
    run/plog/plog-4176893_20240629161823165.log:463:[INFO] DRV(4176893,python3.7):2024-06-29-16:19:28.436.401 [ascend][curpid: 4176893, 4185695][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev6 Mem stats (Bytes). (module_name=APP; module_id=33; current_alloced_size=64680361984; alloced_peak_size=64680361984; alloc_cnt=3113; free_cnt=0)

    Log level: Memory statistics are printed at the ERROR log level when a memory allocation fails, and also when the process exits.

    Printing format: [Memory attribute] Mem stats (Bytes). (module_name=[Module name]; module_id=[ID]; current_alloced_size=[Size]; alloced_peak_size=[Size]; alloc_cnt=[Count]; free_cnt=[Count])

    • Memory attribute: SVM_MEM (address of page faults, used in CANN), DEV_MEM (device memory), HOST_MEM (host memory), or DVPP_MEM (DVPP memory).
    • module_name: module name, for example, GE, RUNTIME, or DVPP.
    • current_alloced_size: current memory size (bytes) occupied by the module.
    • alloced_peak_size: peak memory size (bytes) occupied by the module.
    • alloc_cnt: number of times that memory is allocated.
    • free_cnt: number of times that memory is freed. If free_cnt does not match alloc_cnt, check whether the memory usage is as intended. A mismatch is expected when, for example, the upper-layer service framework manages memory through a memory pool; otherwise it may indicate a memory leak.
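
    The fields above can be extracted from a log line with a regular expression. The following is a minimal Python sketch, not part of CANN: the names STATS_RE and parse_mem_stats are hypothetical, and the pattern is inferred only from the sample lines shown above, so it may need adjusting for other log versions.

```python
import re

# Pattern inferred from the sample plog lines above (an assumption, not an official format spec).
# The device field (for example "dev6") is missing on the SVM_MEM sample, so it is optional here.
STATS_RE = re.compile(
    r"_svm_mem_stats_show \d+\]"
    r"(?P<attr>[A-Z_]+)(?: dev(?P<dev>\d+))? Mem stats \(Bytes\)\. "
    r"\((?P<fields>[^)]*)\)"
)

def parse_mem_stats(line):
    """Return the statistics of one log line as a dict, or None if the line does not match."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    record = {"attr": m.group("attr"), "dev": m.group("dev")}
    # Fields look like "module_name=HCCL; module_id=3; ..."
    for item in m.group("fields").split(";"):
        key, _, value = item.strip().partition("=")
        record[key] = int(value) if value.isdigit() else value
    return record

# Tail of one of the sample lines above.
sample = (
    "[drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev6 Mem stats (Bytes). "
    "(module_name=HCCL; module_id=3; current_alloced_size=419467264; "
    "alloced_peak_size=419500032; alloc_cnt=6; free_cnt=2)"
)
stats = parse_mem_stats(sample)
# Allocations that have not been freed yet; a nonzero value is not necessarily a leak.
outstanding = stats["alloc_cnt"] - stats["free_cnt"]
print(stats["module_name"], stats["alloced_peak_size"], outstanding)
```

    Applying the function to every line that contains _svm_mem_stats_show gives one record per module and memory attribute, which is convenient for the comparisons in the next step.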
  2. Identify the component whose increased memory allocation causes the OOM.
    • If this is the first run of the application, use the memory statistics in the logs to check whether the memory usage of each component matches expectations and fits within the physical memory of the hardware. If it does not, adjust the code logic or replan the memory usage.

      Take a model with 13 billion parameters as an example. Each parameter is of the float32 type and occupies 32 bits, that is, 4 bytes. 1 GB equals 1024^3 bytes, so the parameters of the model occupy about 48.4 GB of memory (13 x 10^9 x 4 bytes ÷ 1024^3). If the hardware has only about 50 GB of memory, the memory required to run the model is very likely to exceed the physical memory of the hardware, causing OOM.
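
      The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the estimate in this example: the function name weights_gb is hypothetical, and the result counts the parameters alone, not any other memory the model needs at run time.

```python
def weights_gb(num_params, bytes_per_param=4):
    """Memory occupied by the parameters, with 1 GB taken as 1024**3 bytes as in the text."""
    return num_params * bytes_per_param / 1024**3

# 13 billion float32 parameters, 4 bytes each.
estimate = weights_gb(13e9)
print(f"{estimate:.1f} GB")  # 48.4 GB
```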

    • If the application has run successfully before, compare the memory statistics of the historical success scenario with those of the current failure scenario. For each module, check the memory attribute and the value of alloced_peak_size. Start with the DEV_MEM attribute, because most problems are caused by insufficient device memory. Find the components whose peak value has increased, and focus on the one whose allocated memory differs most from the historical success scenario.

      The following is an example of the memory statistics for a historical success scenario (about 2 GB of memory is allocated for the application):

      [INFO] DRV(4052516,main_aarch64):2024-07-08-22:27:45.354.856 [ascend][curpid: 4052516, 4052516][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=RUNTIME; module_id=7; current_alloced_size=44138496; alloced_peak_size=44138496; alloc_cnt=18; free_cnt=0)
      [INFO] DRV(4052516,main_aarch64):2024-07-08-22:27:45.354.866 [ascend][curpid: 4052516, 4052516][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=APP; module_id=33; current_alloced_size=2078195712; alloced_peak_size=2078195712; alloc_cnt=996; free_cnt=0)

      The following is an example of the memory statistics for the current failure scenario (about 20 GB of memory is allocated for the application):

      [INFO] DRV(4052522,main_aarch64):2024-07-09-22:47:56.352.884 [ascend][curpid: 4052522, 4052522][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=RUNTIME; module_id=7; current_alloced_size=44138496; alloced_peak_size=44138496; alloc_cnt=18; free_cnt=0)
      [INFO] DRV(4052522,main_aarch64):2024-07-09-22:47:56.352.894 [ascend][curpid: 4052522, 4052522][drv][devmm][_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=APP; module_id=33; current_alloced_size=20781957120; alloced_peak_size=20781957120; alloc_cnt=996; free_cnt=0)
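
      The comparison of alloced_peak_size between the two runs can be automated. The following Python sketch is an illustration only: PEAK_RE, peaks, grown_modules, and the min_ratio threshold of 1.5 are hypothetical choices, and the pattern is inferred from the sample lines in this section.

```python
import re

# Pattern inferred from the sample lines in this section (an assumption, not an official spec).
PEAK_RE = re.compile(
    r"\](?P<attr>[A-Z_]+_MEM)(?: dev\d+)? Mem stats \(Bytes\)\. "
    r"\(module_name=(?P<module>\w+);.*?alloced_peak_size=(?P<peak>\d+)"
)

def peaks(log_text):
    """Map (memory attribute, module name) to the largest alloced_peak_size found."""
    result = {}
    for m in PEAK_RE.finditer(log_text):
        key = (m.group("attr"), m.group("module"))
        result[key] = max(result.get(key, 0), int(m.group("peak")))
    return result

def grown_modules(ok_log, failed_log, min_ratio=1.5):
    """List modules whose peak grew by at least min_ratio, largest growth first.

    Modules that are missing from the successful run are skipped.
    """
    before, after = peaks(ok_log), peaks(failed_log)
    report = []
    for key, peak in after.items():
        base = before.get(key)
        if base and peak / base >= min_ratio:
            report.append((key, base, peak, peak / base))
    return sorted(report, key=lambda r: r[3], reverse=True)

# Log lines trimmed from the two examples above.
ok_log = """\
[_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=RUNTIME; module_id=7; current_alloced_size=44138496; alloced_peak_size=44138496; alloc_cnt=18; free_cnt=0)
[_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=APP; module_id=33; current_alloced_size=2078195712; alloced_peak_size=2078195712; alloc_cnt=996; free_cnt=0)
"""
failed_log = """\
[_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=RUNTIME; module_id=7; current_alloced_size=44138496; alloced_peak_size=44138496; alloc_cnt=18; free_cnt=0)
[_svm_mem_stats_show 148]DEV_MEM dev0 Mem stats (Bytes). (module_name=APP; module_id=33; current_alloced_size=20781957120; alloced_peak_size=20781957120; alloc_cnt=996; free_cnt=0)
"""
for (attr, module), base, peak, ratio in grown_modules(ok_log, failed_log):
    print(f"{attr} {module}: {base} -> {peak} bytes ({ratio:.1f}x)")
```

      With the two examples above, only the APP module under DEV_MEM is reported, because the peak of RUNTIME is unchanged.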

    Note:

    • If the statistics show that the component whose module_name is APP uses a large amount of memory, the memory is consumed by the application process itself. In this case, compare the allocated memory with the estimated memory. If the difference is large, analyze the cause, locate the fault, and optimize the memory allocation logic in the application code.
    • If the value of module_name is GE, RUNTIME, or HCCL, the memory is used by the CANN component. Check whether the memory usage of the CANN component is high based on the following memory usage values:
      • In training scenarios, the memory usage of CANN components varies depending on the framework. The following uses the PyTorch framework as an example. The memory usage of key components is listed below for reference. If a CANN component uses a large amount of memory, contact technical support.
        • GE: about 3 MB
        • RUNTIME: about 26 MB
        • HCCL: Memory usage = Memory used by communication links + Buffer memory. The memory used by communication links depends on the cluster scale and communication links. The memory used by buffers depends on the number of communication domains and the buffer size used by a single communication domain.

          For example, assume a cluster of 1024 servers in which 10 communication links and three communication domains need to be established, and the send and receive buffers of a single communication domain occupy 2 x HCCL_BUFFSIZE in total (HCCL_BUFFSIZE is a user-defined environment variable; the default value is 200 MB). The memory usage is then calculated as follows: Memory used by communication links = 10 x 4 MB (in single-operator mode) or 10 x Number of operators in the graph x 0.3 MB (in graph mode); Buffer memory = 3 x 2 x 200 MB = 1200 MB.
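
          The HCCL formula in this example can be written as a small Python function. This is a sketch under the figures quoted above (4 MB per link in single-operator mode, 0.3 MB per link and operator in graph mode); the function name hccl_memory_mb and its parameters are hypothetical and are not part of any HCCL API.

```python
def hccl_memory_mb(num_links, num_domains, buffsize_mb=200,
                   graph_mode=False, ops_in_graph=0):
    """Estimate HCCL memory in MB as link memory plus buffer memory."""
    if graph_mode:
        # Graph mode: about 0.3 MB per link for each operator in the graph.
        link_mb = num_links * ops_in_graph * 0.3
    else:
        # Single-operator mode: about 4 MB per link.
        link_mb = num_links * 4
    # Each communication domain uses 2 x HCCL_BUFFSIZE (send and receive buffers).
    buffer_mb = num_domains * 2 * buffsize_mb
    return link_mb + buffer_mb

# 10 links, 3 communication domains, default HCCL_BUFFSIZE, single-operator mode.
print(hccl_memory_mb(10, 3))  # 1240
```

          For graph mode, pass graph_mode=True together with the number of operators in the graph; the buffer term stays the same.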

      • In inference scenarios, take a PyTorch model as an example: the model is converted into an Ascend OM model, and inference is then performed on the Ascend device. The memory usage of key components is listed below for reference. If a CANN component uses a large amount of memory, contact technical support.
        • GE: about 86 MB
        • RUNTIME: about 18 MB