Host Resource Information

File Description

  • File content: Host resource information collected by the top command or an automation script. It includes the total physical memory used by the host and, for the main training or inference process of each NPU, the CPU usage (%CPU) and the physical memory used (RES). The data is stored in a JSON file, for example, host_metrics_${core_num}.json.
  • Naming constraint: host_metrics_${core_num}.json, for example, host_metrics_64.json, where 64 indicates the number of CPU cores.
  • Constraints on the storage path:

Collection Mode Description

MindCluster Ascend FaultDiag can collect the host resource information in either of the following ways:

  • Script-based collection: Run the host_resource_collect.py script to collect the host resource information. For details, see Log Collection Scripts.
  • CLI-based collection: Collect the host resource information by running commands.

CLI-based Collection

  • Before training or inference, run the following command to query the total number of CPU cores of the training or inference device:
    cat /proc/cpuinfo | grep "processor" | wc -l
  • During training or inference, run the npu-smi info command to query the ID of the process running on each NPU, and record all process IDs as a comma-separated list {pid_list}.
    /usr/local/bin/npu-smi info

    Command output:

    +------------------------------------------------------------------------------------------------+
    | npu-smi 23.0.rc3          Version: 23.0.rc2.3                                      |
    +---------------------------+---------------+----------------------------------------------------+
    | NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
    | Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
    +===========================+===============+====================================================+
    | 0     xxx                | OK            | 73.4        44                1123 / 1123          |
    | 0                         | 0000:C1:00.0  | 0           4565 / 15137      30710/ 32768         |
    +===========================+===============+====================================================+
    | 1     xxx                | OK            | 69.6        39                1123 / 1123          |
    | 0                         | 0000:81:00.0  | 0           4483 / 15137      30710/ 32768         |
    +===========================+===============+====================================================+
    | 2     xxx                | OK            | 70.0        36                1123 / 1123          |
    | 0                         | 0000:41:00.0  | 0           4437 / 15137      30710/ 32768         |
    +===========================+===============+====================================================+
    | 3     xxx                | OK            | 69.6        44                1123 / 1123          |
    | 0                         | 0000:01:00.0  | 0           3845 / 15039      30709/ 32768         |
    +===========================+===============+====================================================+
    | 4     xxx                | OK            | 71.3        40                1123 / 1123          |
    | 0                         | 0000:C2:00.0  | 0           4296 / 15137      30709/ 32768         |
    +===========================+===============+====================================================+
    | 5     xxx                | OK            | 67.0        36                1123 / 1123          |
    | 0                         | 0000:82:00.0  | 0           3758 / 15137      30709/ 32768         |
    +===========================+===============+====================================================+
    | 6     xxx                | OK            | 71.7        37                1123 / 1123          |
    | 0                         | 0000:42:00.0  | 0           4581 / 15137      30710/ 32768         |
    +===========================+===============+====================================================+
    | 7     xxx                | OK            | 69.1        42                1123 / 1123          |
    | 0                         | 0000:02:00.0  | 0           4690 / 15039      30710/ 32768         |
    +===========================+===============+====================================================+
    +---------------------------+---------------+----------------------------------------------------+
    | NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
    +===========================+===============+====================================================+
    | 0       0                 | 139667        | python                   | 30780                   |
    +===========================+===============+====================================================+
    | 1       0                 | 139577        | python                   | 30782                   |
    +===========================+===============+====================================================+
    | 2       0                 | 139446        | python                   | 30780                   |
    +===========================+===============+====================================================+
    | 3       0                 | 139372        | python                   | 30780                   |
    +===========================+===============+====================================================+
    | 4       0                 | 139258        | python                   | 30780                   |
    +===========================+===============+====================================================+
    | 5       0                 | 139163        | python                   | 30780                   |
    +===========================+===============+====================================================+
    | 6       0                 | 139126        | python                   | 30780                   |
    +===========================+===============+====================================================+
    | 7       0                 | 139090        | python                   | 30780                   |
    +===========================+===============+====================================================+
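The process IDs can also be extracted from the npu-smi info output with a short script. The following Python sketch is illustrative only: the function name extract_pids is hypothetical, and it assumes that the process table uses the same column layout as the sample output above (the layout may differ between npu-smi versions).

```python
def extract_pids(npu_smi_output: str) -> list[int]:
    """Return the PIDs listed in the process table of `npu-smi info` output.

    Assumption: the process table starts at the row whose second column is
    "Process id", and each data row carries the PID in that same column.
    """
    pids = []
    in_process_table = False
    for line in npu_smi_output.splitlines():
        # Split a table row such as "| 0  0 | 139667 | python | 30780 |".
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2:
            continue  # border line or blank line
        if cells[1].lower() == "process id":
            in_process_table = True
            continue
        if in_process_table and cells[1].isdigit():
            pids.append(int(cells[1]))
    return pids


# Shortened sample taken from the command output above.
SAMPLE = """\
| 7                         | 0000:02:00.0  | 0           4690 / 15039      30710/ 32768         |
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 139667        | python                   | 30780                   |
+===========================+===============+====================================================+
| 1       0                 | 139577        | python                   | 30782                   |
+===========================+===============+====================================================+
"""

pid_list = ",".join(str(pid) for pid in extract_pids(SAMPLE))
print(pid_list)
```

In practice, the input text could come from subprocess.run(["/usr/local/bin/npu-smi", "info"], capture_output=True, text=True).stdout, and the resulting pid_list can be passed to top -p.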
  • During training or inference, run the top command to query the resource usage and record the total physical memory used by the host, PID of each process, physical memory used by each process, and CPU usage of each process.
    top -p {pid_list} -n 1 -b
    Example command:
    top -p 139667,139577,139446,139372,139258,139163,139126,139090 -n 1 -b

    Command output:

    top - 14:15:53 up 39 days, 22:54,  9 users,  load average: 28.32, 10.28, 5.44
    Tasks: 2727 total,   9 running, 1261 sleeping,   1 stopped,   0 zombie
    %Cpu(s):  5.6 us,  5.4 sy,  0.0 ni, 89.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem : 80358528+total, 57884742+free, 70817856 used, 15392000+buff/cache
    KiB Swap:        0 total,        0 free,        0 used. 67941792+avail Mem
     
       PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    139667 root      20   0 8203.5g   3.4g 526208 R 309.5  0.4   1:46.26 python
    139577 root      20   0 8203.5g   3.4g 526208 R 214.3  0.4   1:25.03 python
    139446 root      20   0 8203.5g   3.4g 526144 R 204.8  0.4   1:54.20 python
    139372 root      20   0 8203.5g   3.4g 526144 R 314.3  0.4   2:10.20 python
    139258 root      20   0 8203.5g   3.4g 526144 R 209.5  0.4   1:23.53 python
    139163 root      20   0 8203.5g   3.4g 526144 R 309.5  0.4   2:18.71 python
    139126 root      20   0 8203.5g   3.4g 526144 R 109.5  0.4   0:58.54 python
    139090 root      20   0 8203.5g   3.4g 526144 R 409.5  0.4   2:07.01 python
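The values to record can be read from this output programmatically. The following Python sketch is one possible approach, not the tool's own parser: the names to_bytes and parse_top are hypothetical, and converting every value to bytes is an assumption based on the magnitude of the numbers in the storage example (the required unit is not stated here). It relies on top's convention that RES values without a suffix are in KiB.

```python
import re

# Multipliers for the unit suffixes that top prints (k, m, g, t, p, e).
UNIT_BYTES = {u: 1024 ** (i + 1) for i, u in enumerate("kmgtpe")}


def to_bytes(value: str, default_unit: str = "k") -> int:
    """Convert a top memory field such as '526208' or '3.4g' to bytes."""
    value = value.strip().rstrip("+").lower()  # '+' marks a truncated number
    unit = default_unit
    if value and value[-1] in UNIT_BYTES:
        unit, value = value[-1], value[:-1]
    return int(float(value) * UNIT_BYTES[unit])


def parse_top(output: str):
    """Return (host_mem_used_bytes, {pid: (res_bytes, cpu_percent)})."""
    mem_used = None
    processes = {}
    columns = None
    for line in output.splitlines():
        # Summary line, for example "KiB Mem : ... 70817856 used, ...".
        mem = re.match(r"\s*([KMGTPE])iB Mem\s*:.*?([\d.]+)\+?\s+used", line)
        if mem:
            mem_used = to_bytes(mem.group(2), mem.group(1).lower())
            continue
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "PID":
            # Remember where the columns of interest sit in the header.
            columns = {name: fields.index(name) for name in ("PID", "RES", "%CPU")}
        elif columns and fields[0].isdigit():
            pid = int(fields[columns["PID"]])
            res = to_bytes(fields[columns["RES"]])
            cpu = float(fields[columns["%CPU"]])
            processes[pid] = (res, cpu)
    return mem_used, processes


SAMPLE_TOP = """\
KiB Mem : 80358528+total, 57884742+free, 70817856 used, 15392000+buff/cache

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
139667 root      20   0 8203.5g   3.4g 526208 R 309.5  0.4   1:46.26 python
"""

mem_used, processes = parse_top(SAMPLE_TOP)
```

Because top rounds RES to one decimal place when it prints a g suffix, values parsed this way are approximate.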

    Record the total physical memory used by the host and the PID, RES, and %CPU of each training or inference process. Store each sample as a [Unix timestamp, Metric value] pair, and save the data in a JSON file named host_metrics_${core_num}.json in the following format:

    host_metrics_${core_num}.json:
    {
    "node_mem_used": [[Unix timestamp, Metric value],...],
    "node_rss_${pid}": [[Unix timestamp, Metric value],...],
    "node_cpu_${pid}": [[Unix timestamp, Metric value],...]
    }
    • core_num: total number of CPU cores of the device.
    • node_rss_${pid}: metric list of the physical memory used by a process, which corresponds to RES and is stored in groups by PID.
    • node_cpu_${pid}: metric list of the CPU usage of a process, which corresponds to %CPU and is stored in groups by PID.
    • node_mem_used: metric list of the total physical memory used by the host, which corresponds to the used value in the KiB Mem line of the top output.

    If the collected host resource information contains a large amount of abnormal data, the device resource analysis performed during subsequent fault diagnosis may produce inaccurate results, making it harder to identify the actual problem.

    Storage example:

    {
    "node_mem_used": [[1689732534, 10259988480],[1689732594, 10259988481]],
    "node_rss_139667": [[1689732534, 353370112],[1689732594, 353370115]],
    "node_cpu_139667": [[1689732534, "12.0"],[1689732594, "13.0"]],
    "node_rss_139577": [[1689732534, 224591872],[1689732594, 224591877]],
    "node_cpu_139577": [[1689732534, "24.0"],[1689732594, "27.0"]],
    "node_rss_139446": [[1689732534, 127008768],[1689732594, 127008769]],
    "node_cpu_139446": [[1689732534, "16.0"],[1689732594, "19.0"]]
    ...
    }
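A file in this format can be built up one sample at a time. The following Python sketch shows one way to do so under stated assumptions: the helper names metrics_filename and append_sample are hypothetical, and writing %CPU as a string with one decimal place simply mirrors the storage example above rather than a documented requirement.

```python
import json


def metrics_filename(core_num: int) -> str:
    """Build the file name required by the naming constraint."""
    return f"host_metrics_{core_num}.json"


def append_sample(path, timestamp, mem_used, per_process):
    """Append one sample to the metrics file and return the updated data.

    per_process maps each PID to a (res, cpu_percent) tuple.
    """
    try:
        with open(path, encoding="utf-8") as f:
            metrics = json.load(f)
    except FileNotFoundError:
        metrics = {}  # first sample: start with an empty file

    metrics.setdefault("node_mem_used", []).append([timestamp, mem_used])
    for pid, (res, cpu) in per_process.items():
        metrics.setdefault(f"node_rss_{pid}", []).append([timestamp, res])
        # %CPU is stored as a string, as in the storage example.
        metrics.setdefault(f"node_cpu_{pid}", []).append([timestamp, f"{cpu:.1f}"])

    with open(path, "w", encoding="utf-8") as f:
        json.dump(metrics, f)
    return metrics
```

A collection loop would call append_sample(metrics_filename(core_num), int(time.time()), mem_used, processes) at a fixed interval, where core_num is the CPU core count queried before training or inference.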