Host Resource Information

File Description

File content: Information collected by the top command or an automation script. It includes the total physical memory used by the host and, for the main training or inference process of each NPU, the CPU usage (%CPU) and the physical memory used (RES). The data is stored in a JSON file, for example, host_metrics_${core_num}.json.
Naming constraint: host_metrics_${core_num}.json, for example, host_metrics_64.json, where 64 indicates the number of CPU cores.
Constraints on the storage path:
- Collection directory/environment_check/
- ${Path specified by --env_check}/
- For details, see Log Collection Directory Structure.

Collection Mode Description

MindCluster Ascend FaultDiag can collect the host resource information in either of the following ways:

Script-based collection: Run the host_resource_collect.py script to collect the host resource information. For details, see Log Collection Scripts.
CLI-based collection: Collect the host resource information by running commands.

CLI-based Collection

Before training or inference, run the following command to query the total number of CPU cores of the training or inference device:
```
cat /proc/cpuinfo | grep "processor" | wc -l
```
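The same count can be obtained from a script. The following is a minimal sketch, assuming a Linux host where /proc/cpuinfo lists one "processor" entry per logical core. The function names are illustrative and are not part of the FaultDiag tooling. Unlike grep, it matches only lines whose key is exactly "processor".

```python
def count_processors(cpuinfo_text: str) -> int:
    """Count the 'processor' entries in the text of /proc/cpuinfo."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def read_core_num(path: str = "/proc/cpuinfo") -> int:
    """Read the file and return the number of logical CPU cores."""
    with open(path, encoding="utf-8") as f:
        return count_processors(f.read())
```

The returned value is the ${core_num} used in the file name host_metrics_${core_num}.json.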

During training or inference, run the npu-smi info command to query the ID of the process running on each NPU, and record all process IDs as {pid_list}.

/usr/local/bin/npu-smi info

Command output:

+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc3          Version: 23.0.rc2.3                                      |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     xxx                | OK            | 73.4        44                1123 / 1123          |
| 0                         | 0000:C1:00.0  | 0           4565 / 15137      30710/ 32768         |
+===========================+===============+====================================================+
| 1     xxx                | OK            | 69.6        39                1123 / 1123          |
| 0                         | 0000:81:00.0  | 0           4483 / 15137      30710/ 32768         |
+===========================+===============+====================================================+
| 2     xxx                | OK            | 70.0        36                1123 / 1123          |
| 0                         | 0000:41:00.0  | 0           4437 / 15137      30710/ 32768         |
+===========================+===============+====================================================+
| 3     xxx                | OK            | 69.6        44                1123 / 1123          |
| 0                         | 0000:01:00.0  | 0           3845 / 15039      30709/ 32768         |
+===========================+===============+====================================================+
| 4     xxx                | OK            | 71.3        40                1123 / 1123          |
| 0                         | 0000:C2:00.0  | 0           4296 / 15137      30709/ 32768         |
+===========================+===============+====================================================+
| 5     xxx                | OK            | 67.0        36                1123 / 1123          |
| 0                         | 0000:82:00.0  | 0           3758 / 15137      30709/ 32768         |
+===========================+===============+====================================================+
| 6     xxx                | OK            | 71.7        37                1123 / 1123          |
| 0                         | 0000:42:00.0  | 0           4581 / 15137      30710/ 32768         |
+===========================+===============+====================================================+
| 7     xxx                | OK            | 69.1        42                1123 / 1123          |
| 0                         | 0000:02:00.0  | 0           4690 / 15039      30710/ 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 139667        | python                   | 30780                   |
+===========================+===============+====================================================+
| 1       0                 | 139577        | python                   | 30782                   |
+===========================+===============+====================================================+
| 2       0                 | 139446        | python                   | 30780                   |
+===========================+===============+====================================================+
| 3       0                 | 139372        | python                   | 30780                   |
+===========================+===============+====================================================+
| 4       0                 | 139258        | python                   | 30780                   |
+===========================+===============+====================================================+
| 5       0                 | 139163        | python                   | 30780                   |
+===========================+===============+====================================================+
| 6       0                 | 139126        | python                   | 30780                   |
+===========================+===============+====================================================+
| 7       0                 | 139090        | python                   | 30780                   |
+===========================+===============+====================================================+
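The process IDs can be extracted from this output with a short script. The following is a sketch only, assuming the process table keeps the layout shown above (NPU and chip numbers, process ID, a process name without spaces, and process memory). The names extract_pids and to_pid_list are hypothetical helpers, not part of npu-smi.

```python
import re

# Matches a process row such as "| 0  0  | 139667  | python  | 30780  |".
# Assumption: rows follow the layout of the sample output above.
PROCESS_ROW = re.compile(
    r"^\|\s*(\d+)\s+(\d+)\s*\|\s*(\d+)\s*\|\s*(\S+)\s*\|\s*(\d+)\s*\|$"
)


def extract_pids(npu_smi_output: str) -> list[int]:
    """Return the process IDs found in the process table, in order."""
    pids = []
    for line in npu_smi_output.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            pids.append(int(match.group(3)))
    return pids


def to_pid_list(pids: list[int]) -> str:
    """Join the PIDs with commas and no spaces, as top -p expects."""
    return ",".join(str(pid) for pid in pids)
```

Device rows do not match the pattern, because their first cell does not consist of two integers, so only the rows of the process table contribute PIDs.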

During training or inference, run the top command to query the resource usage and record the total physical memory used by the host, PID of each process, physical memory used by each process, and CPU usage of each process.

top -p {pid_list} -n 1 -b

Example command:

top -p 139667,139577,139446,139372,139258,139163,139126,139090 -n 1 -b

Command output:

top - 14:15:53 up 39 days, 22:54,  9 users,  load average: 28.32, 10.28, 5.44
Tasks: 2727 total,   9 running, 1261 sleeping,   1 stopped,   0 zombie
%Cpu(s):  5.6 us,  5.4 sy,  0.0 ni, 89.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 80358528+total, 57884742+free, 70817856 used, 15392000+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 67941792+avail Mem
 
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
139667 root      20   0 8203.5g   3.4g 526208 R 309.5  0.4   1:46.26 python
139577 root      20   0 8203.5g   3.4g 526208 R 214.3  0.4   1:25.03 python
139446 root      20   0 8203.5g   3.4g 526144 R 204.8  0.4   1:54.20 python
139372 root      20   0 8203.5g   3.4g 526144 R 314.3  0.4   2:10.20 python
139258 root      20   0 8203.5g   3.4g 526144 R 209.5  0.4   1:23.53 python
139163 root      20   0 8203.5g   3.4g 526144 R 309.5  0.4   2:18.71 python
139126 root      20   0 8203.5g   3.4g 526144 R 109.5  0.4   0:58.54 python
139090 root      20   0 8203.5g   3.4g 526144 R 409.5  0.4   2:07.01 python
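The values to record can be read from this output programmatically. The following is a rough sketch, assuming the batch-mode layout shown above: a "KiB Mem" summary line and a process table whose sixth column is RES and ninth column is %CPU. Normalizing memory values to bytes is an assumption made for this example, since the unit is not stated here. The helper names are illustrative.

```python
import re

# Scale factors for the suffixes that top appends to large memory values.
_UNIT = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def res_to_bytes(field: str) -> int:
    """Convert a RES field such as '3.4g' or '526208' (KiB) to bytes."""
    suffix = field[-1].lower()
    if suffix in _UNIT:
        return int(float(field[:-1]) * _UNIT[suffix])
    return int(field) * 1024


def parse_top(top_output: str):
    """Return (host memory used in bytes, {pid: {res_bytes, cpu_percent}})."""
    mem_used = None
    processes = {}
    in_table = False
    for line in top_output.splitlines():
        fields = line.split()
        if line.lstrip().startswith("KiB Mem"):
            # Assumption: the summary line reports memory in KiB.
            match = re.search(r"(\d+)\+?\s*used", line)
            if match:
                mem_used = int(match.group(1)) * 1024
        elif fields[:1] == ["PID"]:
            in_table = True  # Process rows start after the header row.
        elif in_table and len(fields) >= 12 and fields[0].isdigit():
            processes[int(fields[0])] = {
                "res_bytes": res_to_bytes(fields[5]),
                "cpu_percent": float(fields[8]),
            }
    return mem_used, processes
```

A value with a trailing "+" in the summary line, such as 80358528+total, has been truncated by top, so a sketch like this is reliable only for fields that are printed in full.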

The format of the saved file is as follows:

Record the total physical memory used by the host and the PID, RES, and %CPU of each training or inference process. Store each sample as a pair in the format of [Unix timestamp, Metric value], and save the data in a JSON file named host_metrics_${core_num}.json in the following format:

host_metrics_${core_num}.json:
{
"node_mem_used": [[Unix timestamp, Metric value],...],
"node_rss_${pid}": [[Unix timestamp, Metric value],...],
"node_cpu_${pid}": [[Unix timestamp, Metric value],...]
}

core_num: total number of CPU cores of the device.
node_rss_${pid}: metric list of the physical memory used by a process, which corresponds to RES and is stored in groups by PID.
node_cpu_${pid}: metric list of the CPU usage of a process, which corresponds to %CPU and is stored in groups by PID.
node_mem_used: metric list of the total physical memory used by the host, which corresponds to KiB Mem: xxx used.
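The structure described above can be assembled and written with a few lines of code. The following is a minimal sketch, not the behavior of host_resource_collect.py. The function names and the (rss, cpu) tuple layout are assumptions made for illustration, and the values are stored exactly as passed in.

```python
import json
import os
import time


def append_sample(metrics, mem_used, processes, timestamp=None):
    """Append one sample to every series; processes maps pid -> (rss, cpu)."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    metrics.setdefault("node_mem_used", []).append([ts, mem_used])
    for pid, (rss, cpu) in processes.items():
        metrics.setdefault(f"node_rss_{pid}", []).append([ts, rss])
        metrics.setdefault(f"node_cpu_{pid}", []).append([ts, cpu])
    return metrics


def save_metrics(metrics, core_num, directory="."):
    """Write the metrics to host_metrics_<core_num>.json and return the path."""
    path = os.path.join(directory, f"host_metrics_{core_num}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metrics, f)
    return path
```

Calling append_sample once per sampling interval and save_metrics at the end produces one series per key, grouped by PID as described above.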

If the collected host resource information contains a large amount of abnormal data, the device resource analysis performed during subsequent fault diagnosis may produce abnormal results, making it harder to identify the actual problem.

Storage example:

{
"node_mem_used": [[1689732534, 10259988480],[1689732594, 10259988481]],
"node_rss_139667": [[1689732534, 353370112],[1689732594, 353370115]],
"node_cpu_139667": [[1689732534, "12.0"],[1689732594, "13.0"]],
"node_rss_139577": [[1689732534, 224591872],[1689732594, 224591877]],
"node_cpu_139577": [[1689732534, "24.0"],[1689732594, "27.0"]],
"node_rss_139446": [[1689732534, 127008768],[1689732594, 127008769]],
"node_cpu_139446": [[1689732534, "16.0"],[1689732594, "19.0"]]
...
}
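Before a file is handed to fault diagnosis, a basic structural check can catch obvious mistakes. The following sketch is only an illustration: what counts as abnormal data is not defined here, so the rules it applies (known key names, [timestamp, value] pairs, numeric values, strictly increasing timestamps) are assumptions, and find_problems is a hypothetical helper.

```python
import re

# Key names permitted by the file format described above.
KEY_PATTERN = re.compile(r"^(node_mem_used|node_rss_\d+|node_cpu_\d+)$")


def find_problems(metrics: dict) -> list[str]:
    """Return human-readable descriptions of structural problems."""
    problems = []
    for key, series in metrics.items():
        if not KEY_PATTERN.match(key):
            problems.append(f"{key}: unexpected key")
            continue
        previous_ts = None
        for pair in series:
            if not (isinstance(pair, list) and len(pair) == 2):
                problems.append(f"{key}: {pair!r} is not [timestamp, value]")
                continue
            ts, value = pair
            if not isinstance(ts, int):
                problems.append(f"{key}: timestamp {ts!r} is not an integer")
                continue
            try:
                float(value)  # Accepts numbers and numeric strings like "12.0".
            except (TypeError, ValueError):
                problems.append(f"{key}: value {value!r} is not numeric")
            if previous_ts is not None and ts <= previous_ts:
                problems.append(f"{key}: timestamp {ts} does not increase")
            previous_ts = ts
    return problems
```

An empty list means that none of these assumed rules were violated. It does not guarantee that the measurements themselves are plausible.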
