Host Resource Information
File Description
- File content: Information collected by the top command or an automation script, including the total physical memory used by the host and, for the main training or inference process of each NPU, the CPU usage (%CPU) and the physical memory used (RES). The data is stored in a JSON file, for example, host_metrics_${core_num}.json.
- Naming constraint: host_metrics_${core_num}.json, for example, host_metrics_64.json, where 64 indicates the number of CPU cores.
- Constraints on the storage path:
- Collection directory/environment_check/
- ${path specified by --env_check}/
- For details, see Log Collection Directory Structure.
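The naming and path constraints above can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the FaultDiag tooling: the function names host_metrics_filename and host_metrics_path and the base_dir and is_env_check_path parameters are hypothetical.

```python
import posixpath


def host_metrics_filename(core_num: int) -> str:
    """Build the file name required by the naming constraint."""
    if core_num <= 0:
        raise ValueError("core_num must be a positive integer")
    return f"host_metrics_{core_num}.json"


def host_metrics_path(base_dir: str, core_num: int, is_env_check_path: bool = False) -> str:
    """Return where the file is expected to be stored.

    base_dir is either the collection directory (the file then goes into
    its environment_check/ subdirectory) or the path given to --env_check
    (the file then goes directly into it).
    """
    parts = [base_dir]
    if not is_env_check_path:
        parts.append("environment_check")
    parts.append(host_metrics_filename(core_num))
    return posixpath.join(*parts)
```

For a 64-core host, host_metrics_path("/logs", 64) gives /logs/environment_check/host_metrics_64.json, where /logs stands in for the collection directory.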
Collection Mode Description
MindCluster Ascend FaultDiag can collect the host resource information in either of the following ways:
- Script-based collection: Run the host_resource_collect.py script to collect the host resource information. For details, see Log Collection Scripts.
- CLI-based collection: Collect the host resource information by running commands.
CLI-based Collection
- Before training or inference, run the following command to query the total number of CPU cores of the training or inference device:
cat /proc/cpuinfo | grep "processor" | wc -l
- During training or inference, run the npu-smi info command to query the process ID of each device and record all process IDs as {pid_list}.
/usr/local/bin/npu-smi info
Command output:
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.rc3                  Version: 23.0.rc2.3                                          |
+---------------------------+---------------+----------------------------------------------------+
| NPU     Name              | Health        | Power(W)     Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)    Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0       xxx               | OK            | 73.4         44                1123 / 1123          |
| 0                         | 0000:C1:00.0  | 0            4565 / 15137      30710/ 32768         |
+===========================+===============+====================================================+
| 1       xxx               | OK            | 69.6         39                1123 / 1123          |
| 0                         | 0000:81:00.0  | 0            4483 / 15137      30710/ 32768         |
+===========================+===============+====================================================+
| 2       xxx               | OK            | 70.0         36                1123 / 1123          |
| 0                         | 0000:41:00.0  | 0            4437 / 15137      30710/ 32768         |
+===========================+===============+====================================================+
| 3       xxx               | OK            | 69.6         44                1123 / 1123          |
| 0                         | 0000:01:00.0  | 0            3845 / 15039      30709/ 32768         |
+===========================+===============+====================================================+
| 4       xxx               | OK            | 71.3         40                1123 / 1123          |
| 0                         | 0000:C2:00.0  | 0            4296 / 15137      30709/ 32768         |
+===========================+===============+====================================================+
| 5       xxx               | OK            | 67.0         36                1123 / 1123          |
| 0                         | 0000:82:00.0  | 0            3758 / 15137      30709/ 32768         |
+===========================+===============+====================================================+
| 6       xxx               | OK            | 71.7         37                1123 / 1123          |
| 0                         | 0000:42:00.0  | 0            4581 / 15137      30710/ 32768         |
+===========================+===============+====================================================+
| 7       xxx               | OK            | 69.1         42                1123 / 1123          |
| 0                         | 0000:02:00.0  | 0            4690 / 15039      30710/ 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 139667        | python                   | 30780                   |
+===========================+===============+====================================================+
| 1       0                 | 139577        | python                   | 30782                   |
+===========================+===============+====================================================+
| 2       0                 | 139446        | python                   | 30780                   |
+===========================+===============+====================================================+
| 3       0                 | 139372        | python                   | 30780                   |
+===========================+===============+====================================================+
| 4       0                 | 139258        | python                   | 30780                   |
+===========================+===============+====================================================+
| 5       0                 | 139163        | python                   | 30780                   |
+===========================+===============+====================================================+
| 6       0                 | 139126        | python                   | 30780                   |
+===========================+===============+====================================================+
| 7       0                 | 139090        | python                   | 30780                   |
+===========================+===============+====================================================+
- During training or inference, run the top command to query the resource usage and record the total physical memory used by the host, PID of each process, physical memory used by each process, and CPU usage of each process.
top -p {pid_list} -n 1 -b
Example command:
top -p 139667,139577,139446,139372,139258,139163,139126,139090 -n 1 -b
Command output:
top - 14:15:53 up 39 days, 22:54,  9 users,  load average: 28.32, 10.28, 5.44
Tasks: 2727 total,   9 running, 1261 sleeping,   1 stopped,   0 zombie
%Cpu(s):  5.6 us,  5.4 sy,  0.0 ni, 89.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 80358528+total, 57884742+free, 70817856 used, 15392000+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 67941792+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
139667 root      20   0 8203.5g   3.4g 526208 R 309.5  0.4   1:46.26 python
139577 root      20   0 8203.5g   3.4g 526208 R 214.3  0.4   1:25.03 python
139446 root      20   0 8203.5g   3.4g 526144 R 204.8  0.4   1:54.20 python
139372 root      20   0 8203.5g   3.4g 526144 R 314.3  0.4   2:10.20 python
139258 root      20   0 8203.5g   3.4g 526144 R 209.5  0.4   1:23.53 python
139163 root      20   0 8203.5g   3.4g 526144 R 309.5  0.4   2:18.71 python
139126 root      20   0 8203.5g   3.4g 526144 R 109.5  0.4   0:58.54 python
139090 root      20   0 8203.5g   3.4g 526144 R 409.5  0.4   2:07.01 python
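The values recorded in this step can also be extracted from the batch output programmatically. The following is a minimal sketch, assuming the top layout shown above (a "KiB Mem" summary line and a header row that starts with PID and USER). The function names res_to_bytes and parse_top_output are hypothetical and are not part of FaultDiag. Converting RES to bytes is an assumption about the unit to store; adjust it to the unit your analysis expects.

```python
import re

# top prints RES in KiB when there is no suffix, or scaled with k/m/g/t.
_UNIT_TO_BYTES = {"": 1024, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def res_to_bytes(res: str) -> int:
    """Convert a top RES field such as '526208' or '3.4g' to bytes."""
    match = re.fullmatch(r"([0-9.]+)([kmgt]?)", res.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized RES value: {res!r}")
    number, unit = match.groups()
    return int(float(number) * _UNIT_TO_BYTES[unit])


def parse_top_output(text: str) -> dict:
    """Extract host memory used (KiB) and per-process RES and %CPU."""
    result = {"mem_used_kib": None, "processes": {}}
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if line.startswith("KiB Mem"):
            # Example: "KiB Mem : 80358528+total, ..., 70817856 used, ..."
            used = re.search(r"(\d+)\+?\s*used", line)
            if used:
                result["mem_used_kib"] = int(used.group(1))
        elif fields[:2] == ["PID", "USER"]:
            # Remember the header so that columns are looked up by name.
            columns = fields
        elif columns and len(fields) >= len(columns) and fields[0].isdigit():
            row = dict(zip(columns, fields))
            result["processes"][int(row["PID"])] = {
                "res_bytes": res_to_bytes(row["RES"]),
                "cpu_percent": float(row["%CPU"]),
            }
    return result
```

Looking up columns by header name rather than by fixed position keeps the sketch working if the column order of top has been customized, as long as PID, RES, and %CPU are displayed.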
The format of the saved file is as follows:
Collect the total physical memory used by the host and the PID, RES, and %CPU information of the training or inference process, and record each piece of information in the format of [Unix timestamp, Metric value]. Save the information in a JSON file named host_metrics_${core_num}.json in the following format:
host_metrics_${core_num}.json:
{
    "node_mem_used": [[Unix timestamp, Metric value],...],
    "node_rss_${pid}": [[Unix timestamp, Metric value],...],
    "node_cpu_${pid}": [[Unix timestamp, Metric value],...]
}
- core_num: total number of CPU cores of the device.
- node_rss_${pid}: metric list of the physical memory used by a process, which corresponds to RES and is stored in groups by PID.
- node_cpu_${pid}: metric list of the CPU usage of a process, which corresponds to %CPU and is stored in groups by PID.
- node_mem_used: metric list of the total physical memory used by the host, which corresponds to the used value in the KiB Mem line of the top output.
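The structure described above can be assembled one sampling round at a time. The following is a minimal sketch, not the behavior of host_resource_collect.py: the function names append_sample and save_metrics and their parameters are hypothetical, and metric values are passed through unchanged, so the caller decides their unit and type.

```python
import json
import os


def append_sample(metrics: dict, timestamp: int, mem_used, processes: dict) -> None:
    """Append one sampling round to the metric lists.

    processes maps each PID to a (rss, cpu) tuple taken from the RES and
    %CPU columns of top.
    """
    metrics.setdefault("node_mem_used", []).append([timestamp, mem_used])
    for pid, (rss, cpu) in processes.items():
        metrics.setdefault(f"node_rss_{pid}", []).append([timestamp, rss])
        metrics.setdefault(f"node_cpu_{pid}", []).append([timestamp, cpu])


def save_metrics(metrics: dict, core_num: int, directory: str = ".") -> str:
    """Write the metrics to host_metrics_<core_num>.json and return the path."""
    path = os.path.join(directory, f"host_metrics_{core_num}.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metrics, handle)
    return path
```

Calling append_sample once per sampling interval with int(time.time()) as the timestamp, and save_metrics when collection ends, produces a file with the keys listed above.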
If the collected host resource information contains a large amount of abnormal data, the device resource analysis result for further fault diagnosis may be abnormal, hindering the identification of the actual problem.
Storage example:
{
    "node_mem_used": [[1689732534, 10259988480],[1689732594, 10259988481]],
    "node_rss_139667": [[1689732534, 353370112],[1689732594, 353370115]],
    "node_cpu_139667": [[1689732534, "12.0"],[1689732594, "13.0"]],
    "node_rss_139577": [[1689732534, 224591872],[1689732594, 224591877]],
    "node_cpu_139577": [[1689732534, "24.0"],[1689732594, "27.0"]],
    "node_rss_139446": [[1689732534, 127008768],[1689732594, 127008769]],
    "node_cpu_139446": [[1689732534, "16.0"],[1689732594, "19.0"]]
    ...
}
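Because abnormal data in this file can distort later analysis, a quick structural check before handing the file over may help. The following is a minimal sketch. The function name find_series_problems is hypothetical, and the checks (expected key names, two-element entries, strictly increasing integer timestamps) are assumptions about what a well-formed file looks like, not rules stated by FaultDiag.

```python
import re

# Key names described in this section: node_mem_used, node_rss_<pid>, node_cpu_<pid>.
_KEY_PATTERN = re.compile(r"node_mem_used|node_(rss|cpu)_\d+")


def find_series_problems(metrics: dict) -> list:
    """Return a list of human-readable problems found in the metric lists."""
    problems = []
    for key, series in metrics.items():
        if not _KEY_PATTERN.fullmatch(key):
            problems.append(f"{key}: unexpected key name")
            continue
        last_ts = None
        for index, entry in enumerate(series):
            if not (isinstance(entry, list) and len(entry) == 2):
                problems.append(f"{key}[{index}]: not a [timestamp, value] pair")
                continue
            ts = entry[0]
            if isinstance(ts, bool) or not isinstance(ts, int):
                problems.append(f"{key}[{index}]: timestamp is not an integer")
                continue
            if last_ts is not None and ts <= last_ts:
                problems.append(f"{key}[{index}]: timestamp {ts} does not increase")
            last_ts = ts
    return problems
```

An empty list means that none of these assumed checks failed. It does not guarantee that the metric values themselves are plausible.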