NPU Status Monitoring Metric File
File Description
- File content: Includes monitoring metrics such as the NPU rated frequency, current power, and temperature collected by the npu-smi tool or an automation script.
- Naming constraint: npu_smi_(\d+)_details.csv. For example, in npu_smi_0_details.csv, 0 is the NPU device ID.
- Constraints on the storage path:
  - Collection directory/environment_check/
  - ${path specified by --env_check}/

  For details, see Log Collection Directory Structure.
Create one NPU status monitoring metric file for each NPU.
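The naming constraint can be checked before a file is handed to the tool. The following is a minimal sketch, not part of MindCluster Ascend FaultDiag: the helper name `device_id_from_filename` is hypothetical, the dot before `csv` is escaped so that it matches only a literal ".", and the sketch assumes the whole file name must match the pattern.

```python
import re

# Pattern from the naming constraint above, with the dot escaped.
NAME_PATTERN = re.compile(r"npu_smi_(\d+)_details\.csv")


def device_id_from_filename(filename):
    """Return the NPU device ID encoded in the file name, or None if the name is invalid."""
    match = NAME_PATTERN.fullmatch(filename)
    return int(match.group(1)) if match else None


print(device_id_from_filename("npu_smi_0_details.csv"))  # 0
print(device_id_from_filename("npu_smi_details.csv"))    # None
```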
Collection Mode Description
MindCluster Ascend FaultDiag supports either of the following ways to collect NPU status monitoring metric files:
- Script-based collection: Run the npu_data_collect.py script to collect NPU status monitoring metric files. For details, see Log Collection Scripts.
- CLI-based collection: During a training or inference job, use the npu-smi tool to query the NPU status every 15 seconds.
CLI-based Collection
Example command:
/usr/local/bin/npu-smi info -t common -i ${device_id}
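The periodic query could be driven from a small script. The sketch below only illustrates the 15-second interval described for CLI-based collection. It is not the npu_data_collect.py script, and the names `build_command`, `run_command`, and `poll` are hypothetical. The command runner and the sleep function are passed in as parameters so that the loop can be tried on a machine without an NPU.

```python
import subprocess
import time

NPU_SMI = "/usr/local/bin/npu-smi"  # path taken from the example command


def build_command(device_id):
    """Build the argument list for the example query command."""
    return [NPU_SMI, "info", "-t", "common", "-i", str(device_id)]


def run_command(cmd):
    """Run the command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def poll(device_ids, rounds, interval=15, run=run_command, sleep=time.sleep):
    """Query every device once per round, yielding (timestamp, device_id, output)."""
    for i in range(rounds):
        for dev in device_ids:
            yield int(time.time()), dev, run(build_command(dev))
        if i < rounds - 1:  # no need to wait after the last round
            sleep(interval)
```

For example, `for ts, dev, out in poll([0, 1], rounds=4): ...` would query devices 0 and 1 four times, 15 seconds apart.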
Query each card in turn, record the values of NPU ID, Aicore Usage Rate(%), Aicore Freq(MHZ), Aicore curFreq(MHZ), Temperature(C), NPU Real-time Power(W), and HBM Usage Rate(%), and save them in a .csv file, as shown in Table 1.
Command output:
NPU ID : 0
Chip Count : 1
Chip ID : 0
Memory Usage Rate(%) : 6
HBM Usage Rate(%) : 0
Aicore Usage Rate(%) : 0
Aicore Freq(MHZ) : 900
Aicore curFreq(MHZ) : 900
Aicore Count : 30
Temperature(C) : 41
NPU Real-time Power(W) : 71.7
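Output in this "name : value" layout can be split into fields with a few lines of code. The sketch below assumes every metric is printed on its own line with a colon between the name and the value, as in the sample above. The actual spacing may differ between npu-smi versions, and the function name `parse_npu_smi_output` is hypothetical.

```python
def parse_npu_smi_output(text):
    """Parse 'name : value' lines into a dict mapping names to value strings."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[name.strip()] = value.strip()
    return fields


sample = """\
NPU ID : 0
Chip Count : 1
HBM Usage Rate(%) : 0
Aicore Usage Rate(%) : 0
Aicore Freq(MHZ) : 900
Aicore curFreq(MHZ) : 900
Temperature(C) : 41
NPU Real-time Power(W) : 71.7
"""
fields = parse_npu_smi_output(sample)
print(fields["Temperature(C)"])          # 41
print(fields["NPU Real-time Power(W)"])  # 71.7
```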
Save the metrics from each command output as one row of the .csv file. Table 1 shows an example.
| time | dev_id | hbm_rate | aicore_rate | rated_freq | freq | temp | power |
|---|---|---|---|---|---|---|---|
| 1683862905 | 2 | 0 | 0 | 1000 | 1000 | 42 | 70.3 |
| 1683862925 | 2 | 0 | 0 | 1000 | 1000 | 42 | 70.5 |
- time: collection time, as a UNIX timestamp in seconds.
- dev_id: corresponds to NPU ID in the command output.
- hbm_rate: on-chip memory usage, corresponding to HBM Usage Rate(%) in the command output.
- aicore_rate: AI Core usage, corresponding to Aicore Usage Rate(%) in the command output.
- rated_freq: NPU rated frequency, corresponding to Aicore Freq(MHZ) in the command output.
- freq: real-time NPU frequency, corresponding to Aicore curFreq(MHZ) in the command output.
- temp: NPU temperature, corresponding to Temperature(C) in the command output.
- power: NPU power consumption, corresponding to NPU Real-time Power(W) in the command output.
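The column mapping above can be written down as a lookup table and used to build CSV rows. The following is a minimal sketch under stated assumptions: the file is comma-separated with a header row in the column order of Table 1, the input is a dict of values keyed by the names in the command output, and the names `COLUMN_SOURCES`, `to_row`, and `write_rows` are hypothetical.

```python
import csv
import io

# CSV column -> field name in the command output, following the list above.
COLUMN_SOURCES = {
    "dev_id": "NPU ID",
    "hbm_rate": "HBM Usage Rate(%)",
    "aicore_rate": "Aicore Usage Rate(%)",
    "rated_freq": "Aicore Freq(MHZ)",
    "freq": "Aicore curFreq(MHZ)",
    "temp": "Temperature(C)",
    "power": "NPU Real-time Power(W)",
}
HEADER = ["time", *COLUMN_SOURCES]


def to_row(timestamp, fields):
    """Build one CSV row from a UNIX timestamp and the parsed output fields."""
    return [timestamp] + [fields[src] for src in COLUMN_SOURCES.values()]


def write_rows(stream, rows):
    """Write the header row followed by the data rows."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Values copied from the sample command output; the timestamp is illustrative.
fields = {
    "NPU ID": "0",
    "HBM Usage Rate(%)": "0",
    "Aicore Usage Rate(%)": "0",
    "Aicore Freq(MHZ)": "900",
    "Aicore curFreq(MHZ)": "900",
    "Temperature(C)": "41",
    "NPU Real-time Power(W)": "71.7",
}
buf = io.StringIO()
write_rows(buf, [to_row(1683862905, fields)])
print(buf.getvalue())
```

To write to disk instead of a string buffer, open the target file with `newline=""`, as the Python csv module recommends, and pass the file object to `write_rows`.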