NPU Status Monitoring Metric File

File Description

  • File content: Includes monitoring metrics such as the NPU rated frequency, current power, and temperature collected by the npu-smi tool or an automation script.
  • Naming constraint: npu_smi_(\d+)_details.csv. For example, in npu_smi_0_details.csv, 0 is the NPU device ID.
  • Constraints on the storage path:

Create a separate NPU status monitoring metric file for each NPU.
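The naming constraint above can be checked programmatically. The following is a minimal sketch, not part of the tool itself: the helper names are hypothetical, and escaping the dot and requiring a full match are assumptions about how strictly the pattern is meant to be applied.

```python
import re
from typing import Optional

# Pattern from the naming constraint. Escaping the dot and matching the
# whole name are assumptions about the intended strictness.
FILENAME_RE = re.compile(r"npu_smi_(\d+)_details\.csv")


def device_id_from_filename(name: str) -> Optional[int]:
    """Return the NPU device ID encoded in a metric file name, or None."""
    match = FILENAME_RE.fullmatch(name)
    return int(match.group(1)) if match else None


def filename_for_device(device_id: int) -> str:
    """Build the metric file name for a given NPU device ID."""
    return f"npu_smi_{device_id}_details.csv"
```

For example, device_id_from_filename("npu_smi_0_details.csv") yields 0, and a name that does not follow the constraint yields None.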

Collection Mode Description

MindCluster Ascend FaultDiag can collect NPU status monitoring metric files in either of the following ways:

  • Script-based collection: Run the npu_data_collect.py script to collect NPU status monitoring metric files. For details, see Log Collection Scripts.
  • CLI-based collection: While a training or inference job is running, use the npu-smi tool to query the NPU status every 15 seconds.
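The CLI-based mode amounts to a polling loop around npu-smi info -t common -i <device_id>. The sketch below is illustrative only: the function names, the rounds parameter, and the injectable query and sleep arguments are hypothetical and exist to make the loop easy to try out without an NPU.

```python
import subprocess
import time

NPU_SMI = "/usr/local/bin/npu-smi"  # path used in the example command
INTERVAL_SECONDS = 15               # polling interval stated in the text


def query_device(device_id):
    """Run 'npu-smi info -t common -i <device_id>' and return its stdout."""
    result = subprocess.run(
        [NPU_SMI, "info", "-t", "common", "-i", str(device_id)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def poll(device_ids, rounds, query=query_device, sleep=time.sleep):
    """Query every device once per round and return the raw samples.

    Each sample is a (unix_time, device_id, command_output) tuple.
    """
    samples = []
    for round_index in range(rounds):
        for device_id in device_ids:
            samples.append((int(time.time()), device_id, query(device_id)))
        if round_index < rounds - 1:
            sleep(INTERVAL_SECONDS)  # wait before the next round
    return samples
```

In a real job the loop would typically run until the job ends rather than for a fixed number of rounds; the fixed count here only keeps the sketch bounded.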

CLI-based Collection

Example command:

/usr/local/bin/npu-smi info -t common -i ${device_id}

Query all cards in sequence. For each card, record the values of NPU ID, Aicore Usage Rate, Aicore Freq(MHZ), Aicore curFreq(MHZ), Temperature, NPU Real-time Power(W), and HBM Usage Rate, and save them in a .csv file in the format shown in Table 1.

Command output:

        NPU ID                         : 0
        Chip Count                     : 1
        Chip ID                        : 0
        Memory Usage Rate(%)           : 6
        HBM Usage Rate(%)              : 0
        Aicore Usage Rate(%)           : 0
        Aicore Freq(MHZ)               : 900
        Aicore curFreq(MHZ)            : 900
        Aicore Count                   : 30
        Temperature(C)                 : 41
        NPU Real-time Power(W)         : 71.7

Save the metrics from each command output as one row of the .csv file.
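Extracting one row from a command output like the one above can be sketched as follows. This is a minimal example under assumptions: the function name is hypothetical, the label-to-column mapping follows the field descriptions under Table 1, values are kept as the strings printed by npu-smi, and the output is assumed to keep the "label : value" layout shown above.

```python
import time

# Labels printed by npu-smi mapped to the column names used in Table 1.
LABEL_TO_COLUMN = {
    "NPU ID": "dev_id",
    "HBM Usage Rate(%)": "hbm_rate",
    "Aicore Usage Rate(%)": "aicore_rate",
    "Aicore Freq(MHZ)": "rated_freq",
    "Aicore curFreq(MHZ)": "freq",
    "Temperature(C)": "temp",
    "NPU Real-time Power(W)": "power",
}


def parse_npu_smi_output(text, timestamp=None):
    """Turn one 'npu-smi info -t common' output into a row dictionary."""
    # Use the supplied timestamp, or the current UNIX time in seconds.
    row = {"time": int(time.time()) if timestamp is None else int(timestamp)}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a "label : value" line
        column = LABEL_TO_COLUMN.get(label.strip())
        if column is not None:
            row[column] = value.strip()  # labels not in the mapping are skipped
    return row
```

Fields that are not part of the file format, such as Chip Count or Memory Usage Rate(%), are ignored by this sketch.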

Table 1 Format

        time        dev_id  hbm_rate  aicore_rate  rated_freq  freq  temp  power
        1683862905  2       0         0            1000        1000  42    70.3
        1683862925  2       0         0            1000        1000  42    70.5

  • time: collection time, expressed as a UNIX timestamp in seconds.
  • dev_id: corresponds to NPU ID in the command output.
  • hbm_rate: on-chip memory usage, corresponding to HBM Usage Rate(%) in the command output.
  • aicore_rate: AI Core usage, corresponding to Aicore Usage Rate(%) in the command output.
  • rated_freq: NPU rated frequency, corresponding to Aicore Freq(MHZ) in the command output.
  • freq: real-time NPU frequency, corresponding to Aicore curFreq(MHZ) in the command output.
  • temp: NPU temperature, corresponding to Temperature(C) in the command output.
  • power: NPU power consumption, corresponding to NPU Real-time Power(W) in the command output.
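Writing rows in the Table 1 layout can be sketched with the Python standard library. The helper name is hypothetical, and two details are assumptions rather than documented requirements: that the file starts with a header row containing the column names, and that fields are separated by commas.

```python
import csv
import os

# Column order as listed in Table 1.
COLUMNS = ["time", "dev_id", "hbm_rate", "aicore_rate",
           "rated_freq", "freq", "temp", "power"]


def append_row(csv_path, row):
    """Append one sample to the metric file, writing the header if needed."""
    # Assumption: a header row is expected at the top of the file.
    write_header = (not os.path.exists(csv_path)
                    or os.path.getsize(csv_path) == 0)
    with open(csv_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

A call such as append_row("npu_smi_2_details.csv", {"time": 1683862905, "dev_id": 2, "hbm_rate": 0, "aicore_rate": 0, "rated_freq": 1000, "freq": 1000, "temp": 42, "power": 70.3}) would produce the first data row of Table 1.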