Monitoring Metric File of NPU Network Port Statistics

File Description

  • File content: Statistics on packets sent and received by NPU network ports, collected by hccn_tool or an automation script.
  • Naming constraint: npu_(\d+)_details.csv, for example, npu_0_details.csv, where 0 indicates the NPU ID.
  • Constraints on the storage path:

You need to create a monitoring metric file of network port statistics for each NPU.
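As an illustration of the naming constraint, the following minimal Python sketch builds and validates file names of the form npu_(\d+)_details.csv. The helper names metric_file_name and parse_npu_id are hypothetical and are not part of MindCluster Ascend FaultDiag. The sketch also assumes that the dot in the pattern is meant as a literal ".".

```python
import re
from typing import Optional

# Pattern taken from the naming constraint. The dot is escaped here so that it
# matches a literal "." (an assumption about the intended meaning).
NPU_FILE_PATTERN = re.compile(r"npu_(\d+)_details\.csv")


def metric_file_name(npu_id: int) -> str:
    """Build the expected metric file name for one NPU (hypothetical helper)."""
    if npu_id < 0:
        raise ValueError("npu_id must be non-negative")
    return f"npu_{npu_id}_details.csv"


def parse_npu_id(file_name: str) -> Optional[int]:
    """Return the NPU ID encoded in file_name, or None if the name is invalid."""
    match = NPU_FILE_PATTERN.fullmatch(file_name)
    return int(match.group(1)) if match else None
```

For example, parse_npu_id("npu_0_details.csv") returns 0, and a name that does not follow the constraint returns None.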

Collection Mode Description

MindCluster Ascend FaultDiag can collect logs of a training or inference job in either of the following ways:

  • Script-based collection: Run the net_data_collect.py script to collect the monitoring metric file of NPU network port statistics. For details, see Log Collection Scripts.
  • CLI-based collection: During a training or inference job, use the hccn_tool tool to query the NPU network port statistics every 15 seconds.

CLI-based Collection

Command example:

/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -stat -g

Record every metric and its value from the command output and save them in a .csv file, as shown in Table 1.
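The query can be repeated at the 15-second interval mentioned earlier. The following is a minimal Python sketch of such a polling loop, not the implementation used by net_data_collect.py. The function names, the samples parameter, and the injectable runner, sleep, and clock arguments are assumptions made for illustration. The sketch also assumes that the timestamp is recorded as Unix epoch seconds, which is what the values in Table 1 appear to be.

```python
import subprocess
import time

# Path and options are copied from the command example in this document.
HCCN_TOOL = "/usr/local/Ascend/driver/tools/hccn_tool"
POLL_INTERVAL_S = 15  # query interval stated in this document


def build_command(device_id):
    """Return the argument list for the command example."""
    return [HCCN_TOOL, "-i", str(device_id), "-stat", "-g"]


def run_hccn_tool(device_id):
    """Run the command and return its standard output as text."""
    result = subprocess.run(
        build_command(device_id), capture_output=True, text=True, check=True
    )
    return result.stdout


def poll(device_id, samples, runner=run_hccn_tool, sleep=time.sleep, clock=time.time):
    """Yield (timestamp, raw_output) pairs, one per query.

    samples is a hypothetical fixed query count; a real collector would
    more likely run for the duration of the job.
    """
    for index in range(samples):
        if index > 0:
            sleep(POLL_INTERVAL_S)  # wait between consecutive queries
        # Assumption: timestamps are Unix epoch seconds.
        yield int(clock()), runner(device_id)
```

Passing runner, sleep, and clock as arguments keeps the loop testable on a machine without an NPU driver installed.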

Command output:

packet statistics:
mac_tx_mac_pause_num:0
mac_rx_mac_pause_num:0
mac_tx_pfc_pkt_num:0
...
roce_qp_status_err_num:0
nic_tx_all_pkg_num:122404
nic_tx_all_oct_num:16921741
nic_rx_all_pkg_num:6414803
nic_rx_all_oct_num:482237805

Each parameter name in the command output is used as a column header, and the corresponding parameter value is used as the cell value in that column. The result is saved as a .csv file.
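The mapping from command output to header and value pairs can be sketched in Python as follows. The function name parse_stat_output is hypothetical. The sketch assumes that every metric line has the form name:value, as in the command output shown in this document, and that lines without a value (such as the "packet statistics:" heading) can be skipped. Values are kept as strings so that no assumption is made about their type.

```python
from typing import Dict


def parse_stat_output(text: str) -> Dict[str, str]:
    """Convert name:value lines into a dict that preserves output order."""
    metrics: Dict[str, str] = {}
    for line in text.splitlines():
        name, separator, value = line.strip().partition(":")
        # Skip blank lines, lines without ":" and headings with no value,
        # such as "packet statistics:".
        if not separator or not value.strip():
            continue
        metrics[name.strip()] = value.strip()
    return metrics
```

The keys of the returned dict can serve as column headers, and the values as the cells of one row.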

Table 1 Storage format

timestamp  | mac_tx_mac_pause_num | ... | mac_rx_mac_pause_num | mac_tx_pfc_pkt_num | mac_tx_pfc_pri0_pkt_num | ...
1684460336 | 0                    | ... | 0                    | 0                  | 0                       | ...
1684460354 | 0                    | ... | 0                    | 0                  | 0                       | ...
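A file in the storage format of Table 1 could be written with Python's csv module, as in the following minimal sketch. The function name append_sample and its parameters are hypothetical. The sketch assumes that every query returns the same metric names in the same order, so that the header written for the first row stays valid for later rows.

```python
import csv
import os
from typing import Dict


def append_sample(csv_path: str, timestamp: int, metrics: Dict[str, str]) -> None:
    """Append one row; write the header first if the file is new or empty."""
    fieldnames = ["timestamp"] + list(metrics)
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()  # metric names become the column headers
        writer.writerow({"timestamp": timestamp, **metrics})
```

Calling append_sample once per query adds one row per timestamp to the file, for example npu_0_details.csv for NPU 0.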