Monitoring Metric File of NPU Network Port Statistics

File Description

File content: Includes statistics on packets sent and received by NPU network ports collected by hccn_tool or an automation script.
Naming constraint: npu_(\d+)_details.csv, for example, npu_0_details.csv, where 0 indicates the NPU ID.
Constraints on the storage path:
- Collection directory/environment_check/
- ${Paths specified by --env_check}/
- For details, see Log Collection Directory Structure.

You need to create a monitoring metric file of network port statistics for each NPU.
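The naming constraint can be checked programmatically. The following is a minimal sketch, not part of the tool itself: the helper names `details_file_name` and `parse_npu_id` are hypothetical, and the pattern is the one given above with the dot escaped and matched against the whole file name.

```python
import re
from typing import Optional

# Pattern from the naming constraint, with the dot escaped so that it
# matches a literal "." (an assumption about the intended meaning).
DETAILS_FILE_RE = re.compile(r"npu_(\d+)_details\.csv")


def details_file_name(npu_id: int) -> str:
    """Build the expected metric file name for one NPU (hypothetical helper)."""
    return f"npu_{npu_id}_details.csv"


def parse_npu_id(file_name: str) -> Optional[int]:
    """Return the NPU ID encoded in a file name, or None if the name does not match."""
    match = DETAILS_FILE_RE.fullmatch(file_name)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(details_file_name(0))               # npu_0_details.csv
    print(parse_npu_id("npu_7_details.csv"))  # 7
    print(parse_npu_id("npu_details.csv"))    # None
```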

Collection Mode Description

MindCluster Ascend FaultDiag can collect logs of a training or inference job in either of the following ways:

Script-based collection: Run the net_data_collect.py script to collect the monitoring metric file of NPU network port statistics. For details, see Log Collection Scripts.
CLI-based collection: During a training or inference job, use the hccn_tool tool to query the NPU network port statistics every 15 seconds.
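The CLI-based mode amounts to a polling loop around the hccn_tool command listed under CLI-based Collection. The sketch below is illustrative only and is not the net_data_collect.py script: the function names, the `samples` parameter, and the injectable `runner` and `sleep` arguments are assumptions made so that the loop can be exercised without an NPU.

```python
import subprocess
import time
from typing import Callable, Iterator, List

# Tool path and options are taken from the command example in this section.
HCCN_TOOL = "/usr/local/Ascend/driver/tools/hccn_tool"
POLL_INTERVAL_S = 15  # query interval stated in this section


def build_command(device_id: int) -> List[str]:
    """Argument list equivalent to: hccn_tool -i ${device_id} -stat -g"""
    return [HCCN_TOOL, "-i", str(device_id), "-stat", "-g"]


def run_command(command: List[str]) -> str:
    """Run the command and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def poll(
    device_id: int,
    samples: int,
    runner: Callable[[List[str]], str] = run_command,
    sleep: Callable[[float], None] = time.sleep,
    interval: float = POLL_INTERVAL_S,
) -> Iterator[str]:
    """Yield the raw command output `samples` times, pausing between queries."""
    command = build_command(device_id)
    for index in range(samples):
        if index > 0:
            sleep(interval)  # wait between queries, not before the first one
        yield runner(command)
```

Because the loop sleeps for a fixed interval after each query, the effective period is the interval plus the time the command takes to run, so consecutive timestamps can be slightly more than 15 seconds apart.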

CLI-based Collection

Command example:

/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -stat -g

Record all indicators and their values, and save them in a .csv file in the format shown in Table 1.

Command output:

packet statistics:
mac_tx_mac_pause_num:0
mac_rx_mac_pause_num:0
mac_tx_pfc_pkt_num:0
...
roce_qp_status_err_num:0
nic_tx_all_pkg_num:122404
nic_tx_all_oct_num:16921741
nic_rx_all_pkg_num:6414803
nic_rx_all_oct_num:482237805

The parameter names in the command output are used as the column headers. The parameter values from each query are saved as one row of the .csv file, preceded by the timestamp of the query.
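The mapping from command output to header and value pairs can be sketched as follows. This is a minimal example under stated assumptions: `parse_stat_output` is a hypothetical helper, every indicator line is assumed to have the form `name:value`, and values are kept as strings rather than converted to numbers.

```python
from typing import Dict


def parse_stat_output(output: str) -> Dict[str, str]:
    """Turn 'name:value' lines into a name -> value mapping, in output order."""
    stats: Dict[str, str] = {}
    for line in output.splitlines():
        name, sep, value = line.strip().partition(":")
        # Skip blank lines, lines without a colon, and headings such as
        # 'packet statistics:' that carry no value after the colon.
        if not sep or not name.strip() or not value.strip():
            continue
        stats[name.strip()] = value.strip()
    return stats


if __name__ == "__main__":
    sample = (
        "packet statistics:\n"
        "mac_tx_mac_pause_num:0\n"
        "mac_rx_mac_pause_num:0\n"
        "nic_tx_all_pkg_num:122404\n"
    )
    for key, val in parse_stat_output(sample).items():
        print(key, val)
```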

**Table 1** Storage format
timestamp	mac_tx_mac_pause_num	...	mac_rx_mac_pause_num	mac_tx_pfc_pkt_num	mac_tx_pfc_pri0_pkt_num	...
1684460336	0	...	0	0	0	...
1684460354	0	...	0	0	0	...
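One way to produce a file in the layout of Table 1 is to append one row per query, writing the header only when the file is new. The sketch below makes several assumptions that the section does not state: `append_sample` is a hypothetical helper, the delimiter is a comma (the default of Python's `csv` module), the timestamp is Unix time in whole seconds (inferred from the sample values), and every query returns the same indicators in the same order.

```python
import csv
import os
import time
from typing import Dict, Optional


def append_sample(csv_path: str, stats: Dict[str, str],
                  timestamp: Optional[int] = None) -> None:
    """Append one row of indicator values; write the header if the file is new or empty."""
    if timestamp is None:
        timestamp = int(time.time())  # assumed format: Unix time in seconds
    # Column order: timestamp first, then indicators in the order supplied.
    fieldnames = ["timestamp", *stats.keys()]
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerow({"timestamp": timestamp, **stats})
```

For example, calling `append_sample("npu_0_details.csv", {"mac_tx_mac_pause_num": "0"}, timestamp=1684460336)` on a new file writes a header line followed by one data row.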
