Monitoring Metric File of NPU Network Port Statistics
File Description
- File content: Statistics on packets sent and received by NPU network ports, collected with hccn_tool or an automation script.
- Naming constraint: npu_(\d+)_details.csv, for example, npu_0_details.csv, where 0 indicates the NPU ID.
- Constraints on the storage path:
  - Collection directory/environment_check/
  - ${Paths specified by --env_check}/
  For details, see Log Collection Directory Structure.
You need to create a monitoring metric file of network port statistics for each NPU.
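The naming constraint can be checked before the files are handed to the tool. The following is a minimal sketch, assuming the pattern npu_(\d+)_details.csv must match the whole file name; the helper names npu_id_from_filename and missing_metric_files are hypothetical and not part of MindCluster Ascend FaultDiag.

```python
import re
from pathlib import Path

# Pattern taken from the naming constraint; the dot before "csv" is escaped
# so that it only matches a literal dot.
NAME_PATTERN = re.compile(r"npu_(\d+)_details\.csv")


def npu_id_from_filename(path):
    """Return the NPU ID encoded in the file name, or None if the name does not match."""
    match = NAME_PATTERN.fullmatch(Path(path).name)
    return int(match.group(1)) if match else None


def missing_metric_files(directory, npu_ids):
    """Return the NPU IDs (hypothetical helper) that have no metric file in the directory."""
    found = {
        npu_id_from_filename(entry)
        for entry in Path(directory).iterdir()
        if entry.is_file()
    }
    return sorted(set(npu_ids) - found)
```

For example, calling missing_metric_files on the environment_check directory with the IDs of all NPUs on the server lists the NPUs that still need a file.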
Collection Mode Description
MindCluster Ascend FaultDiag can collect logs of a training or inference job in either of the following ways:
- Script-based collection: Run the net_data_collect.py script to collect the monitoring metric file of NPU network port statistics. For details, see Log Collection Scripts.
- CLI-based collection: During a training or inference job, use the hccn_tool tool to query the NPU network port statistics every 15 seconds.
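The 15-second query cycle of the CLI-based mode can be automated with a small loop. The following is a rough sketch, not the net_data_collect.py script: it assumes the hccn_tool path and options documented for CLI-based collection, and the function names query_port_stats and poll, as well as the handle, max_samples, query, and sleep parameters, are hypothetical.

```python
import subprocess
import time

# Path and options as documented for CLI-based collection.
HCCN_TOOL = "/usr/local/Ascend/driver/tools/hccn_tool"


def query_port_stats(device_id):
    """Run `hccn_tool -i <device_id> -stat -g` and return its standard output."""
    result = subprocess.run(
        [HCCN_TOOL, "-i", str(device_id), "-stat", "-g"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits with an error
    )
    return result.stdout


def poll(device_id, handle, interval_s=15.0, max_samples=None,
         query=query_port_stats, sleep=time.sleep):
    """Query the statistics every interval_s seconds and pass each output to handle.

    max_samples=None keeps polling until the process is interrupted.
    query and sleep are injectable so the loop can be exercised without an NPU.
    """
    taken = 0
    while max_samples is None or taken < max_samples:
        handle(query(device_id))
        taken += 1
        # Do not sleep after the final sample.
        if max_samples is None or taken < max_samples:
            sleep(interval_s)
    return taken
```

In practice the loop would run for the duration of the job, once per NPU, with handle writing each output to that NPU's .csv file.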
CLI-based Collection
Command example:
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -stat -g
Record all metrics and their values, and save them in a .csv file, as shown in Table 1.
Command output:
packet statistics:
mac_tx_mac_pause_num:0
mac_rx_mac_pause_num:0
mac_tx_pfc_pkt_num:0
...
roce_qp_status_err_num:0
nic_tx_all_pkg_num:122404
nic_tx_all_oct_num:16921741
nic_rx_all_pkg_num:6414803
nic_rx_all_oct_num:482237805
Each parameter name in the command output is used as a column header of the table, and the corresponding parameter value is used as the cell value. The table is saved as a .csv file.
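The conversion from command output to .csv can be sketched as follows. This is an illustration only: it assumes every statistic appears as a whitespace-separated name:value token, that each query contributes one row, and that every query returns the same parameter names in the same order. The function names parse_port_stats and append_row are hypothetical, and Table 1 remains the authority on the expected columns.

```python
import csv
import os


def parse_port_stats(output):
    """Turn name:value tokens from the command output into a dict (insertion-ordered)."""
    stats = {}
    for token in output.split():
        name, sep, value = token.rpartition(":")
        # Skip anything that is not a complete name:value pair,
        # such as the "packet statistics:" banner.
        if sep and name and value:
            stats[name] = value
    return stats


def append_row(csv_path, stats):
    """Append one row of values; write the header first if the file is new or empty."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as handle:
        # Assumption: later queries return the same names, so the header written
        # for the first row stays valid for the rows that follow.
        writer = csv.DictWriter(handle, fieldnames=list(stats))
        if is_new:
            writer.writeheader()
        writer.writerow(stats)
```

Values are kept as the strings printed by the tool, so nothing is lost or reinterpreted on the way into the file.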