NPU Environment Check File After Training or Inference
File Description
- After a training or inference job finishes, use hccn_tool or a script to query and record the IP address, mask, received and sent packet statistics, and historical link statistics of each NPU network port. Then use npu-smi or a script to query the processor health information.
- Naming constraint: npu_info_after.txt
- Constraints on the storage path:
- Collection directory/environment_check/
- ${path specified by --env_check}/
- For details, see Log Collection Directory Structure.
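The naming and path constraints above can be sketched as a small helper. This is only an illustration: the function name env_check_file and its collect_dir argument are assumptions, and the collection directory itself is whatever you use for log collection.

```shell
#!/bin/bash
# Sketch only: env_check_file and collect_dir are illustrative names,
# not part of MindCluster Ascend FaultDiag.

# Create <collect_dir>/environment_check/ if needed and print the full
# path of the npu_info_after.txt file inside it.
env_check_file() {
    local collect_dir="$1"
    local dir="${collect_dir}/environment_check"
    mkdir -p "${dir}"
    echo "${dir}/npu_info_after.txt"
}
```

For example, save_file=$(env_check_file /path/to/collection_dir) would give a save_file value that the collection commands in the rest of this section can append to.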
Collection Mode Description
MindCluster Ascend FaultDiag can collect NPU environment check files after training or inference jobs are finished in either of the following ways:
- Script-based collection: Run the npu_info_collect.sh script to collect NPU environment check files. For details, see Log Collection Scripts.
- CLI-based collection: Run commands to collect NPU environment check files.
CLI-based Collection
- After training or inference jobs are finished, run the following hccn_tool commands to query the NPU network port information, and save each query command and its result to the npu_info_after.txt file. The commands and example outputs are as follows:
- Query the network health status.
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -net_health -g
Command output:
net health status: Init
- Query the RoCE physical link connection status.
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link -g
Command output:
link status: UP
- Query information about the RoCE network optical module.
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -optical -g
Command output:
optical info:
present : not present
...
Tx Power : 4.4035 mW
Rx Power : 1.0189 mW
Vcc High Thres : 3465.00 mV
Vcc Low Thres : 3135.00 mV
Temp High Thres : 70 C
Temp Low Thres : 0 C
TxPower High Thres : 3.5481 mW
TxPower Low Thres : 0.2818 mW
RxPower High Thres : 3.5481 mW
RxPower Low Thres : 0.1445 mW
Tx Bias : 7.9360 mA
Tx Los Flag : 0x0
Rx Los Flag : 0xff
Tx LoL Flag : 0x0
Rx LoL Flag : 0xff
...
- Query the TLS switch configuration.
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -tls -g | grep switch
Command output:
dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
- Query the FEC mode.
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -fec -g
Command output:
fec mode: rs FEC mode
- Query the IP address and mask.
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -ip -g
Command output:
ipaddr:10.xx.xx.10 netmask:255.255.255.0
- Query statistics about sent and received packets.
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -stat -g
Command output:
packet statistics:
mac_tx_mac_pause_num:0
mac_rx_mac_pause_num:0
mac_tx_pfc_pkt_num:0
...
roce_qp_status_err_num:0
nic_tx_all_pkg_num:122404
nic_tx_all_oct_num:16921741
nic_rx_all_pkg_num:6414803
nic_rx_all_oct_num:482237805
- Query the historical link statistics of the network port.
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link_stat -g
Command output:
[device 0]current time : Wed Jun 7 10:08:28 2023
[device 0]link up count : 2
[device 0]link change records :
[device 0] Tue Jun 6 16:32:12 2023 LINK UP
[device 0] Tue Jun 6 16:32:10 2023 LINK DOWN
[device 0] Tue Jun 6 16:31:55 2023 LINK UP
The following is an example of file storage of device 0. You need to collect information about all devices.
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -net_health -g
net health status: Init

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -link -g
link status: UP

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -optical -g
optical info:
present : not present
...
Tx Power : 4.4035 mW
Rx Power : 1.0189 mW
Vcc High Thres : 3465.00 mV
Vcc Low Thres : 3135.00 mV
Temp High Thres : 70 C
Temp Low Thres : 0 C
TxPower High Thres : 3.5481 mW
TxPower Low Thres : 0.2818 mW
RxPower High Thres : 3.5481 mW
RxPower Low Thres : 0.1445 mW
Tx Bias : 7.9360 mA
Tx Los Flag : 0x0
Rx Los Flag : 0xff
Tx LoL Flag : 0x0
Rx LoL Flag : 0xff
...

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -tls -g | grep switch
dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -fec -g
fec mode: rs FEC mode

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
ipaddr:10.xx.xx.10
netmask:255.255.255.0

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
packet statistics:
mac_tx_mac_pause_num:0
mac_rx_mac_pause_num:0
mac_tx_pfc_pkt_num:0
...
roce_qp_status_err_num:0
nic_tx_all_pkg_num:122404
nic_tx_all_oct_num:16921741
nic_rx_all_pkg_num:6414803
nic_rx_all_oct_num:482237805

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -link_stat -g
[device 0]current time : Wed Jun 7 10:08:28 2023
[device 0]link up count : 2
[device 0]link change records :
[device 0] Tue Jun 6 16:32:12 2023 LINK UP
[device 0] Tue Jun 6 16:32:10 2023 LINK DOWN
[device 0] Tue Jun 6 16:31:55 2023 LINK UP
The result of each collection command must be separated by one line. Example:
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
XXXX

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
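Running the eight hccn_tool queries by hand for every device is error-prone, so the steps above can be sketched as a loop. This is a minimal illustration, not the npu_info_collect.sh script: the names HCCN_TOOL, record_cmd, and collect_hccn_info are assumptions, device IDs are assumed to run from 0 to device_count-1, and a blank line is assumed to be the required separator between results.

```shell
#!/bin/bash
# Sketch only: HCCN_TOOL, record_cmd and collect_hccn_info are illustrative
# names. Device IDs are assumed to be 0..device_count-1.

HCCN_TOOL="${HCCN_TOOL:-/usr/local/Ascend/driver/tools/hccn_tool}"

# Append the command line, its output, and one blank separator line.
record_cmd() {
    local save_file="$1" cmd="$2"
    {
        echo "${cmd}"
        # Run through bash -c so pipelines work; keep going if a query fails.
        bash -c "${cmd}" 2>&1 || true
        echo
    } >> "${save_file}"
}

# Run the hccn_tool queries from this section, in order, for each device.
collect_hccn_info() {
    local save_file="$1" device_count="$2" id query
    local queries=("-net_health -g" "-link -g" "-optical -g"
                   "-tls -g | grep switch" "-fec -g" "-ip -g"
                   "-stat -g" "-link_stat -g")
    for ((id = 0; id < device_count; id++)); do
        for query in "${queries[@]}"; do
            record_cmd "${save_file}" "${HCCN_TOOL} -i ${id} ${query}"
        done
    done
}
```

On a server with eight devices, calling collect_hccn_info npu_info_after.txt 8 would append all eight queries for devices 0 to 7 to the file. Appending rather than overwriting lets the npu-smi and version queries be written to the same file afterwards.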
- Use npu-smi to query the processor health information after training or inference jobs are finished and save the query command and result to the npu_info_after.txt file.
- Query the basic information about the training or inference devices.
/usr/local/bin/npu-smi info
Command output:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1                 Version: 24.1.rc1                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
...
+===========================+===============+====================================================+
| 7     xxx                 | OK            | 67.0        44                0    / 0             |
| 0                         | 0000:3D:00.0  | 0           2505 / 15567      0    / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
...
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
- Query ECC of the high-bandwidth memory.
/usr/local/bin/npu-smi info -i ${device_id} -t ecc
Command output:
NPU ID : 1
Chip Count : 1
- Query the basic information about the hardware.
/usr/local/bin/npu-smi info -i ${device_id} -t board
Command output:
NPU ID : 0
Software Version : 23.0.5
Firmware Version : 7.1.0.7.220
Compatibility : OK
Board ID : 0x02
PCB ID : A
BOM ID : 1
PCIe Bus Info : 0000:61:00.0
Slot ID : 0
Class ID : NA
PCI Vendor ID : 0x19e5
PCI Device ID : 0xD801
Subsystem Vendor ID : 0x0200
Subsystem Device ID : 0x0100
Chip Count : 1
- Query the basic hardware information and the name of the specified device.
/usr/local/bin/npu-smi info -i ${device_id} -c 0 -t board
Command output:
NPU ID : 0
Chip ID : 0
Chip Type : Ascend
Chip Name : xxx
Chip Version : V1
Board ID : 0x02
PCB ID : NA
BOM ID : 1
VDie ID : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
NDie ID : 27216594 20401010 4E10C8D4 14CC040A A4102003
Chip Position ID : 0
PCIe Bus Info : 0000:61:00.0
Firmware Version : 7.1.0.7.220
- Query the memory usage.
/usr/local/bin/npu-smi info -i ${device_id} -t usages
Command output:
NPU ID : 0
Chip Count : 1
DDR Capacity(MB) : 13553
DDR Usage Rate(%) : 6
DDR Hugepages Total(page) : 0
DDR Hugepages Usage Rate(%) : 0
HBM Capacity(MB) : 32768
HBM Usage Rate(%) : 0
Aicore Usage Rate(%) : 0
Aicpu Usage Rate(%) : 0
Ctrlcpu Usage Rate(%) : 0
DDR Bandwidth Usage Rate(%) : 0
HBM Bandwidth Usage Rate(%) : 0
Chip ID : 0
- Query the processor health information.
/usr/local/bin/npu-smi info -i ${device_id} -c 0 -t health
Command output:
Health Status : OK
Error Code : NA
Error Information : NA
The following is an example of file storage. You need to collect information about all devices.
/usr/local/bin/npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.5                   Version: 23.0.5                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     xxx                 | OK            | 73.1        37                0    / 0             |
| 0                         | 0000:61:00.0  | 0           920  / 13553      0    / 32768         |
+===========================+===============+====================================================+
...
+===========================+===============+====================================================+
| 7     xxx                 | OK            | 67.0        38                0    / 0             |
| 0                         | 0000:3D:00.0  | 0           2346 / 15567      0    / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
...
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+

/usr/local/bin/npu-smi info -i 0 -c 0 -t health
Health Status : OK
Error Code : NA
Error Information : NA

/usr/local/bin/npu-smi info -i 0 -t ecc
NPU ID : 0
Chip Count : 1
DDR Single Bit Error Count : 0
DDR Double Bit Error Count : 0
DDR Single Bit Aggregate Total Err Cnt : 0
DDR Double Bit Aggregate Total Err Cnt : 0
DDR Single Bit Isolated Pages Count : 0
DDR Double Bit Isolated Pages Count : 0
HBM Single Bit Error Count : 0
HBM Double Bit Error Count : 0
HBM Single Bit Aggregate Total Err Cnt : 0
HBM Double Bit Aggregate Total Err Cnt : 0
HBM Single Bit Isolated Pages Count : 0
HBM Double Bit Isolated Pages Count : 0
Chip ID : 0

/usr/local/bin/npu-smi info -i 0 -t board
NPU ID : 0
Software Version : 23.0.5
Firmware Version : 7.1.0.7.220
Compatibility : OK
Board ID : 0x02
PCB ID : A
BOM ID : 1
PCIe Bus Info : 0000:61:00.0
Slot ID : 0
Class ID : NA
PCI Vendor ID : 0x19e5
PCI Device ID : 0xD801
Subsystem Vendor ID : 0x0200
Subsystem Device ID : 0x0100
Chip Count : 1

/usr/local/bin/npu-smi info -i 0 -c 0 -t board
NPU ID : 0
Chip ID : 0
Chip Type : Ascend
Chip Name : xxx
Chip Version : V1
Board ID : 0x02
PCB ID : NA
BOM ID : 1
VDie ID : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
NDie ID : 27216594 20401010 4E10C8D4 14CC040A A4102003
Chip Position ID : 0
PCIe Bus Info : 0000:61:00.0
Firmware Version : 7.1.0.7.220

/usr/local/bin/npu-smi info -i 0 -t usages
NPU ID : 0
Chip Count : 1
DDR Capacity(MB) : 13553
DDR Usage Rate(%) : 6
DDR Hugepages Total(page) : 0
DDR Hugepages Usage Rate(%) : 0
HBM Capacity(MB) : 32768
HBM Usage Rate(%) : 0
Aicore Usage Rate(%) : 0
Aicpu Usage Rate(%) : 0
Ctrlcpu Usage Rate(%) : 0
DDR Bandwidth Usage Rate(%) : 0
HBM Bandwidth Usage Rate(%) : 0
Chip ID : 0

/usr/local/bin/npu-smi info -i 0 -c 0 -t health
Health Status : OK
Error Code : NA
Error Information : NA
...
The result of each collection command must be separated by one line. Example:
/usr/local/bin/npu-smi info -i 0 -c 0 -t health
XXXX

/usr/local/bin/npu-smi info -i 1 -c 0 -t health
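The npu-smi queries above can be collected for several devices in the same way. The following is a sketch under stated assumptions: NPU_SMI, record_cmd, and collect_npu_smi_info are illustrative names, the device IDs are passed in by the caller, chip ID 0 is assumed as in the examples, and a blank line is assumed to be the required separator.

```shell
#!/bin/bash
# Sketch only: NPU_SMI, record_cmd and collect_npu_smi_info are illustrative
# names. Chip ID 0 is assumed, as in the examples of this section.

NPU_SMI="${NPU_SMI:-/usr/local/bin/npu-smi}"

# Append the command line, its output, and one blank separator line.
record_cmd() {
    local save_file="$1" cmd="$2"
    {
        echo "${cmd}"
        # Keep collecting even if one query fails.
        bash -c "${cmd}" 2>&1 || true
        echo
    } >> "${save_file}"
}

# Record "npu-smi info" once, then the per-device queries for each ID given.
collect_npu_smi_info() {
    local save_file="$1"; shift
    local id query
    local queries=("-t ecc" "-t board" "-c 0 -t board"
                   "-t usages" "-c 0 -t health")
    record_cmd "${save_file}" "${NPU_SMI} info"
    for id in "$@"; do
        for query in "${queries[@]}"; do
            record_cmd "${save_file}" "${NPU_SMI} info -i ${id} ${query}"
        done
    done
}
```

For instance, collect_npu_smi_info npu_info_after.txt 0 1 2 3 4 5 6 7 would record the overview once and then the five per-device queries for each of the eight IDs.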
- After training or inference jobs are finished, run the following commands to query the system time and the driver, firmware, CANN, and AI framework versions, and save each query command and its result to the npu_info_after.txt file. The commands and example outputs are as follows:
- Query the current system time.
datetime=$(date "+%Y-%m-%d %H:%M:%S")
echo "Datetime: $datetime">>${save_file}
echo -e "\n">>${save_file}
Command output:
Datetime: 2024-06-26 01:13:36
- Query the driver version.
cat /usr/local/Ascend/driver/version.info
Command output:
Version=24.1.rc1
ascendhal_version=7.35.19
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C15SPC006B220
compatible_version=[V100R001C30],[V100R001C13],[V100R001C15],[V100R001C17]
compatible_version_fw=[7.0.0,7.2.99]
- Query the firmware version.
cat /usr/local/Ascend/firmware/version.info
Command output:
Version=7.1.0.11.220
firmware_version=1.0
package_version=23.0.7
compatible_version_drv=[23.0.rc3,23.0.rc3.],[23.0.0,23.0.0.]
- Query the CANN version (AArch64).
cat /usr/local/Ascend/cann/aarch64-linux/ascend_toolkit_install.info
Command output:
package_name=Ascend-cann-toolkit
version=8.5.0
innerversion=V100R001C25SPC001B212
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=aarch64
os=linux
path=/usr/local/Ascend/cann-8.5.0/aarch64-linux
- Query the CANN version (x86_64).
cat /usr/local/Ascend/cann/x86_64-linux/ascend_toolkit_install.info
Command output:
package_name=Ascend-cann-toolkit
version=8.5.0
innerversion=V100R001C25SPC001B212
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=x86_64
os=linux
path=/usr/local/Ascend/cann-8.5.0/x86_64-linux
- Query the AI framework version.
pip list | grep "torch "
pip list | grep torch-npu
pip list | grep "mindspore "
Command output:
torch 1.11.0
torch-npu 2.1.0.post8.dev20241009
mindspore 2.3.0
- Query the firmware version details.
/usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version
Command output:
{
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(0).
{"device_id":0, "component":nve, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(3).
{"device_id":0, "component":uefi, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(8).
{"device_id":0, "component":imu, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(9).
{"device_id":0, "component":imp, "version":7.1.0.7.220}
…
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(0).
{"device_id":7, "component":nve, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(3).
{"device_id":7, "component":uefi, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(8).
{"device_id":7, "component":imu, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(9).
{"device_id":7, "component":imp, "version":7.1.0.7.220}
}
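Because only one of the AArch64 and x86_64 CANN paths exists on a given host, a script that records the version files may need to skip missing ones. The following sketch shows one way to do that: the function name collect_versions is an assumption, the timestamp format is taken from the example above, and silently skipping unreadable files is a design choice of this sketch rather than documented tool behavior.

```shell
#!/bin/bash
# Sketch only: collect_versions is an illustrative name. Unreadable or
# missing version files are skipped (an assumption of this sketch).

# Append a timestamp, then "cat <file>" plus the file content for each
# readable version file, with a blank line after every result.
collect_versions() {
    local save_file="$1"; shift
    local f
    {
        echo "Datetime: $(date "+%Y-%m-%d %H:%M:%S")"
        echo
        for f in "$@"; do
            [ -r "${f}" ] || continue
            echo "cat ${f}"
            cat "${f}"
            echo
        done
    } >> "${save_file}"
}
```

A call such as collect_versions npu_info_after.txt /usr/local/Ascend/driver/version.info /usr/local/Ascend/firmware/version.info, followed by the two CANN ascend_toolkit_install.info paths, would then record whichever of those files are present.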