The fault diagnosis tool supports collecting pre-training and pre-inference logs in the following ways:
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -net_health -g
net health status: Init
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link -g
link status: UP
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -optical -g
optical info:
present : not present
...
Tx Power : 4.4035 mW
Rx Power : 1.0189 mW
Vcc High Thres : 3465.00 mV
Vcc Low Thres : 3135.00 mV
Temp High Thres : 70 C
Temp Low Thres : 0 C
TxPower High Thres : 3.5481 mW
TxPower Low Thres : 0.2818 mW
RxPower High Thres : 3.5481 mW
RxPower Low Thres : 0.1445 mW
Tx Bias : 7.9360 mA
Tx Los Flag : 0x0
Rx Los Flag : 0xff
Tx LoL Flag : 0x0
Rx LoL Flag : 0xff
...
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -tls -g | grep switch
dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -fec -g
fec mode: rs FEC mode
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -ip -g
The command output is as follows:
ipaddr:10.xx.xx.10
netmask:255.255.255.0
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -stat -g
The command output is as follows:
packet statistics:
mac_tx_mac_pause_num:0
mac_rx_mac_pause_num:0
mac_tx_pfc_pkt_num:0
...
roce_qp_status_err_num:0
nic_tx_all_pkg_num:122404
nic_tx_all_oct_num:16921741
nic_rx_all_pkg_num:6414803
nic_rx_all_oct_num:482237805
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link_stat -g
The command output is as follows:
[device 0]current time : Wed Jun 7 10:08:28 2023
[device 0]link up count : 2
[device 0]link change records :
[device 0] Tue Jun 6 16:32:12 2023 LINK UP
[device 0] Tue Jun 6 16:32:10 2023 LINK DOWN
[device 0] Tue Jun 6 16:31:55 2023 LINK UP
An example of the stored file is shown below. The example covers device 0 only; collect the information for all devices.
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -net_health -g
net health status: Init
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -link -g
link status: UP
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -optical -g
optical info:
present : not present
...
Tx Power : 4.4035 mW
Rx Power : 1.0189 mW
Vcc High Thres : 3465.00 mV
Vcc Low Thres : 3135.00 mV
Temp High Thres : 70 C
Temp Low Thres : 0 C
TxPower High Thres : 3.5481 mW
TxPower Low Thres : 0.2818 mW
RxPower High Thres : 3.5481 mW
RxPower Low Thres : 0.1445 mW
Tx Bias : 7.9360 mA
Tx Los Flag : 0x0
Rx Los Flag : 0xff
Tx LoL Flag : 0x0
Rx LoL Flag : 0xff
...
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -tls -g | grep switch
dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -fec -g
fec mode: rs FEC mode
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
ipaddr:10.xx.xx.10
netmask:255.255.255.0
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
packet statistics:
mac_tx_mac_pause_num:0
mac_rx_mac_pause_num:0
mac_tx_pfc_pkt_num:0
...
roce_qp_status_err_num:0
nic_tx_all_pkg_num:122404
nic_tx_all_oct_num:16921741
nic_rx_all_pkg_num:6414803
nic_rx_all_oct_num:482237805
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -link_stat -g
[device 0]current time : Wed Jun 7 10:08:28 2023
[device 0]link up count : 2
[device 0]link change records :
[device 0] Tue Jun 6 16:32:12 2023 LINK UP
[device 0] Tue Jun 6 16:32:10 2023 LINK DOWN
[device 0] Tue Jun 6 16:31:55 2023 LINK UP
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
XXXX
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
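The stored-file example covers device 0 only, while the information of every device has to be collected. A minimal sketch of such a loop is given below. It writes each command line followed by its output, in the layout of the example. The function name `collect_hccn`, the `HCCN_TOOL` override variable, and passing the device count as an argument are illustrative assumptions and not part of the tool.

```shell
# Sketch only: HCCN_TOOL, collect_hccn and its arguments are hypothetical names.
HCCN_TOOL="${HCCN_TOOL:-/usr/local/Ascend/driver/tools/hccn_tool}"

# collect_hccn SAVE_FILE DEVICE_COUNT
# Appends every hccn_tool query listed above, for devices 0..DEVICE_COUNT-1.
collect_hccn() {
    ch_file="$1"
    ch_count="$2"
    ch_id=0
    while [ "${ch_id}" -lt "${ch_count}" ]; do
        for ch_opt in -net_health -link -optical -tls -fec -ip -stat -link_stat; do
            if [ "${ch_opt}" = "-tls" ]; then
                # The TLS query keeps only the line that contains "switch".
                printf '%s\n' "${HCCN_TOOL} -i ${ch_id} -tls -g | grep switch"
                "${HCCN_TOOL}" -i "${ch_id}" -tls -g 2>&1 | grep switch || true
            else
                printf '%s\n' "${HCCN_TOOL} -i ${ch_id} ${ch_opt} -g"
                # Keep collecting even if one query fails on this device.
                "${HCCN_TOOL}" -i "${ch_id}" "${ch_opt}" -g 2>&1 || true
            fi
        done
        ch_id=$((ch_id + 1))
    done >> "${ch_file}"
}
```

For a server with eight devices, the function could be called as `collect_hccn /path/to/save_file 8`; the path and the device count depend on the environment.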
/usr/local/bin/npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1                 Version: 24.1.rc1                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
...
+===========================+===============+====================================================+
| 7     xxx                 | OK            | 67.0        44                0 / 0                |
| 0                         | 0000:3D:00.0  | 0           2505 / 15567      0 / 32768            |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
...
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
/usr/local/bin/npu-smi info -i ${device_id} -t ecc
NPU ID : 1
Chip Count : 1
DDR Single Bit Error Count : 0
DDR Double Bit Error Count : 0
DDR Single Bit Aggregate Total Err Cnt : 0
DDR Double Bit Aggregate Total Err Cnt : 0
DDR Single Bit Isolated Pages Count : 0
DDR Double Bit Isolated Pages Count : 0
HBM Single Bit Error Count : 0
HBM Double Bit Error Count : 0
HBM Single Bit Aggregate Total Err Cnt : 0
HBM Double Bit Aggregate Total Err Cnt : 0
HBM Single Bit Isolated Pages Count : 0
HBM Double Bit Isolated Pages Count : 0
Chip ID : 0
/usr/local/bin/npu-smi info -i ${device_id} -t board
NPU ID : 0
Software Version : 23.0.5
Firmware Version : 7.1.0.7.220
Compatibility : OK
Board ID : 0x02
PCB ID : A
BOM ID : 1
PCIe Bus Info : 0000:61:00.0
Slot ID : 0
Class ID : NA
PCI Vendor ID : 0x19E5
PCI Device ID : 0xD801
Subsystem Vendor ID : 0x0200
Subsystem Device ID : 0x0100
Chip Count : 1
/usr/local/bin/npu-smi info -i ${device_id} -c 0 -t board
NPU ID : 0
Chip ID : 0
Chip Type : Ascend
Chip Name : xxx
Chip Version : V1
Board ID : 0x02
PCB ID : NA
BOM ID : 1
VDie ID : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
NDie ID : 27216594 20401010 4E10C8D4 14CC040A A4102003
Chip Position ID : 0
PCIe Bus Info : 0000:61:00.0
Firmware Version : 7.1.0.7.220
/usr/local/bin/npu-smi info -i ${device_id} -t usages
NPU ID : 0
Chip Count : 1
DDR Capacity(MB) : 13553
DDR Usage Rate(%) : 6
DDR Hugepages Total(page) : 0
DDR Hugepages Usage Rate(%) : 0
HBM Capacity(MB) : 32768
HBM Usage Rate(%) : 0
Aicore Usage Rate(%) : 0
Aicpu Usage Rate(%) : 0
Ctrlcpu Usage Rate(%) : 0
DDR Bandwidth Usage Rate(%) : 0
HBM Bandwidth Usage Rate(%) : 0
Chip ID : 0
/usr/local/bin/npu-smi info -i ${device_id} -c 0 -t health
Health Status : OK
Error Code : NA
Error Information : NA
/usr/local/bin/npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.5                   Version: 23.0.5                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     xxx                 | OK            | 73.1        37                0 / 0                |
| 0                         | 0000:61:00.0  | 0           920 / 13553       0 / 32768            |
+===========================+===============+====================================================+
...
+===========================+===============+====================================================+
| 7     xxx                 | OK            | 67.0        38                0 / 0                |
| 0                         | 0000:3D:00.0  | 0           2346 / 15567      0 / 32768            |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
...
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
/usr/local/bin/npu-smi info -i 0 -c 0 -t health
Health Status : OK
Error Code : NA
Error Information : NA
/usr/local/bin/npu-smi info -i 0 -t ecc
NPU ID : 0
Chip Count : 1
DDR Single Bit Error Count : 0
DDR Double Bit Error Count : 0
DDR Single Bit Aggregate Total Err Cnt : 0
DDR Double Bit Aggregate Total Err Cnt : 0
DDR Single Bit Isolated Pages Count : 0
DDR Double Bit Isolated Pages Count : 0
HBM Single Bit Error Count : 0
HBM Double Bit Error Count : 0
HBM Single Bit Aggregate Total Err Cnt : 0
HBM Double Bit Aggregate Total Err Cnt : 0
HBM Single Bit Isolated Pages Count : 0
HBM Double Bit Isolated Pages Count : 0
Chip ID : 0
/usr/local/bin/npu-smi info -i 0 -t board
NPU ID : 0
Software Version : 23.0.5
Firmware Version : 7.1.0.7.220
Compatibility : OK
Board ID : 0x02
PCB ID : A
BOM ID : 1
PCIe Bus Info : 0000:61:00.0
Slot ID : 0
Class ID : NA
PCI Vendor ID : 0x19E5
PCI Device ID : 0xD801
Subsystem Vendor ID : 0x0200
Subsystem Device ID : 0x0100
Chip Count : 1
/usr/local/bin/npu-smi info -i 0 -c 0 -t board
NPU ID : 0
Chip ID : 0
Chip Type : Ascend
Chip Name : xxx
Chip Version : V1
Board ID : 0x02
PCB ID : NA
BOM ID : 1
VDie ID : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
NDie ID : 27216594 20401010 4E10C8D4 14CC040A A4102003
Chip Position ID : 0
PCIe Bus Info : 0000:61:00.0
Firmware Version : 7.1.0.7.220
/usr/local/bin/npu-smi info -i 0 -t usages
NPU ID : 0
Chip Count : 1
DDR Capacity(MB) : 13553
DDR Usage Rate(%) : 6
DDR Hugepages Total(page) : 0
DDR Hugepages Usage Rate(%) : 0
HBM Capacity(MB) : 32768
HBM Usage Rate(%) : 0
Aicore Usage Rate(%) : 0
Aicpu Usage Rate(%) : 0
Ctrlcpu Usage Rate(%) : 0
DDR Bandwidth Usage Rate(%) : 0
HBM Bandwidth Usage Rate(%) : 0
Chip ID : 0
/usr/local/bin/npu-smi info -i 0 -c 0 -t health
Health Status : OK
Error Code : NA
Error Information : NA
...
/usr/local/bin/npu-smi info -i 0 -c 0 -t health
XXXX
/usr/local/bin/npu-smi info -i 1 -c 0 -t health
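The npu-smi queries can be gathered in the same command-then-output layout. The sketch below runs the summary query once and then the per-device queries in the order used by the stored-file example. The helper names `run_and_log` and `collect_npu_smi`, the `NPU_SMI` override variable, and the fixed chip ID 0 are assumptions made for illustration; devices with more than one chip would need an extra loop.

```shell
# Sketch only: NPU_SMI, run_and_log and collect_npu_smi are hypothetical names.
NPU_SMI="${NPU_SMI:-/usr/local/bin/npu-smi}"

# run_and_log SAVE_FILE CMD [ARGS...]
# Appends the command line and then its output to SAVE_FILE.
run_and_log() {
    ral_file="$1"
    shift
    {
        printf '%s\n' "$*"
        # A failing query must not stop the rest of the collection.
        "$@" 2>&1 || true
    } >> "${ral_file}"
}

# collect_npu_smi SAVE_FILE DEVICE_COUNT
collect_npu_smi() {
    cns_file="$1"
    cns_count="$2"
    run_and_log "${cns_file}" "${NPU_SMI}" info
    cns_id=0
    while [ "${cns_id}" -lt "${cns_count}" ]; do
        # Chip ID 0 is assumed here, as in the commands listed above.
        run_and_log "${cns_file}" "${NPU_SMI}" info -i "${cns_id}" -c 0 -t health
        run_and_log "${cns_file}" "${NPU_SMI}" info -i "${cns_id}" -t ecc
        run_and_log "${cns_file}" "${NPU_SMI}" info -i "${cns_id}" -t board
        run_and_log "${cns_file}" "${NPU_SMI}" info -i "${cns_id}" -c 0 -t board
        run_and_log "${cns_file}" "${NPU_SMI}" info -i "${cns_id}" -t usages
        cns_id=$((cns_id + 1))
    done
}
```

Keeping the command line directly above its output makes each block in the file attributable to one device and one query type.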
# Record the collection time; save_file is the path of the output file.
datetime=$(date "+%Y-%m-%d %H:%M:%S")
echo "Datetime: $datetime" >> "${save_file}"
echo -e "\n" >> "${save_file}"
Datetime: 2024-06-26 01:13:36
cat /usr/local/Ascend/driver/version.info
Version=24.1.rc1
ascendhal_version=7.35.19
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C15SPC006B220
compatible_version=[V100R001C30],[V100R001C13],[V100R001C15],[V100R001C17]
compatible_version_fw=[7.0.0,7.2.99]
cat /usr/local/Ascend/firmware/version.info
The command output is as follows:
Version=7.1.0.11.220
firmware_version=1.0
package_version=23.0.7
compatible_version_drv=[23.0.rc3,23.0.rc3.],[23.0.0,23.0.0.]
cat /usr/local/Ascend/nnae/latest/ascend_nnae_install.info
The command output is as follows:
package_name=Ascend-cann-nnae
version=8.0.RC3
innerversion=V100R001C19SPC001B137
compatible_version=[V100R001C13,V100R001C19],[V100R001C30]
arch=x86_64
os=linux
path=/usr/local/Ascend/nnae/8.0.RC3
cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info
The command output is as follows:
package_name=Ascend-cann-toolkit
version=7.0.T10
innerversion=V100R001C13B222
compatible_version=[V100R001C29],[V100R001C30],[V100R001C13],[V100R003C10],[V100R003C11]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/7.0.T10/aarch64-linux
cat /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/ascend_toolkit_install.info
The command output is as follows:
package_name=Ascend-cann-toolkit
version=8.0.0
innerversion=V100R001C20B053
compatible_version=[V100R001C15],[V100R001C17],[V100R001C18],[V100R001C19],[V100R001C20]
arch=x86_64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.0.0/x86_64-linux
pip list | grep "torch "
pip list | grep torch-npu
pip list | grep "mindspore "
The command output is as follows:
torch 1.11.0
torch-npu 2.1.0.post8.dev20241009
mindspore 2.3.0
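The version queries above can be combined into one pass. The following sketch reads each version file only if it exists, since a host normally has just one of the aarch64 and x86_64 toolkit directories, and runs `pip list` once before filtering it for the three frameworks. The names `collect_versions`, `ASCEND_HOME` and `PIP_CMD` are hypothetical, and writing the command line before its output is an assumption carried over from the stored-file examples of the other queries.

```shell
# Sketch only: ASCEND_HOME, PIP_CMD and collect_versions are hypothetical names.
ASCEND_HOME="${ASCEND_HOME:-/usr/local/Ascend}"
PIP_CMD="${PIP_CMD:-pip}"

# collect_versions SAVE_FILE
collect_versions() {
    cv_file="$1"
    {
        for cv_rel in \
            driver/version.info \
            firmware/version.info \
            nnae/latest/ascend_nnae_install.info \
            ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info \
            ascend-toolkit/latest/x86_64-linux/ascend_toolkit_install.info; do
            cv_path="${ASCEND_HOME}/${cv_rel}"
            # Skip packages that are not installed on this host.
            if [ -r "${cv_path}" ]; then
                printf '%s\n' "cat ${cv_path}"
                cat "${cv_path}"
            fi
        done
        # Run pip once, then filter the listing for each framework.
        cv_pip=$("${PIP_CMD}" list 2>/dev/null || true)
        for cv_pat in "torch " "torch-npu" "mindspore "; do
            printf '%s\n' "pip list | grep \"${cv_pat}\""
            printf '%s\n' "${cv_pip}" | grep "${cv_pat}" || true
        done
    } >> "${cv_file}"
}
```

The trailing space in the `torch ` and `mindspore ` patterns is kept from the commands above so that packages such as `torch-npu` are not matched twice.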
/usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version
{
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(0).
{"device_id":0, "component":nve, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(3).
{"device_id":0, "component":uefi, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(8).
{"device_id":0, "component":imu, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(9).
{"device_id":0, "component":imp, "version":7.1.0.7.220}
…
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(0).
{"device_id":7, "component":nve, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(3).
{"device_id":7, "component":uefi, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(8).
{"device_id":7, "component":imu, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(9).
{"device_id":7, "component":imp, "version":7.1.0.7.220}
}
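In the sample output every component on every device reports the same version. A quick way to see whether that holds for a collected log is to list the distinct values of the `"version"` field. The sketch below does this with standard text tools; the function names are hypothetical, and treating a single distinct version as the expected state is an assumption drawn from the sample, not a documented requirement.

```shell
# Sketch only: both function names are hypothetical.

# list_component_versions: read upgrade-tool output on stdin and print each
# distinct value of the "version" field once.
list_component_versions() {
    grep -o '"version":[^}]*' | sed 's/"version"://' | sort -u
}

# versions_consistent: succeed only when exactly one distinct version is found.
versions_consistent() {
    [ "$(list_component_versions | grep -c '')" -eq 1 ]
}
```

One possible use is `/usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version | list_component_versions`, which for the sample output above would print the single line `7.1.0.7.220`.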