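The fault diagnosis tool supports collecting NPU environment check files after training and inference tasks complete. Collect them in the following ways: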
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -net_health -g
net health status: Init
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link -g
link status: UP
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -optical -g | grep prese
present : present
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -tls -g | grep switch
dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -fec -g
fec mode: rs FEC mode
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -ip -g
The output is as follows:
ipaddr:10.xx.xx.10 netmask:255.255.255.0
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -stat -g
The output is as follows:
packet statistics:
mac_tx_mac_pause_num:0
mac_rx_mac_pause_num:0
mac_tx_pfc_pkt_num:0
...
roce_qp_status_err_num:0
nic_tx_all_pkg_num:122404
nic_tx_all_oct_num:16921741
nic_rx_all_pkg_num:6414803
nic_rx_all_oct_num:482237805
/usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link_stat -g
The output is as follows:
[device 0]current time : Wed Jun 7 10:08:28 2023
[device 0]link up count : 2
[device 0]link change records :
[device 0] Tue Jun 6 16:32:12 2023 LINK UP
[device 0] Tue Jun 6 16:32:10 2023 LINK DOWN
[device 0] Tue Jun 6 16:31:55 2023 LINK UP
An example of the saved file follows. It shows device 0 only; collect the information for all devices.
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -net_health -g
net health status: Init

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -link -g
link status: UP

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -optical -g | grep prese
present : present

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -tls -g | grep switch
dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -fec -g
fec mode: rs FEC mode

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
ipaddr:10.xx.xx.10
netmask:255.255.255.0

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
packet statistics:
mac_tx_mac_pause_num:0
mac_rx_mac_pause_num:0
mac_tx_pfc_pkt_num:0
...
roce_qp_status_err_num:0
nic_tx_all_pkg_num:122404
nic_tx_all_oct_num:16921741
nic_rx_all_pkg_num:6414803
nic_rx_all_oct_num:482237805

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -link_stat -g
[device 0]current time : Wed Jun 7 10:08:28 2023
[device 0]link up count : 2
[device 0]link change records :
[device 0] Tue Jun 6 16:32:12 2023 LINK UP
[device 0] Tue Jun 6 16:32:10 2023 LINK DOWN
[device 0] Tue Jun 6 16:31:55 2023 LINK UP
/usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
XXXX

/usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
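The hccn_tool queries listed above can be scripted so that every device is covered. The following is a minimal sketch under stated assumptions, not a script shipped with the tool: the function name `collect_hccn_info` and the `HCCN_TOOL` override variable are hypothetical, the device IDs are whatever the caller passes in, and the layout (command line, its output, then a blank line) is inferred from the sample file.

```shell
#!/usr/bin/env bash
# Sketch: append hccn_tool check results for each device to one file.
# HCCN_TOOL and collect_hccn_info are illustrative names (assumptions).
HCCN_TOOL="${HCCN_TOOL:-/usr/local/Ascend/driver/tools/hccn_tool}"

# Usage: collect_hccn_info <save_file> <device_id>...
collect_hccn_info() {
    local save_file device_id suffix cmd
    save_file="$1"
    shift
    for device_id in "$@"; do
        # Query options copied from the commands in this section.
        for suffix in \
            "-net_health -g" \
            "-link -g" \
            "-optical -g | grep prese" \
            "-tls -g | grep switch" \
            "-fec -g" \
            "-ip -g" \
            "-stat -g" \
            "-link_stat -g"
        do
            cmd="${HCCN_TOOL} -i ${device_id} ${suffix}"
            {
                printf '%s\n' "${cmd}"   # the command line itself
                eval "${cmd}"            # its output (eval keeps the pipe to grep)
                echo                     # blank line separating entries
            } >> "${save_file}" 2>&1     # error messages are saved as well
        done
    done
}
```

For example, `collect_hccn_info hccn_check.txt 0 1 2 3 4 5 6 7` would cover eight devices; the file name and the device count are illustrative only.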
/usr/local/bin/npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1                 Version: 24.1.rc1                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
...
+===========================+===============+====================================================+
| 7     xxx                 | OK            | 67.0        44                0    / 0             |
| 0                         | 0000:3D:00.0  | 0           2505 / 15567      0    / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
...
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
/usr/local/bin/npu-smi info -i ${device_id} -t ecc
NPU ID : 1
Chip Count : 1
/usr/local/bin/npu-smi info -i ${device_id} -t board
NPU ID : 0
Software Version : 23.0.5
Firmware Version : 7.1.0.7.220
Compatibility : OK
Board ID : 0x02
PCB ID : A
BOM ID : 1
PCIe Bus Info : 0000:61:00.0
Slot ID : 0
Class ID : NA
PCI Vendor ID : 0x19E5
PCI Device ID : 0xD801
Subsystem Vendor ID : 0x0200
Subsystem Device ID : 0x0100
Chip Count : 1
/usr/local/bin/npu-smi info -i ${device_id} -c 0 -t board
NPU ID : 0
Chip ID : 0
Chip Type : Ascend
Chip Name : xxx
Chip Version : V1
Board ID : 0x02
PCB ID : NA
BOM ID : 1
VDie ID : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
NDie ID : 27216594 20401010 4E10C8D4 14CC040A A4102003
Chip Position ID : 0
PCIe Bus Info : 0000:61:00.0
Firmware Version : 7.1.0.7.220
/usr/local/bin/npu-smi info -i ${device_id} -t usages
NPU ID : 0
Chip Count : 1
DDR Capacity(MB) : 13553
DDR Usage Rate(%) : 6
DDR Hugepages Total(page) : 0
DDR Hugepages Usage Rate(%) : 0
HBM Capacity(MB) : 32768
HBM Usage Rate(%) : 0
Aicore Usage Rate(%) : 0
Aicpu Usage Rate(%) : 0
Ctrlcpu Usage Rate(%) : 0
DDR Bandwidth Usage Rate(%) : 0
HBM Bandwidth Usage Rate(%) : 0
Chip ID : 0
/usr/local/bin/npu-smi info -i ${device_id} -c 0 -t health
Health Status : OK
Error Code : NA
Error Information : NA
/usr/local/bin/npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.5                   Version: 23.0.5                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     xxx                 | OK            | 73.1        37                0    / 0             |
| 0                         | 0000:61:00.0  | 0           920  / 13553      0    / 32768         |
+===========================+===============+====================================================+
...
+===========================+===============+====================================================+
| 7     xxx                 | OK            | 67.0        38                0    / 0             |
| 0                         | 0000:3D:00.0  | 0           2346 / 15567      0    / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
...
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+

/usr/local/bin/npu-smi info -i 0 -c 0 -t health
Health Status : OK
Error Code : NA
Error Information : NA

/usr/local/bin/npu-smi info -i 0 -t ecc
NPU ID : 0
Chip Count : 1
DDR Single Bit Error Count : 0
DDR Double Bit Error Count : 0
DDR Single Bit Aggregate Total Err Cnt : 0
DDR Double Bit Aggregate Total Err Cnt : 0
DDR Single Bit Isolated Pages Count : 0
DDR Double Bit Isolated Pages Count : 0
HBM Single Bit Error Count : 0
HBM Double Bit Error Count : 0
HBM Single Bit Aggregate Total Err Cnt : 0
HBM Double Bit Aggregate Total Err Cnt : 0
HBM Single Bit Isolated Pages Count : 0
HBM Double Bit Isolated Pages Count : 0
Chip ID : 0

/usr/local/bin/npu-smi info -i 0 -t board
NPU ID : 0
Software Version : 23.0.5
Firmware Version : 7.1.0.7.220
Compatibility : OK
Board ID : 0x02
PCB ID : A
BOM ID : 1
PCIe Bus Info : 0000:61:00.0
Slot ID : 0
Class ID : NA
PCI Vendor ID : 0x19E5
PCI Device ID : 0xD801
Subsystem Vendor ID : 0x0200
Subsystem Device ID : 0x0100
Chip Count : 1

/usr/local/bin/npu-smi info -i 0 -c 0 -t board
NPU ID : 0
Chip ID : 0
Chip Type : Ascend
Chip Name : xxx
Chip Version : V1
Board ID : 0x02
PCB ID : NA
BOM ID : 1
VDie ID : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
NDie ID : 27216594 20401010 4E10C8D4 14CC040A A4102003
Chip Position ID : 0
PCIe Bus Info : 0000:61:00.0
Firmware Version : 7.1.0.7.220

/usr/local/bin/npu-smi info -i 0 -t usages
NPU ID : 0
Chip Count : 1
DDR Capacity(MB) : 13553
DDR Usage Rate(%) : 6
DDR Hugepages Total(page) : 0
DDR Hugepages Usage Rate(%) : 0
HBM Capacity(MB) : 32768
HBM Usage Rate(%) : 0
Aicore Usage Rate(%) : 0
Aicpu Usage Rate(%) : 0
Ctrlcpu Usage Rate(%) : 0
DDR Bandwidth Usage Rate(%) : 0
HBM Bandwidth Usage Rate(%) : 0
Chip ID : 0

/usr/local/bin/npu-smi info -i 0 -c 0 -t health
Health Status : OK
Error Code : NA
Error Information : NA

...
/usr/local/bin/npu-smi info -i 0 -c 0 -t health
XXXX

/usr/local/bin/npu-smi info -i 1 -c 0 -t health
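The npu-smi queries can be collected the same way. This is a rough sketch rather than the tool's own procedure: `record_cmd`, `collect_npu_smi_info`, and the `NPU_SMI` override variable are hypothetical names, chip ID 0 is taken from the `-c 0` used in the commands in this section, and the order of the per-device queries is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: append npu-smi check results for each device to one file.
# NPU_SMI, record_cmd and collect_npu_smi_info are illustrative names.
NPU_SMI="${NPU_SMI:-/usr/local/bin/npu-smi}"

# Append "command line, output, blank line" for one command.
# Usage: record_cmd <save_file> <command> [args...]
record_cmd() {
    local save_file
    save_file="$1"
    shift
    {
        printf '%s\n' "$*"   # the command line itself
        "$@"                 # its output
        echo                 # blank line separating entries
    } >> "${save_file}" 2>&1
}

# Usage: collect_npu_smi_info <save_file> <device_id>...
collect_npu_smi_info() {
    local save_file device_id
    save_file="$1"
    shift
    # Overview of all devices, collected once.
    record_cmd "${save_file}" "${NPU_SMI}" info
    for device_id in "$@"; do
        record_cmd "${save_file}" "${NPU_SMI}" info -i "${device_id}" -c 0 -t health
        record_cmd "${save_file}" "${NPU_SMI}" info -i "${device_id}" -t ecc
        record_cmd "${save_file}" "${NPU_SMI}" info -i "${device_id}" -t board
        record_cmd "${save_file}" "${NPU_SMI}" info -i "${device_id}" -c 0 -t board
        record_cmd "${save_file}" "${NPU_SMI}" info -i "${device_id}" -t usages
    done
}
```

Passing the arguments as separate words avoids `eval`, which is possible here because none of these commands needs a pipe.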
datetime=$(date "+%Y-%m-%d %H:%M:%S")
echo "Datetime: $datetime">>${save_file}
echo -e "\n">>${save_file}
Datetime: 2024-06-26 01:13:36
cat /usr/local/Ascend/driver/version.info
Version=24.1.rc1
ascendhal_version=7.35.19
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C15SPC006B220
compatible_version=[V100R001C30],[V100R001C13],[V100R001C15],[V100R001C17]
compatible_version_fw=[7.0.0,7.2.99]
cat /usr/local/Ascend/firmware/version.info
The output is as follows:
Version=7.1.0.11.220
firmware_version=1.0
package_version=23.0.7
compatible_version_drv=[23.0.rc3,23.0.rc3.],[23.0.0,23.0.0.]
cat /usr/local/Ascend/nnae/latest/ascend_nnae_install.info
The output is as follows:
package_name=Ascend-cann-nnae
version=8.0.RC3
innerversion=V100R001C19SPC001B137
compatible_version=[V100R001C13,V100R001C19],[V100R001C30]
arch=x86_64
os=linux
path=/usr/local/Ascend/nnae/8.0.RC3
cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info
The output is as follows:
package_name=Ascend-cann-toolkit
version=7.0.T10
innerversion=V100R001C13B222
compatible_version=[V100R001C29],[V100R001C30],[V100R001C13],[V100R003C10],[V100R003C11]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/7.0.T10/aarch64-linux
cat /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/ascend_toolkit_install.info
The output is as follows:
package_name=Ascend-cann-toolkit
version=8.0.0
innerversion=V100R001C20B053
compatible_version=[V100R001C15],[V100R001C17],[V100R001C18],[V100R001C19],[V100R001C20]
arch=x86_64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.0.0/x86_64-linux
pip list | grep "torch "
pip list | grep torch-npu
pip list | grep "mindspore "
The output is as follows:
torch 1.11.0
torch-npu 2.1.0.post8.dev20241009
mindspore 2.3.0
/usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version
{
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(0).
{"device_id":0, "component":nve, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(3).
{"device_id":0, "component":uefi, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(8).
{"device_id":0, "component":imu, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(9).
{"device_id":0, "component":imp, "version":7.1.0.7.220}
…
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(0).
{"device_id":7, "component":nve, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(3).
{"device_id":7, "component":uefi, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(8).
{"device_id":7, "component":imu, "version":7.1.0.7.220}
Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(9).
{"device_id":7, "component":imp, "version":7.1.0.7.220}
}
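The version files shown in this section can be gathered in one pass. The sketch below is an illustration under assumptions, not the tool's own script: `append_version_files` and `version_info.txt` are hypothetical names, and silently skipping unreadable files is a design choice made here because a given host presumably has only some of the listed packages and only one of the aarch64/x86_64 toolkit paths.

```shell
#!/usr/bin/env bash
# Sketch: append the contents of version files to one save file.
# append_version_files is an illustrative name (assumption).

# Usage: append_version_files <save_file> <version_file>...
append_version_files() {
    local save_file f
    save_file="$1"
    shift
    for f in "$@"; do
        # Skip files that are absent or unreadable on this host.
        if [ -r "${f}" ]; then
            {
                printf 'cat %s\n' "${f}"   # the command line itself
                cat "${f}"                 # the file contents
                echo                       # blank line separating entries
            } >> "${save_file}"
        fi
    done
}

# Example call with the paths from this section (save file name is hypothetical):
# append_version_files version_info.txt \
#     /usr/local/Ascend/driver/version.info \
#     /usr/local/Ascend/firmware/version.info \
#     /usr/local/Ascend/nnae/latest/ascend_nnae_install.info \
#     /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info \
#     /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/ascend_toolkit_install.info
```

The `pip list` and `upgrade-tool` queries are commands rather than files, so they would still need to be run and appended separately.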