NPU Environment Check File Before Training or Inference

File Description

  • Before a training or inference job starts, use hccn_tool or an automation script to query and record the IP address, mask, sent and received packet statistics, and historical link statistics of each NPU network port. Also use npu-smi or a script to query the processor health information.
  • Naming constraint: npu_info_before.txt
  • Constraints on the storage path:

Collection Mode Description

MindCluster Ascend FaultDiag can collect logs before training or inference in either of the following ways:

  • Script-based collection: Use npu_info_collect.sh to collect the NPU environment check file before training or inference. For details, see Log Collection Scripts.
  • CLI-based collection: Use hccn_tool, npu-smi, and other commands to query the environment information of each NPU before training or inference, and save each query command and its result to the npu_info_before.txt file.

CLI-based Collection

The involved commands and examples are as follows:
  • Query the network health status.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -net_health -g
    Command output:
    net health status: Init
  • Query the RoCE physical link connection status.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link -g
    Command output:
    link status: UP
  • Query information about the RoCE network optical module.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -optical -g
    Command output:
    optical info:
    present              : not present
    ...
    Tx Power             : 4.4035 mW
    Rx Power             : 1.0189 mW
    Vcc High Thres       : 3465.00 mV
    Vcc Low Thres        : 3135.00 mV
    Temp High Thres      : 70 C
    Temp Low Thres       : 0 C
    TxPower High Thres   : 3.5481 mW
    TxPower Low Thres    : 0.2818 mW
    RxPower High Thres   : 3.5481 mW
    RxPower Low Thres    : 0.1445 mW
    Tx Bias              : 7.9360 mA
    Tx Los Flag          : 0x0
    Rx Los Flag          : 0xff
    Tx LoL Flag          : 0x0
    Rx LoL Flag          : 0xff
    ...
  • Query the TLS switch configuration.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -tls -g | grep switch
    Command output:
    dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
  • Query the FEC mode.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -fec -g
    Command output:
    fec mode: rs FEC mode
  • Query the IP address and mask.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -ip -g
    Command output:
    ipaddr:10.xx.xx.10
    netmask:255.255.255.0
  • Query statistics about sent and received packets.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -stat -g
    Command output:
    packet statistics:
    mac_tx_mac_pause_num:0
    mac_rx_mac_pause_num:0
    mac_tx_pfc_pkt_num:0
    ...
    roce_qp_status_err_num:0
    nic_tx_all_pkg_num:122404
    nic_tx_all_oct_num:16921741
    nic_rx_all_pkg_num:6414803
    nic_rx_all_oct_num:482237805
  • Query the historical link statistics of the network port.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link_stat -g
    Command output:
    [device 0]current time        : Wed Jun  7 10:08:28 2023
    [device 0]link up count       : 2
    [device 0]link change records :
    [device 0]    Tue Jun  6 16:32:12 2023    LINK UP
    [device 0]    Tue Jun  6 16:32:10 2023    LINK DOWN
    [device 0]    Tue Jun  6 16:31:55 2023    LINK UP

    The following example shows the file content for device 0. Collect the information for all devices.

    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -net_health -g
    net health status: Init
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -link -g
    link status: UP
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -optical -g
    optical info:
    present              : not present
    ...
    Tx Power             : 4.4035 mW
    Rx Power             : 1.0189 mW
    Vcc High Thres       : 3465.00 mV
    Vcc Low Thres        : 3135.00 mV
    Temp High Thres      : 70 C
    Temp Low Thres       : 0 C
    TxPower High Thres   : 3.5481 mW
    TxPower Low Thres    : 0.2818 mW
    RxPower High Thres   : 3.5481 mW
    RxPower Low Thres    : 0.1445 mW
    Tx Bias              : 7.9360 mA
    Tx Los Flag          : 0x0
    Rx Los Flag          : 0xff
    Tx LoL Flag          : 0x0
    Rx LoL Flag          : 0xff
    ...
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -tls -g | grep switch
    dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -fec -g
    fec mode: rs FEC mode
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
    ipaddr:10.xx.xx.10
    netmask:255.255.255.0
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
    packet statistics:
    mac_tx_mac_pause_num:0
    mac_rx_mac_pause_num:0
    mac_tx_pfc_pkt_num:0
    ...
    roce_qp_status_err_num:0
    nic_tx_all_pkg_num:122404
    nic_tx_all_oct_num:16921741
    nic_rx_all_pkg_num:6414803
    nic_rx_all_oct_num:482237805
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -link_stat -g
    [device 0]current time        : Wed Jun  7 10:08:28 2023
    [device 0]link up count       : 2
    [device 0]link change records :
    [device 0]    Tue Jun  6 16:32:12 2023    LINK UP
    [device 0]    Tue Jun  6 16:32:10 2023    LINK DOWN
    [device 0]    Tue Jun  6 16:31:55 2023    LINK UP
    Separate the results of consecutive collection commands with one blank line. Example:
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
    XXXX
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
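The hccn_tool steps above lend themselves to a small loop. The following is a minimal sketch, not the npu_info_collect.sh script referenced earlier: the function names record_cmd and collect_hccn are hypothetical, and the number of devices is assumed to be passed in by the caller. It appends each command line, its output, and one blank separator line to the output file. The -tls query is left out because its pipe to grep would need separate handling.

```shell
# Hypothetical helper: append the command line, its output, and one blank
# separator line to the file given as the first argument.
record_cmd() {
    local save_file="$1"
    shift
    {
        echo "$*"      # the command as it was run
        "$@" 2>&1      # its output, including error messages
        echo           # blank line that separates results
    } >> "${save_file}"
}

# Hypothetical wrapper: run the hccn_tool queries shown above for device IDs
# 0 .. count-1. The caller supplies the device count (an assumption here).
collect_hccn() {
    local save_file="$1"
    local count="$2"
    local tool=/usr/local/Ascend/driver/tools/hccn_tool
    local id=0
    local opt
    while [ "${id}" -lt "${count}" ]; do
        for opt in -net_health -link -optical -fec -ip -stat -link_stat; do
            record_cmd "${save_file}" "${tool}" -i "${id}" "${opt}" -g
        done
        id=$((id + 1))
    done
}
```

For example, collect_hccn npu_info_before.txt 8 would cover device IDs 0 to 7. Adjust the count to the actual number of NPUs on the server.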
  • Use npu-smi to query the processor health information before training or inference and save the query command and result to the npu_info_before.txt file. The involved commands and examples are as follows:
    • Query the basic information about the device.
      /usr/local/bin/npu-smi info
      Command output:
      +------------------------------------------------------------------------------------------------+
      | npu-smi 24.1.rc1                 Version: 24.1.rc1                                             |
      +---------------------------+---------------+----------------------------------------------------+
      | NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
      | Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
      +===========================+===============+====================================================+
      ...
      +===========================+===============+====================================================+
      | 7     xxx                 | OK            | 67.0        44                0    / 0             |
      | 0                         | 0000:3D:00.0  | 0           2505 / 15567      0    / 32768         |
      +===========================+===============+====================================================+
      +---------------------------+---------------+----------------------------------------------------+
      | NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
      +===========================+===============+====================================================+
      | No running processes found in NPU 0                                                            |
      +===========================+===============+====================================================+
      ...
      | No running processes found in NPU 7                                                            |
      +===========================+===============+====================================================+
    • Query the ECC error counts of the DDR and high-bandwidth memory (HBM).
      /usr/local/bin/npu-smi info -i ${device_id} -t ecc
      Command output:
      NPU ID                                   : 1
      Chip Count                               : 1
      
      DDR Single Bit Error Count               : 0
      DDR Double Bit Error Count               : 0
      DDR Single Bit Aggregate Total Err Cnt   : 0
      DDR Double Bit Aggregate Total Err Cnt   : 0
      DDR Single Bit Isolated Pages Count      : 0
      DDR Double Bit Isolated Pages Count      : 0
      HBM Single Bit Error Count               : 0
      HBM Double Bit Error Count               : 0
      HBM Single Bit Aggregate Total Err Cnt   : 0
      HBM Double Bit Aggregate Total Err Cnt   : 0
      HBM Single Bit Isolated Pages Count      : 0
      HBM Double Bit Isolated Pages Count      : 0
      Chip ID                                  : 0
    • Query the basic information about the hardware.
      /usr/local/bin/npu-smi info -i ${device_id} -t board
      Command output:
      NPU ID                         : 0
      Software Version               : 23.0.5
      Firmware Version               : 7.1.0.7.220
      Compatibility                  : OK
      Board ID                       : 0x02
      PCB ID                         : A
      BOM ID                         : 1
      PCIe Bus Info                  : 0000:61:00.0
      Slot ID                        : 0
      Class ID                       : NA
      PCI Vendor ID                  : 0x19e5
      PCI Device ID                  : 0xD801
      Subsystem Vendor ID            : 0x0200
      Subsystem Device ID            : 0x0100
      Chip Count                     : 1
    • Query the basic hardware information and the name of the specified device.
      /usr/local/bin/npu-smi info -i ${device_id} -c 0 -t board
      Command output:
      NPU ID                         : 0
      Chip ID                        : 0
      Chip Type                      : Ascend
      Chip Name                      : xxx
      Chip Version                   : V1
      Board ID                       : 0x02
      PCB ID                         : NA
      BOM ID                         : 1
      VDie ID                        : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
      NDie ID                        : 27216594 20401010 4E10C8D4 14CC040A A4102003
      Chip Position ID               : 0
      PCIe Bus Info                  : 0000:61:00.0
      Firmware Version               : 7.1.0.7.220
    • Query the memory usage.
      /usr/local/bin/npu-smi info -i ${device_id} -t usages
      Command output:
      NPU ID                         : 0
      Chip Count                     : 1
      
      DDR Capacity(MB)               : 13553
      DDR Usage Rate(%)              : 6
      DDR Hugepages Total(page)      : 0
      DDR Hugepages Usage Rate(%)    : 0
      HBM Capacity(MB)               : 32768
      HBM Usage Rate(%)              : 0
      Aicore Usage Rate(%)           : 0
      Aicpu Usage Rate(%)            : 0
      Ctrlcpu Usage Rate(%)          : 0
      DDR Bandwidth Usage Rate(%)    : 0
      HBM Bandwidth Usage Rate(%)    : 0
      Chip ID                        : 0
    • Query the processor health information.
      /usr/local/bin/npu-smi info -i ${device_id} -c 0 -t health
      Command output:
       Health Status                  : OK
       Error Code                     : NA
       Error Information              : NA
      The following example shows the file content. Collect the information for all devices.
      /usr/local/bin/npu-smi info
      +------------------------------------------------------------------------------------------------+
      | npu-smi 23.0.5                   Version: 23.0.5                                               |
      +---------------------------+---------------+----------------------------------------------------+
      | NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
      | Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
      +===========================+===============+====================================================+
      | 0     xxx                 | OK            | 73.1        37                0    / 0             |
      | 0                         | 0000:61:00.0  | 0           920  / 13553      0    / 32768         |
      +===========================+===============+====================================================+
      ...
      +===========================+===============+====================================================+
      | 7     xxx                 | OK            | 67.0        38                0    / 0             |
      | 0                         | 0000:3D:00.0  | 0           2346 / 15567      0    / 32768         |
      +===========================+===============+====================================================+
      +---------------------------+---------------+----------------------------------------------------+
      | NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
      +===========================+===============+====================================================+
      | No running processes found in NPU 0                                                            |
      +===========================+===============+====================================================+
      ...
      +===========================+===============+====================================================+
      | No running processes found in NPU 7                                                            |
      +===========================+===============+====================================================+
      
      /usr/local/bin/npu-smi info -i 0 -c 0 -t health
      Health Status                  : OK
      Error Code                     : NA
      Error Information              : NA
      
      /usr/local/bin/npu-smi info -i 0 -t ecc
      NPU ID                                   : 0
      Chip Count                               : 1
      
      DDR Single Bit Error Count               : 0
      DDR Double Bit Error Count               : 0
      DDR Single Bit Aggregate Total Err Cnt   : 0
      DDR Double Bit Aggregate Total Err Cnt   : 0
      DDR Single Bit Isolated Pages Count      : 0
      DDR Double Bit Isolated Pages Count      : 0
      HBM Single Bit Error Count               : 0
      HBM Double Bit Error Count               : 0
      HBM Single Bit Aggregate Total Err Cnt   : 0
      HBM Double Bit Aggregate Total Err Cnt   : 0
      HBM Single Bit Isolated Pages Count      : 0
      HBM Double Bit Isolated Pages Count      : 0
      Chip ID                                  : 0
      
      /usr/local/bin/npu-smi info -i 0 -t board
      NPU ID                         : 0
      Software Version               : 23.0.5
      Firmware Version               : 7.1.0.7.220
      Compatibility                  : OK
      Board ID                       : 0x02
      PCB ID                         : A
      BOM ID                         : 1
      PCIe Bus Info                  : 0000:61:00.0
      Slot ID                        : 0
      Class ID                       : NA
      PCI Vendor ID                  : 0x19e5
      PCI Device ID                  : 0xD801
      Subsystem Vendor ID            : 0x0200
      Subsystem Device ID            : 0x0100
      Chip Count                     : 1
      
      /usr/local/bin/npu-smi info -i 0 -c 0 -t board
      NPU ID                         : 0
      Chip ID                        : 0
      Chip Type                      : Ascend
      Chip Name                      : xxx
      Chip Version                   : V1
      Board ID                       : 0x02
      PCB ID                         : NA
      BOM ID                         : 1
      VDie ID                        : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
      NDie ID                        : 27216594 20401010 4E10C8D4 14CC040A A4102003
      Chip Position ID               : 0
      PCIe Bus Info                  : 0000:61:00.0
      Firmware Version               : 7.1.0.7.220
      
      /usr/local/bin/npu-smi info -i 0 -t usages
      NPU ID                         : 0
      Chip Count                     : 1
      
      DDR Capacity(MB)               : 13553
      DDR Usage Rate(%)              : 6
      DDR Hugepages Total(page)      : 0
      DDR Hugepages Usage Rate(%)    : 0
      HBM Capacity(MB)               : 32768
      HBM Usage Rate(%)              : 0
      Aicore Usage Rate(%)           : 0
      Aicpu Usage Rate(%)            : 0
      Ctrlcpu Usage Rate(%)          : 0
      DDR Bandwidth Usage Rate(%)    : 0
      HBM Bandwidth Usage Rate(%)    : 0
      Chip ID                        : 0
      
      /usr/local/bin/npu-smi info -i 0 -c 0 -t health
       Health Status                  : OK
       Error Code                     : NA
       Error Information              : NA
      ...
    Separate the results of consecutive collection commands with one blank line. Example:
    /usr/local/bin/npu-smi info -i 0 -c 0 -t health
    XXXX
    
    /usr/local/bin/npu-smi info -i 1 -c 0 -t health
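Once the npu-smi results are available, a quick scan can surface devices that need attention before the job starts. The following is a minimal sketch that relies only on the field names visible in the sample output above; the function names health_status and ecc_nonzero are hypothetical, and how to act on a non-OK status or a nonzero count is left to the operator.

```shell
# Hypothetical helper: read "npu-smi info ... -t health" output on stdin and
# print the value of the "Health Status" field.
health_status() {
    awk -F':' '/Health Status/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)
        print $2
    }'
}

# Hypothetical helper: read "npu-smi info ... -t ecc" output on stdin and
# print every error-count line whose value is not 0.
ecc_nonzero() {
    awk -F':' '/Error Count|Err Cnt|Isolated Pages Count/ {
        value = $2
        gsub(/[ \t]/, "", value)
        if (value != "0") print
    }'
}
```

Both helpers read from standard input, so they can be fed directly from a command, for example /usr/local/bin/npu-smi info -i 0 -c 0 -t health | health_status, or from a section of a previously saved file.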
  • Use other commands to query the remaining environment information (system time and software versions) before training or inference, and save each query command and its result to the npu_info_before.txt file. The involved commands and examples are as follows:
    • Query the current system time.
      # save_file is the path of the output file (npu_info_before.txt).
      datetime=$(date "+%Y-%m-%d %H:%M:%S")
      echo "Datetime: $datetime" >> "${save_file}"
      echo -e "\n" >> "${save_file}"
      Command output:
      Datetime: 2024-06-26 01:13:36
    • Query the driver version.
      cat /usr/local/Ascend/driver/version.info
      Command output:
      Version=24.1.rc1
      ascendhal_version=7.35.19
      aicpu_version=1.0
      tdt_version=1.0
      log_version=1.0
      prof_version=2.0
      dvppkernels_version=1.1
      tsfw_version=1.0
      Innerversion=V100R001C15SPC006B220
      compatible_version=[V100R001C30],[V100R001C13],[V100R001C15],[V100R001C17]
      compatible_version_fw=[7.0.0,7.2.99]
    • Query the firmware version.
      cat /usr/local/Ascend/firmware/version.info
      Command output:
      Version=7.1.0.11.220
      firmware_version=1.0
      package_version=23.0.7
      compatible_version_drv=[23.0.rc3,23.0.rc3.],[23.0.0,23.0.0.]
    • Query the CANN version (AArch64).
      cat /usr/local/Ascend/cann/aarch64-linux/ascend_toolkit_install.info
      Command output:
      package_name=Ascend-cann-toolkit
      version=8.5.0
      innerversion=V100R001C25SPC001B212
      compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
      arch=aarch64
      os=linux
      path=/usr/local/Ascend/cann-8.5.0/aarch64-linux
    • Query the CANN version (x86_64).
      cat /usr/local/Ascend/cann/x86_64-linux/ascend_toolkit_install.info
      Command output:
      package_name=Ascend-cann-toolkit
      version=8.5.0
      innerversion=V100R001C25SPC001B212
      compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
      arch=x86_64
      os=linux
      path=/usr/local/Ascend/cann-8.5.0/x86_64-linux
    • Query the AI framework version.
      pip list | grep "torch "
      pip list | grep torch-npu
      pip list | grep "mindspore "
      Command output:
      torch              1.11.0
      torch-npu          2.1.0.post8.dev20241009
      mindspore          2.3.0
    • Query the firmware version details.
      /usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version
      Command output:
      {
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(0).
      {"device_id":0, "component":nve, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(3).
      {"device_id":0, "component":uefi, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(8).
      {"device_id":0, "component":imu, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(9).
      {"device_id":0, "component":imp, "version":7.1.0.7.220}
      ...
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(0).
      {"device_id":7, "component":nve, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(3).
      {"device_id":7, "component":uefi, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(8).
      {"device_id":7, "component":imu, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(9).
      {"device_id":7, "component":imp, "version":7.1.0.7.220}
      }
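When comparing environments, it can help to pull a single field out of the version files listed above. The following is a minimal sketch, assuming the KEY=VALUE layout shown in the sample outputs; the function name get_info_field is hypothetical. Keys are case-sensitive: the driver and firmware files use Version, while the CANN file uses version.

```shell
# Hypothetical helper: print the value of a key from a KEY=VALUE file.
#   $1: path of the file, for example /usr/local/Ascend/driver/version.info
#   $2: key name, matched case-sensitively at the start of the line
# The key is used as a plain pattern, so it must not contain characters that
# are special to sed (the keys in the samples above do not).
get_info_field() {
    sed -n "s/^$2=//p" "$1"
}
```

For example, get_info_field /usr/local/Ascend/driver/version.info Version prints only the driver version string, which is easier to compare across servers than the full file.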