NPU Environment Check File Before Training and Inference

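File description

  • Before starting a training or inference job, use hccn_tool (directly or through an automation script) to query and record each NPU network port's IP address, netmask, packet send/receive statistics, and historical link statistics. Before training starts, use npu-smi (directly or through a script) to query chip health information.
  • Naming constraint: npu_info_before.txt.
  • Storage path constraint: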

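Collection methods

The fault diagnosis tool supports collecting pre-training and pre-inference logs in the following ways:

  • Script collection. In the log collection script, use the npu_info_collect.sh script to collect the pre-training and pre-inference NPU environment check file.
  • Command collection. Before training or inference, use hccn_tool to query the NPU environment check information, and save both the query commands and their results to npu_info_before.txt.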

Command collection

The commands and examples are as follows:
  • Run the following command to query the network health status.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -net_health -g
    The output is as follows:
    net health status: Init
  • Run the following command to query the RoCE physical link status.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link -g
    The output is as follows:
    link status: UP
  • Run the following command to query the RoCE optical module information.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -optical -g | grep prese
    The output is as follows:
    present              : present
  • Run the following command to query the TLS switch configuration for interconnection.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -tls -g | grep switch
    The output is as follows:
    dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
  • Run the following command to query the FEC mode.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -fec -g
    The output is as follows:
    fec mode: rs FEC mode
  • Run the following command to query the IP address and netmask.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -ip -g

    The output is as follows:

    ipaddr:10.xx.xx.10
    netmask:255.255.255.0
  • Run the following command to query the packet send/receive statistics.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -stat -g

    The output is as follows:
    packet statistics:
    mac_tx_mac_pause_num:0
    mac_rx_mac_pause_num:0
    mac_tx_pfc_pkt_num:0
    ...
    roce_qp_status_err_num:0
    nic_tx_all_pkg_num:122404
    nic_tx_all_oct_num:16921741
    nic_rx_all_pkg_num:6414803
    nic_rx_all_oct_num:482237805
    
  • Run the following command to query the historical link statistics of the network port.
    /usr/local/Ascend/driver/tools/hccn_tool -i ${device_id} -link_stat -g

    The output is as follows:
    [device 0]current time        : Wed Jun  7 10:08:28 2023
    [device 0]link up count       : 2
    [device 0]link change records :
    [device 0]    Tue Jun  6 16:32:12 2023    LINK UP
    [device 0]    Tue Jun  6 16:32:10 2023    LINK DOWN
    [device 0]    Tue Jun  6 16:31:55 2023    LINK UP
    

    A file storage example follows. It shows card 0 only; collect the information for all cards.

    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -net_health -g
    net health status: Init
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -link -g
    link status: UP
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -optical -g | grep prese
    present              : present
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -tls -g | grep switch
    dev_id:0, tls switch[0](0:disable, 1:enable), tls preconfigured[1](0:non-preset, 1:preset), tls alarm time threshold[60]days
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -fec -g
    fec mode: rs FEC mode
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
    ipaddr:10.xx.xx.10
    netmask:255.255.255.0
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
    packet statistics:
    mac_tx_mac_pause_num:0
    mac_rx_mac_pause_num:0
    mac_tx_pfc_pkt_num:0
    ...
    roce_qp_status_err_num:0
    nic_tx_all_pkg_num:122404
    nic_tx_all_oct_num:16921741
    nic_rx_all_pkg_num:6414803
    nic_rx_all_oct_num:482237805
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -link_stat -g
    [device 0]current time        : Wed Jun  7 10:08:28 2023
    [device 0]link up count       : 2
    [device 0]link change records :
    [device 0]    Tue Jun  6 16:32:12 2023    LINK UP
    [device 0]    Tue Jun  6 16:32:10 2023    LINK DOWN
    [device 0]    Tue Jun  6 16:31:55 2023    LINK UP
    
    Separate the results of consecutive commands with one blank line. For example:
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -ip -g
    XXXX
    
    /usr/local/Ascend/driver/tools/hccn_tool -i 0 -stat -g
    
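The command-collection steps above can be scripted. The following is a minimal sketch under stated assumptions, not the npu_info_collect.sh script mentioned earlier: it appends each command line, its output, and one blank separator line to the output file, which is the layout described above. The names record_cmd, collect_hccn_for_device, collect_hccn_all, HCCN_TOOL, SAVE_FILE, and DEVICE_COUNT are illustrative, and the default of 8 devices is an assumption that must be adjusted to the actual number of NPUs.

```shell
# Sketch only: variable and function names are illustrative assumptions.
HCCN_TOOL="${HCCN_TOOL:-/usr/local/Ascend/driver/tools/hccn_tool}"
SAVE_FILE="${SAVE_FILE:-npu_info_before.txt}"
DEVICE_COUNT="${DEVICE_COUNT:-8}"   # assumed NPU count; adjust as needed

# Append the command line, its output, and one blank separator line.
# A failing query is recorded but does not stop the collection.
record_cmd() {
    local out_file="$1"
    local cmd="$2"
    {
        echo "$cmd"
        sh -c "$cmd" 2>&1 || true
        echo
    } >> "$out_file"
}

# Run the hccn_tool queries listed above for one device.
collect_hccn_for_device() {
    local dev="$1"
    local out_file="$2"
    record_cmd "$out_file" "$HCCN_TOOL -i $dev -net_health -g"
    record_cmd "$out_file" "$HCCN_TOOL -i $dev -link -g"
    record_cmd "$out_file" "$HCCN_TOOL -i $dev -optical -g | grep prese"
    record_cmd "$out_file" "$HCCN_TOOL -i $dev -tls -g | grep switch"
    record_cmd "$out_file" "$HCCN_TOOL -i $dev -fec -g"
    record_cmd "$out_file" "$HCCN_TOOL -i $dev -ip -g"
    record_cmd "$out_file" "$HCCN_TOOL -i $dev -stat -g"
    record_cmd "$out_file" "$HCCN_TOOL -i $dev -link_stat -g"
}

# Collect for device IDs 0 .. DEVICE_COUNT-1.
collect_hccn_all() {
    local out_file="${1:-$SAVE_FILE}"
    local dev=0
    while [ "$dev" -lt "$DEVICE_COUNT" ]; do
        collect_hccn_for_device "$dev" "$out_file"
        dev=$((dev + 1))
    done
}
```

After sourcing these definitions, calling collect_hccn_all with no argument appends to npu_info_before.txt in the current directory. Passing each command as a single string keeps the recorded line identical to what was executed, including the grep pipelines.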
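  • Before training or inference, use npu-smi to query chip health information, and save both the query commands and their results to npu_info_before.txt. The commands and examples are as follows:
    • Run the following command to query basic device information.
      /usr/local/bin/npu-smi info
      The output is as follows: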
      +------------------------------------------------------------------------------------------------+
      | npu-smi 24.1.rc1                 Version: 24.1.rc1                                             |
      +---------------------------+---------------+----------------------------------------------------+
      | NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
      | Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
      +===========================+===============+====================================================+
      ...
      +===========================+===============+====================================================+
      | 7     xxx                | OK            | 67.0        44                0    / 0             |
      | 0                         | 0000:3D:00.0  | 0           2505 / 15567      0    / 32768         |
      +===========================+===============+====================================================+
      +---------------------------+---------------+----------------------------------------------------+
      | NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
      +===========================+===============+====================================================+
      | No running processes found in NPU 0                                                            |
      +===========================+===============+====================================================+
      ...
      | No running processes found in NPU 7                                                            |
      +===========================+===============+====================================================+
      
    • Run the following command to query the ECC counts of high-bandwidth memory.
      /usr/local/bin/npu-smi info -i ${device_id} -t ecc
      The output is as follows:
      NPU ID                                   : 1
      Chip Count                               : 1
      
      DDR Single Bit Error Count               : 0
      DDR Double Bit Error Count               : 0
      DDR Single Bit Aggregate Total Err Cnt   : 0
      DDR Double Bit Aggregate Total Err Cnt   : 0
      DDR Single Bit Isolated Pages Count      : 0
      DDR Double Bit Isolated Pages Count      : 0
      HBM Single Bit Error Count               : 0
      HBM Double Bit Error Count               : 0
      HBM Single Bit Aggregate Total Err Cnt   : 0
      HBM Double Bit Aggregate Total Err Cnt   : 0
      HBM Single Bit Isolated Pages Count      : 0
      HBM Double Bit Isolated Pages Count      : 0
      Chip ID                                  : 0
      
    • Run the following command to query basic hardware information.
      /usr/local/bin/npu-smi info -i ${device_id} -t board
      The output is as follows:
      NPU ID                         : 0
      Software Version               : 23.0.5
      Firmware Version               : 7.1.0.7.220
      Compatibility                  : OK
      Board ID                       : 0x02
      PCB ID                         : A
      BOM ID                         : 1
      PCIe Bus Info                  : 0000:61:00.0
      Slot ID                        : 0
      Class ID                       : NA
      PCI Vendor ID                  : 0x19E5
      PCI Device ID                  : 0xD801
      Subsystem Vendor ID            : 0x0200
      Subsystem Device ID            : 0x0100
      Chip Count                     : 1
      
    • Run the following command to query basic hardware information and the name of the specified card.
      /usr/local/bin/npu-smi info -i ${device_id} -c 0 -t board
      The output is as follows:
      NPU ID                         : 0
      Chip ID                        : 0
      Chip Type                      : Ascend
      Chip Name                      : xxx
      Chip Version                   : V1
      Board ID                       : 0x02
      PCB ID                         : NA
      BOM ID                         : 1
      VDie ID                        : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
      NDie ID                        : 27216594 20401010 4E10C8D4 14CC040A A4102003
      Chip Position ID               : 0
      PCIe Bus Info                  : 0000:61:00.0
      Firmware Version               : 7.1.0.7.220
      
    • Run the following command to query memory usage.
      /usr/local/bin/npu-smi info -i ${device_id} -t usages
      The output is as follows:
      NPU ID                         : 0
      Chip Count                     : 1
      
      DDR Capacity(MB)               : 13553
      DDR Usage Rate(%)              : 6
      DDR Hugepages Total(page)      : 0
      DDR Hugepages Usage Rate(%)    : 0
      HBM Capacity(MB)               : 32768
      HBM Usage Rate(%)              : 0
      Aicore Usage Rate(%)           : 0
      Aicpu Usage Rate(%)            : 0
      Ctrlcpu Usage Rate(%)          : 0
      DDR Bandwidth Usage Rate(%)    : 0
      HBM Bandwidth Usage Rate(%)    : 0
      Chip ID                        : 0
      
    • Run the following command to query chip health information.
      /usr/local/bin/npu-smi info -i ${device_id} -c 0 -t health
      The output is as follows:
       Health Status                  : OK
       Error Code                     : NA
       Error Information              : NA
      
      A file storage example follows. Collect the information for all cards.
      /usr/local/bin/npu-smi info
      +------------------------------------------------------------------------------------------------+
      | npu-smi 23.0.5                   Version: 23.0.5                                               |
      +---------------------------+---------------+----------------------------------------------------+
      | NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
      | Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
      +===========================+===============+====================================================+
      | 0     xxx                | OK            | 73.1        37                0    / 0             |
      | 0                         | 0000:61:00.0  | 0           920  / 13553      0    / 32768         |
      +===========================+===============+====================================================+
      ...
      +===========================+===============+====================================================+
      | 7     xxx                | OK            | 67.0        38                0    / 0             |
      | 0                         | 0000:3D:00.0  | 0           2346 / 15567      0    / 32768         |
      +===========================+===============+====================================================+
      +---------------------------+---------------+----------------------------------------------------+
      | NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
      +===========================+===============+====================================================+
      | No running processes found in NPU 0                                                            |
      +===========================+===============+====================================================+
      ...
      +===========================+===============+====================================================+
      | No running processes found in NPU 7                                                            |
      +===========================+===============+====================================================+
      
      /usr/local/bin/npu-smi info -i 0 -c 0 -t health
      Health Status                  : OK
      Error Code                     : NA
      Error Information              : NA
      
      /usr/local/bin/npu-smi info -i 0 -t ecc
      NPU ID                                   : 0
      Chip Count                               : 1
      
      DDR Single Bit Error Count               : 0
      DDR Double Bit Error Count               : 0
      DDR Single Bit Aggregate Total Err Cnt   : 0
      DDR Double Bit Aggregate Total Err Cnt   : 0
      DDR Single Bit Isolated Pages Count      : 0
      DDR Double Bit Isolated Pages Count      : 0
      HBM Single Bit Error Count               : 0
      HBM Double Bit Error Count               : 0
      HBM Single Bit Aggregate Total Err Cnt   : 0
      HBM Double Bit Aggregate Total Err Cnt   : 0
      HBM Single Bit Isolated Pages Count      : 0
      HBM Double Bit Isolated Pages Count      : 0
      Chip ID                                  : 0
      
      /usr/local/bin/npu-smi info -i 0 -t board
      NPU ID                         : 0
      Software Version               : 23.0.5
      Firmware Version               : 7.1.0.7.220
      Compatibility                  : OK
      Board ID                       : 0x02
      PCB ID                         : A
      BOM ID                         : 1
      PCIe Bus Info                  : 0000:61:00.0
      Slot ID                        : 0
      Class ID                       : NA
      PCI Vendor ID                  : 0x19E5
      PCI Device ID                  : 0xD801
      Subsystem Vendor ID            : 0x0200
      Subsystem Device ID            : 0x0100
      Chip Count                     : 1
      
      /usr/local/bin/npu-smi info -i 0 -c 0 -t board
      NPU ID                         : 0
      Chip ID                        : 0
      Chip Type                      : Ascend
      Chip Name                      : xxx
      Chip Version                   : V1
      Board ID                       : 0x02
      PCB ID                         : NA
      BOM ID                         : 1
      VDie ID                        : 42C711D4 20B03704 4A10C8D4 14CC040A D2102003
      NDie ID                        : 27216594 20401010 4E10C8D4 14CC040A A4102003
      Chip Position ID               : 0
      PCIe Bus Info                  : 0000:61:00.0
      Firmware Version               : 7.1.0.7.220
      
      /usr/local/bin/npu-smi info -i 0 -t usages
      NPU ID                         : 0
      Chip Count                     : 1
      
      DDR Capacity(MB)               : 13553
      DDR Usage Rate(%)              : 6
      DDR Hugepages Total(page)      : 0
      DDR Hugepages Usage Rate(%)    : 0
      HBM Capacity(MB)               : 32768
      HBM Usage Rate(%)              : 0
      Aicore Usage Rate(%)           : 0
      Aicpu Usage Rate(%)            : 0
      Ctrlcpu Usage Rate(%)          : 0
      DDR Bandwidth Usage Rate(%)    : 0
      HBM Bandwidth Usage Rate(%)    : 0
      Chip ID                        : 0
      
      /usr/local/bin/npu-smi info -i 0 -c 0 -t health
       Health Status                  : OK
       Error Code                     : NA
       Error Information              : NA
      ...
      
    • Separate the results of consecutive commands with one blank line. For example:
      /usr/local/bin/npu-smi info -i 0 -c 0 -t health
      XXXX
      
      /usr/local/bin/npu-smi info -i 1 -c 0 -t health
      
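Once the npu-smi results are saved, they can be scanned before a job starts. The following is a minimal sketch that relies only on the field labels shown in the sample output above ("Health Status" and the ECC counter names). The function names count_unhealthy and nonzero_ecc_counters are illustrative assumptions, and treating every value other than OK as unhealthy is a simplification of this sketch rather than documented tool behavior.

```shell
# Sketch only: function names are illustrative assumptions.

# Read saved npu-smi output on standard input and print how many
# "Health Status" lines carry a value other than OK.
count_unhealthy() {
    awk -F':' '
        /Health Status/ {
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding blanks
            if (value != "OK") bad++
        }
        END { print bad + 0 }
    '
}

# Read saved npu-smi output on standard input and print every ECC
# counter line whose value is not zero.
nonzero_ecc_counters() {
    awk -F':' '
        /Bit (Error Count|Aggregate Total Err Cnt|Isolated Pages Count)/ {
            value = $2
            gsub(/[ \t]/, "", value)
            if (value + 0 != 0) print
        }
    '
}
```

For example, running count_unhealthy with npu_info_before.txt redirected to its standard input prints 0 when every recorded health query reported OK. The ECC listing is intended for comparison against values collected after the job, not as a pass/fail check.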
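  • Before training or inference, use other related commands to query the NPU environment check information, and save both the query commands and their results to npu_info_before.txt. The commands and examples are as follows:
    • Run the following commands to record the current system time.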
      datetime=$(date "+%Y-%m-%d %H:%M:%S")
      echo "Datetime: $datetime">>${save_file}
      echo -e "\n">>${save_file}
      
      The following line is appended to ${save_file}:
      Datetime: 2024-06-26 01:13:36
    • Run the following command to query the driver version.
      cat /usr/local/Ascend/driver/version.info
      The output is as follows:
      Version=24.1.rc1
      ascendhal_version=7.35.19
      aicpu_version=1.0
      tdt_version=1.0
      log_version=1.0
      prof_version=2.0
      dvppkernels_version=1.1
      tsfw_version=1.0
      Innerversion=V100R001C15SPC006B220
      compatible_version=[V100R001C30],[V100R001C13],[V100R001C15],[V100R001C17]
      compatible_version_fw=[7.0.0,7.2.99]
      
    • Run the following command to query the firmware version.
      cat /usr/local/Ascend/firmware/version.info

      The output is as follows:
      Version=7.1.0.11.220
      firmware_version=1.0
      package_version=23.0.7
      compatible_version_drv=[23.0.rc3,23.0.rc3.],[23.0.0,23.0.0.]
      
    • Run the following command to query the NNAE version.
      cat /usr/local/Ascend/nnae/latest/ascend_nnae_install.info

      The output is as follows:
      package_name=Ascend-cann-nnae
      version=8.0.RC3
      innerversion=V100R001C19SPC001B137
      compatible_version=[V100R001C13,V100R001C19],[V100R001C30]
      arch=x86_64
      os=linux
      path=/usr/local/Ascend/nnae/8.0.RC3
      
    • Run the following command to query the CANN version (aarch64 architecture).
      cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info

      The output is as follows:
      package_name=Ascend-cann-toolkit
      version=7.0.T10
      innerversion=V100R001C13B222
      compatible_version=[V100R001C29],[V100R001C30],[V100R001C13],[V100R003C10],[V100R003C11]
      arch=aarch64
      os=linux
      path=/usr/local/Ascend/ascend-toolkit/7.0.T10/aarch64-linux
      
    • Run the following command to query the CANN version (x86_64 architecture).
      cat /usr/local/Ascend/ascend-toolkit/latest/x86_64-linux/ascend_toolkit_install.info

      The output is as follows:
      package_name=Ascend-cann-toolkit
      version=8.0.0
      innerversion=V100R001C20B053
      compatible_version=[V100R001C15],[V100R001C17],[V100R001C18],[V100R001C19],[V100R001C20]
      arch=x86_64
      os=linux
      path=/usr/local/Ascend/ascend-toolkit/8.0.0/x86_64-linux
      
    • Run the following commands to query the AI framework versions.
      pip list | grep "torch "
      pip list | grep torch-npu
      pip list | grep "mindspore "
      

      The output is as follows:
      torch              1.11.0
      torch-npu          2.1.0.post8.dev20241009
      mindspore          2.3.0
      
    • Run the following command to query the detailed firmware component versions.
      /usr/local/Ascend/driver/tools/upgrade-tool --device_index -1 --component -1 --version
      The output is as follows:
      {
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(0).
      {"device_id":0, "component":nve, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(3).
      {"device_id":0, "component":uefi, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(8).
      {"device_id":0, "component":imu, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(0), componentType(9).
      {"device_id":0, "component":imp, "version":7.1.0.7.220}
      
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(0).
      {"device_id":7, "component":nve, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(3).
      {"device_id":7, "component":uefi, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(8).
      {"device_id":7, "component":imu, "version":7.1.0.7.220}
      Get component version(7.1.0.7.220) succeed for deviceId(7), componentType(9).
      {"device_id":7, "component":imp, "version":7.1.0.7.220}
      }
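
The version files shown above use a KEY=VALUE layout, so a single field can be pulled out when only the version number is needed, for example to compare nodes. The following is a minimal sketch. The function name get_info_value is an illustrative assumption, and the key match is case-sensitive because the sample files use both Version and version.

```shell
# Sketch only: get_info_value is an illustrative helper name.

# Print the value of KEY from a KEY=VALUE info file (first match only).
# Prints nothing if the key is absent.
get_info_value() {
    local file="$1"
    local key="$2"
    awk -F'=' -v k="$key" '
        $1 == k { print substr($0, length(k) + 2); exit }
    ' "$file"
}
```

With the sample files above, get_info_value /usr/local/Ascend/driver/version.info Version would print the driver version, while the NNAE and toolkit install files use the lowercase key version.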