Performing Fault Diagnosis

Function

Obtains processor health information, runs computing power, power consumption, and bandwidth tests on the processor, and determines the health status of the current product from the test results.

Commands for Querying Test Parameters

You can run either of the following commands to list the parameters of the fault diagnostics command:

ascend-dmi --dg -h

ascend-dmi --dg --help

Table 1 describes the parameters.

Table 1 Parameter description

Parameter

Description

Mandatory

[-dg, --dg, --diagnosis]

Performs a fault diagnostic test of the entire card. -dg is supported, but --dg or --diagnosis is recommended. If the CANN package is not installed in the default path, or the product is an Atlas 500 AI edge station, use the [-p, --path] parameter to specify the installation path.

Yes

[-i, --items]

Specifies the diagnosis items, including:

  • Software:
    • cann: compatibility among the CANN software layers and between CANN and the driver.
  • Hardware:
    • driver: driver health diagnosis.
    • device: device health diagnosis.
    • network: network health diagnosis. Supported only on Ascend 910.
    • bandwidth: local bandwidth in the Host to Device, Device to Host, Device to Device, and Peer to Peer directions.
    • aiflops: computing power.

You can specify one or more items among driver, cann, device, network, bandwidth, and aiflops. Separate the items with commas (,). For details about the check item status, see Table 2.

No

[-d, --device]

Specifies the ID of the device to be diagnosed. The device ID is the ID of the Ascend AI Processor. You can run the ascend-dmi --info command to obtain the number of processors from the Chip parameter displayed. For example, if an Atlas 300I inference card is configured with four Ascend AI Processors, the value of Device ID ranges from 0 to 3.

  • If the -d parameter is used, P2P bandwidth verification is not performed by default.
  • If the check items specified with [-i, --items] include device, network, bandwidth, or aiflops, this parameter is validated. You can specify one or more device IDs, separated by commas (,). If no device ID is specified, the diagnosis results of all devices are returned by default.
  • If the check items specified with [-i, --items] include only cann or driver, you do not need to set this parameter.

No

[-p, --path]

Specifies the installation path of the NNRT or NNAE package. The specified path must meet security requirements and cannot contain the wildcard (*).

  • If the check items specified with [-i, --items] include cann, this parameter is validated. If the software package is not installed in the default path, set this parameter to the actual installation path. For the Atlas 500 AI edge station, you must specify the path as /opt/ascend/.
    NOTE:

    If this parameter is not specified and the software package is installed by the root user, the default path /usr/local/Ascend is used.

  • If the check items specified with [-i, --items] do not include cann, leave this parameter unspecified.

No

Leave -i, -d, and -p unspecified

The diagnosis results of all check items of all devices are returned.

No

[-fmt, --fmt, --format]

Specifies the output format. The value can be normal or json. If this parameter is not specified, the default value normal is used.

No

  • To ensure that the test results are correct and accurate, run the fault diagnostic test by itself rather than alongside other tests.
  • If multiple level-2 parameters such as -i and -d follow ascend-dmi --dg, you can specify them in any order; the order does not affect the command output. For example, the output of ascend-dmi --dg -i driver,cann,device -d 0,1 -p /usr/local/Ascend is the same as that of ascend-dmi --dg -d 0,1 -i driver,cann,device -p /usr/local/Ascend.
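The parameter rules in Table 1 can be sketched as a small Python helper that assembles the argument list for the diagnosis command. This is only an illustration under stated assumptions: the function name build_dg_command and its keyword arguments are hypothetical and are not part of ascend-dmi, and the choice to silently omit -d or -p when Table 1 says they are not needed is a design decision of this sketch. Only the options documented above (--dg, -i, -d, -p, --fmt) are used.

```python
from typing import List, Optional, Sequence

# Check items listed in Table 1.
ALL_ITEMS = ("driver", "cann", "device", "network", "bandwidth", "aiflops")
# Items for which Table 1 says the device ID parameter applies.
PER_DEVICE_ITEMS = {"device", "network", "bandwidth", "aiflops"}


def build_dg_command(
    items: Optional[Sequence[str]] = None,
    devices: Optional[Sequence[int]] = None,
    path: Optional[str] = None,
    fmt: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for `ascend-dmi --dg` (hypothetical helper)."""
    cmd = ["ascend-dmi", "--dg"]

    if items:
        unknown = [item for item in items if item not in ALL_ITEMS]
        if unknown:
            raise ValueError(f"unknown check items: {unknown}")
        cmd += ["-i", ",".join(items)]

    # Pass -d only when a per-device item is selected, or when -i is
    # omitted and all check items run.
    if devices and (not items or PER_DEVICE_ITEMS & set(items)):
        cmd += ["-d", ",".join(str(d) for d in devices)]

    if path is not None:
        if "*" in path:
            raise ValueError("path must not contain the wildcard (*)")
        # Pass -p only when cann is checked, or when -i is omitted.
        if not items or "cann" in items:
            cmd += ["-p", path]

    if fmt is not None:
        if fmt not in ("normal", "json"):
            raise ValueError("fmt must be 'normal' or 'json'")
        cmd += ["--fmt", fmt]

    return cmd
```

For example, build_dg_command(["driver", "cann", "device"], [0, 1], "/usr/local/Ascend") yields the same arguments as the command shown in the note above. Because the tool accepts level-2 parameters in any order, the fixed order used here is purely cosmetic.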

Example

The command output on an inference server is similar to that on a training server. The following example is based on a training server.

The following command specifies the diagnosis items, device IDs, and software package installation path:

ascend-dmi --dg -i driver,cann,device -d 0,1 -p /usr/local/Ascend
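If the diagnosis needs to be triggered from a script, the command above could be wrapped roughly as follows. This is a minimal sketch, not an official interface: the function name run_diagnosis is hypothetical, the output is returned as raw text and deliberately not parsed because the output layout is not described in this section, and no meaning is assigned to the exit code.

```python
import shutil
import subprocess
from typing import Optional, Sequence

# Arguments taken from the example command above.
EXAMPLE_ARGS = ["--dg", "-i", "driver,cann,device", "-d", "0,1", "-p", "/usr/local/Ascend"]


def run_diagnosis(args: Sequence[str], executable: str = "ascend-dmi") -> Optional[str]:
    """Run the diagnosis command and return its raw standard output.

    Returns None when the executable cannot be found on PATH.
    (Hypothetical helper; the output is not interpreted here.)
    """
    exe = shutil.which(executable)
    if exe is None:
        return None
    completed = subprocess.run(
        [exe, *args],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as text
        check=False,          # do not raise on a non-zero exit code
    )
    return completed.stdout
```

Calling run_diagnosis(EXAMPLE_ARGS) on a host where ascend-dmi is installed returns whatever the tool prints; on other hosts it returns None.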

Fault Check Items

Table 2 Fault check items

| Type | Check Item | Command Output | Description |
|---|---|---|---|
| Hardware | driver | HEALTH | The driver and firmware are properly installed, and the driver status is healthy. |
| Hardware | driver | FAIL | The driver or firmware is incorrectly installed, or the driver health status could not be read. |
| Hardware | driver | GENERAL_WARN | General warning. For details, see the displayed error information. |
| Hardware | driver | IMPORTANT_WARN | Important warning. For details, see the displayed error information. |
| Hardware | driver | EMERGENCY_WARN | Emergency warning. For details, see the displayed error information. |
| Hardware | device | SKIP | The product does not support this test. |
| Hardware | device | HEALTH | The device is healthy. |
| Hardware | device | FAIL | The device is faulty. |
| Hardware | device | GENERAL_WARN | General warning. For details, see the displayed error information. |
| Hardware | device | IMPORTANT_WARN | Important warning. For details, see the displayed error information. |
| Hardware | device | EMERGENCY_WARN | Emergency warning. For details, see the displayed error information. |
| Hardware | network | SKIP | The product does not support this test. |
| Hardware | network | FAIL | The network is faulty. |
| Hardware | network | WARN | A network alarm has been generated. |
| Hardware | network | PASS | The network is healthy. |
| Hardware | aiflops | FAIL | The computing power test could not be completed, or the tested computing power is less than the minimum reference value. |
| Hardware | aiflops | WARN | The tested computing power is greater than the minimum reference value but less than the reference warning value. |
| Hardware | aiflops | PASS | The tested computing power is normal (greater than the reference warning value). |
| Hardware | bandwidth | FAIL | The bandwidth test could not be completed, or the tested bandwidth is less than the minimum reference value. |
| Hardware | bandwidth | WARN | The tested bandwidth is greater than the minimum reference value but less than the reference warning value. |
| Hardware | bandwidth | PASS | The tested bandwidth is normal (greater than the reference warning value). |
| Software | cann | FAIL | The installations of nnae, nnrt, and toolkit are abnormal, or the driver installation is abnormal (the compatibility between CANN and the driver does not meet the requirements). |
| Software | cann | PASS | The tested CANN software is normal. |
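
The FAIL, WARN, and PASS rule that Table 2 gives for the aiflops and bandwidth items can be sketched as a two-threshold comparison. The sketch below is an illustration only: the function name classify_measurement is hypothetical, the actual reference values used by ascend-dmi are not given in this section, and the handling of a value exactly equal to a reference value is an assumption because Table 2 describes only the strictly less-than and greater-than cases.

```python
from typing import Optional


def classify_measurement(
    measured: Optional[float],
    minimum_ref: float,
    warning_ref: float,
) -> str:
    """Map a measured value to FAIL, WARN, or PASS (hypothetical helper).

    measured is None when the test itself could not be completed.
    """
    if warning_ref < minimum_ref:
        raise ValueError("warning_ref must not be lower than minimum_ref")
    if measured is None:
        return "FAIL"  # the test failed to produce a value
    if measured < minimum_ref:
        return "FAIL"  # below the minimum reference value
    if measured < warning_ref:
        # Assumption: a value equal to minimum_ref is treated as WARN.
        return "WARN"
    # Assumption: a value equal to warning_ref is treated as PASS.
    return "PASS"
```

With made-up reference values of 60 and 80, a measured value of 50 maps to FAIL, 70 to WARN, and 90 to PASS.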