Performing Fault Diagnosis
Function
This function obtains the processor health information, runs computing power, power consumption, and bandwidth tests on the processor, and determines the health status of the current product based on the test results.
Commands for Querying Test Parameters
You can run either of the following commands to list the parameters of the fault diagnostics command:
ascend-dmi --dg -h
ascend-dmi --dg --help
Table 1 describes the parameters.
| Parameter | Description | Mandatory |
|---|---|---|
| [-dg, --dg, --diagnosis] | Performs a fault diagnostic test of the entire card. -dg is supported, but --dg or --diagnosis is recommended. If the CANN package was not installed in the default path, or the product is an Atlas 500 AI edge station, use the [-p, --path] parameter to specify the installation path. | Yes |
| [-i, --items] | Specifies the diagnosis items. You can specify one or more items among driver, cann, device, network, bandwidth, and aiflops. Separate the items with commas (,). For details about the check item status, see Table 2. | No |
| [-d, --device] | Specifies the ID of the device to be diagnosed. The device ID is the ID of the Ascend AI Processor. You can run the ascend-dmi --info command to obtain the number of processors from the Chip parameter displayed. For example, if an Atlas 300I inference card is configured with four Ascend AI Processors, the value of Device ID ranges from 0 to 3. | No |
| [-p, --path] | Specifies the installation path of the NNRT or NNAE package. The specified path must meet security requirements and cannot contain the wildcard (*). | No |
| Leave -i, -d, and -p unspecified | The diagnosis results of all check items of all devices are returned. | No |
| [-fmt, --fmt, --format] | Specifies the output format. The value can be normal or json. If this parameter is not specified, the default value normal is used. | No |
- To ensure that the test result is correct and accurate, run the fault diagnostic test on its own, not together with other tests.
- If multiple level-2 parameters such as -i and -d follow ascend-dmi --dg, you can specify them in any order. The order does not affect the command output. For example, the output of ascend-dmi --dg -i driver,cann,device -d 0,1 -p /usr/local/Ascend is the same as that of ascend-dmi --dg -d 0,1 -i driver,cann,device -p /usr/local/Ascend.
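The value rules in Table 1 can be checked before the command is run. The following is a minimal sketch and is not part of ascend-dmi: the helper names `check_items` and `device_id_range` are hypothetical, and the processor count is assumed to have been read beforehand from the Chip parameter in the `ascend-dmi --info` output.

```python
# Hypothetical helpers (not part of ascend-dmi) that mirror the rules in Table 1.

# Diagnosis items accepted by -i/--items, as listed in Table 1.
VALID_ITEMS = ("driver", "cann", "device", "network", "bandwidth", "aiflops")


def check_items(items):
    """Return the comma-separated value for -i, rejecting unknown items."""
    unknown = [item for item in items if item not in VALID_ITEMS]
    if unknown:
        raise ValueError(f"unknown diagnosis items: {unknown}")
    return ",".join(items)


def device_id_range(chip_count):
    """Return the valid device IDs for a product with chip_count processors."""
    if chip_count < 1:
        raise ValueError("chip_count must be at least 1")
    # Device IDs start at 0, so four processors give IDs 0 to 3.
    return list(range(chip_count))


# Example: an Atlas 300I inference card configured with four processors.
print(device_id_range(4))
print(check_items(["driver", "cann", "device"]))
```

Rejecting an unknown item locally is a design choice of this sketch; ascend-dmi performs its own parameter checks.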
Example
The command output on an inference server is similar to that on a training server. The following uses the screenshots on a training server as an example.
The following command specifies the diagnosis items, device IDs, and software package installation path:
ascend-dmi --dg -i driver,cann,device -d 0,1 -p /usr/local/Ascend
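When the diagnosis is launched from a script, the command above can be assembled from its parts. The following is a sketch under stated assumptions: `build_dg_command` is a hypothetical helper, not an ascend-dmi API, and it only builds the argument list without running anything.

```python
import shlex


def build_dg_command(items=None, devices=None, path=None, fmt=None):
    """Assemble an ascend-dmi --dg argument list (hypothetical helper)."""
    argv = ["ascend-dmi", "--dg"]
    if items:
        # -i takes the diagnosis items separated by commas.
        argv += ["-i", ",".join(items)]
    if devices:
        # -d takes the device IDs, comma-separated as in the example command.
        argv += ["-d", ",".join(str(d) for d in devices)]
    if path:
        # Table 1: the path cannot contain the wildcard (*).
        if "*" in path:
            raise ValueError("path must not contain the wildcard (*)")
        argv += ["-p", path]
    if fmt:
        # Table 1: the output format is either normal or json.
        if fmt not in ("normal", "json"):
            raise ValueError("fmt must be 'normal' or 'json'")
        argv += ["--fmt", fmt]
    return argv


command = build_dg_command(
    items=["driver", "cann", "device"],
    devices=[0, 1],
    path="/usr/local/Ascend",
)
print(shlex.join(command))
# prints: ascend-dmi --dg -i driver,cann,device -d 0,1 -p /usr/local/Ascend
```

On a host where ascend-dmi is installed, the returned list could be passed to `subprocess.run`. The order in which the optional parameters are appended does not affect the command output.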

Fault Check Items
Table 2 describes the check items and the meaning of each command output.
| Type | Check Item | Command Output | Description |
|---|---|---|---|
| Hardware | driver | HEALTH | The driver firmware is properly installed, and the driver status is healthy. |
| Hardware | driver | FAIL | |
| Hardware | driver | GENERAL_WARN | General warning (For details, see the displayed error information.) |
| Hardware | driver | IMPORTANT_WARN | Important warning (For details, see the displayed error information.) |
| Hardware | driver | EMERGENCY_WARN | Emergency warning (For details, see the displayed error information.) |
| Hardware | device | SKIP | The product does not support this test. |
| Hardware | device | HEALTH | The device is healthy. |
| Hardware | device | FAIL | The device fails. |
| Hardware | device | GENERAL_WARN | General warning (For details, see the displayed error information.) |
| Hardware | device | IMPORTANT_WARN | Important warning (For details, see the displayed error information.) |
| Hardware | device | EMERGENCY_WARN | Emergency warning (For details, see the displayed error information.) |
| Hardware | network | SKIP | The product does not support this test. |
| Hardware | network | FAIL | The network fails. |
| Hardware | network | WARN | An alarm is generated for the network. |
| Hardware | network | PASS | The network is healthy. |
| Hardware | aiflops | FAIL | |
| Hardware | aiflops | WARN | The tested computing power is greater than the minimum reference value, but less than the reference warning value. |
| Hardware | aiflops | PASS | The tested computing power is normal (greater than the reference warning value). |
| Hardware | bandwidth | FAIL | |
| Hardware | bandwidth | WARN | The tested bandwidth is greater than the minimum reference value, but less than the reference warning value. |
| Hardware | bandwidth | PASS | The tested bandwidth is normal (greater than the reference warning value). |
| Software | cann | FAIL | |
| Software | cann | PASS | The tested CANN is normal. |
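The status values above can be interpreted in a script, for example to pick out the most serious result of a run. The following is an illustrative sketch only. The severity ranking, the function names, and the reference values are assumptions made for this example. The table does not define an order between statuses, does not describe the FAIL condition for aiflops and bandwidth, and does not say how a value exactly equal to a reference value is treated.

```python
# Assumed ranking for illustration; the check item table does not define one.
SEVERITY = {
    "SKIP": 0,
    "PASS": 0,
    "HEALTH": 0,
    "WARN": 1,
    "GENERAL_WARN": 1,
    "IMPORTANT_WARN": 2,
    "EMERGENCY_WARN": 3,
    "FAIL": 4,
}


def classify_measurement(value, minimum, warning):
    """Map an aiflops or bandwidth measurement to PASS, WARN, or FAIL.

    PASS and WARN follow the table. Returning FAIL for everything else,
    including values equal to a reference value, is an assumption.
    """
    if value > warning:
        return "PASS"
    if value > minimum:
        return "WARN"
    return "FAIL"


def worst_status(results):
    """Return the (check item, status) pair with the highest assumed severity."""
    return max(results.items(), key=lambda pair: SEVERITY[pair[1]])


# Hypothetical reference values: minimum 50, warning 80.
print(classify_measurement(90, 50, 80))
print(classify_measurement(60, 50, 80))
print(worst_status({"driver": "HEALTH", "network": "WARN", "cann": "PASS"}))
```

If several check items share the highest severity, `worst_status` returns the first one in the dictionary.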