One-Click Diagnosis
Function
Fault diagnosis currently supports multiple diagnostic items, such as computing power, bandwidth, and signal quality. Each item requires its own parameters, and running the items one at a time takes a long time. In real-world scenarios such as routine inspection, however, multiple items must be diagnosed together to determine the health status of the product.
Ascend DMI therefore groups the existing diagnostic items by level. You can specify a diagnosis scenario and diagnose multiple items in one command to improve O&M efficiency. For details about the diagnosis scenarios, see Table 1.
| Scenario | Diagnostic Item | Diagnosis Duration (Atlas Inference Products) | Diagnosis Duration (Other Products) | Whether NPU Training or Inference Is Affected |
|---|---|---|---|---|
| healthCheck | CANN, driver, device, network, signal quality, on-chip memory diagnosis | ≤ 2 minutes | ≤ 2 minutes | Yes |
| performanceCheck | Bandwidth/Aiflops/NIC | 14 minutes–3 hours | 14 minutes–3 hours | Yes |
| stressTest | AICORE/full on-chip memory/P2P/power consumption/AICPU stress test | 7.5 hours–9.5 hours | 1.5 hours–3.5 hours | Yes |
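When planning a maintenance window, it can help to add up the duration ranges from Table 1. The following is a minimal sketch, not part of Ascend DMI. It assumes the selected scenarios run back to back so that their durations simply add, and it treats "≤ 2 minutes" as the range 0 to 2 minutes. The dictionary keys `atlas_inference` and `other` and the function name `estimate_window` are illustrative names chosen for this example.

```python
# Duration bounds in minutes, transcribed from Table 1.
# The product-family keys are illustrative labels, not tool options.
DURATION_MINUTES = {
    "healthCheck": {"atlas_inference": (0, 2), "other": (0, 2)},
    "performanceCheck": {"atlas_inference": (14, 180), "other": (14, 180)},
    "stressTest": {"atlas_inference": (450, 570), "other": (90, 210)},
}


def estimate_window(scenarios, product_family):
    """Return (min, max) total minutes, assuming scenarios run sequentially."""
    low = high = 0
    for name in scenarios:
        lo, hi = DURATION_MINUTES[name][product_family]
        low += lo
        high += hi
    return low, high


if __name__ == "__main__":
    all_three = ["healthCheck", "performanceCheck", "stressTest"]
    print(estimate_window(all_three, "atlas_inference"))  # (464, 752)
    print(estimate_window(all_three, "other"))            # (104, 392)
```

Under these assumptions, running all three scenarios could occupy an Atlas inference product for roughly 7.7 to 12.5 hours, so the NPU should be free of training and inference workloads for that long.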
Parameters
Table 2 describes the parameters.
Example
The following command runs healthCheck, performanceCheck, and stressTest in a single invocation and skips the foolproof (confirmation) prompt.
ascend-dmi --dg --se healthCheck,performanceCheck,stressTest -q
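If the command is issued from an inspection script, the scenario list can be validated before the tool is called. The sketch below only assembles the argument list; it uses the `--dg`, `--se`, and `-q` options exactly as they appear in the command above and the three scenario names from Table 1. The function name `build_diagnosis_command` and its parameters are hypothetical helpers for this example, not part of Ascend DMI.

```python
import shlex

# Scenario names taken from Table 1.
KNOWN_SCENARIOS = ("healthCheck", "performanceCheck", "stressTest")


def build_diagnosis_command(scenarios, skip_prompt=True, binary="ascend-dmi"):
    """Build the argument list for a one-click diagnosis run."""
    if not scenarios:
        raise ValueError("at least one scenario is required")
    unknown = [s for s in scenarios if s not in KNOWN_SCENARIOS]
    if unknown:
        raise ValueError("unknown scenario(s): " + ", ".join(unknown))
    # Scenarios are passed as one comma-separated value, as in the example.
    argv = [binary, "--dg", "--se", ",".join(scenarios)]
    if skip_prompt:
        argv.append("-q")
    return argv


if __name__ == "__main__":
    argv = build_diagnosis_command(
        ["healthCheck", "performanceCheck", "stressTest"]
    )
    print(shlex.join(argv))
    # To execute on a host where ascend-dmi is installed:
    #   import subprocess
    #   subprocess.run(argv, check=False)
```

Passing the arguments as a list to `subprocess.run` avoids shell quoting issues, and rejecting unknown scenario names early prevents starting a run that may take hours with a mistyped argument.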
- Command output for the Atlas A2 training product and Atlas A3 training product:

      [root@l****]# ascend-dmi --dg --scene healthCheck,performanceCheck,stressTest -q
      Summary:
          Arch: aarch64
          Mode: *****
          Time: 20251111-16:16:23
      Hardware:
          driver: HEALTH
          device: HEALTH
          network: PASS
          signalQuality: PASS
          hbm: PASS
          bandwidth: PASS
          aiflops: PASS
          hbmStress: PASS
          bandwidthStress: PASS
          aicore: PASS
          edp: PASS
          tdp: PASS
          aicpu: PASS
          nic: PASS
      Software:
          cann: PASS

- Command output for the Atlas 300I Pro inference card, Atlas 300V video analysis card, and Atlas 300V Pro video analysis card:

      [root@l****]# ./ascend-dmi --dg --se healthCheck,performanceCheck,stressTest -q
      Summary:
          Arch: aarch64
          Mode: *****
          Time: 20251111-07:27:32
      Hardware:
          driver: HEALTH
          device: HEALTH
          network: SKIP *** The current device does not support the network health diagnosis.
          signalQuality: SKIP *** Current server does not support signal quality diagnosis.
          chipMemory: PASS
          bandwidth: PASS
          aiflops: PASS
          chipMemoryStress: PASS
          bandwidthStress: SKIP *** The current device does not support the p2p stress test.
          aicore: SKIP *** The current device does not support the Aicore diagnosis.
          edp: SKIP *** Current server does not support TDP/EDP.
          tdp: SKIP *** Current server does not support TDP/EDP.
          aicpu: SKIP *** The current device does not support the Aicpu diagnosis.
          nic: SKIP *** The current device does not support the nic diagnosis.
      Software:
          cann: PASS

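For automated inspection, the summary above can be reduced to a mapping from item name to status. The following is a minimal sketch under stated assumptions: it relies only on the `name: STATUS` pairs visible in the sample output, recognizes only the status tokens that appear in this document, and treats SKIP as acceptable. The function names are hypothetical, and the real output format may differ between tool versions.

```python
import re

# Status tokens that appear in this document's sample output and tables.
_STATUSES = (
    "GENERAL_WARN", "IMPORTANT_WARN", "EMERGENCY_WARN",
    "HEALTH", "PASS", "SKIP", "INFO", "WARN", "FAIL",
)
# Matches "<item>: <STATUS>"; headers such as "Summary:" or "Time:" are
# ignored because they are not followed by a status token.
_ITEM_RE = re.compile(r"\b([A-Za-z]\w*):\s+(" + "|".join(_STATUSES) + r")\b")


def parse_summary(text):
    """Map each item name (for example 'driver') to its reported status."""
    return {name: status for name, status in _ITEM_RE.findall(text)}


def items_needing_attention(results):
    """Return items whose status is not PASS, HEALTH, or SKIP."""
    acceptable = {"PASS", "HEALTH", "SKIP"}
    return sorted(n for n, s in results.items() if s not in acceptable)


if __name__ == "__main__":
    sample = (
        "Summary: Arch: aarch64 Mode: ***** Time: 20251111-07:27:32 "
        "Hardware: driver: HEALTH device: HEALTH "
        "network: SKIP *** The current device does not support the network "
        "health diagnosis. chipMemory: PASS aiflops: PASS "
        "Software: cann: PASS"
    )
    results = parse_summary(sample)
    print(results)
    print(items_needing_attention(results))
```

Because the pattern keys on the status token rather than on line breaks, it works whether the report is printed one item per line or captured as a single run of text.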
Fault Diagnosis Check Items
| Scenario | Item | Command Output | Description |
|---|---|---|---|
| healthCheck | CANN | PASS | The test on CANN is normal. |
| | | FAIL | |
| | Driver | HEALTH | The driver and firmware are properly installed, and the driver status is healthy. |
| | | GENERAL_WARN | General warning (For details, see the displayed error information.) |
| | | IMPORTANT_WARN | Important warning (For details, see the displayed error information.) |
| | | EMERGENCY_WARN | Emergency warning (For details, see the displayed error information.) |
| | | FAIL | |
| | Chip | HEALTH | The device is healthy. |
| | | SKIP | The current product or scenario does not support this function. |
| | | GENERAL_WARN | General warning (For details, see the displayed error information.) |
| | | IMPORTANT_WARN | Important warning (For details, see the displayed error information.) |
| | | EMERGENCY_WARN | Emergency warning (For details, see the displayed error information.) |
| | | WARN | An unknown UB device is faulty. |
| | | FAIL | The device check fails. |
| | Network | PASS | The network is healthy. |
| | | SKIP | The current product or scenario does not support this function. |
| | | INFO | Information displayed for the network. |
| | | WARN | Alarm generated for the network. |
| | | FAIL | The network check fails. |
| | On-chip memory diagnosis | PASS | The on-chip memory check is passed and no exception occurs. |
| | | SKIP | The product or scenario does not support on-chip memory detection. |
| | | GENERAL_WARN | Historical isolated pages with multi-bit errors exist. 0x80E18401 is generated to warn of an NPU health management fault. If the number of these pages falls within the range [16, 64), normal operation is not affected. |
| | | EMERGENCY_WARN | |
| | | FAIL | |
| | Signal quality | PASS | The check is passed, and the signal quality of the PCIe, HCCS, and RoCE communication ports on the NPU is normal. |
| | | SKIP | The product or scenario does not support eye diagram diagnosis. |
| | | IMPORTANT_WARN | Important warning. The signal quality of at least one of PCIe, HCCS, or RoCE is abnormal. Contact Huawei technical support. |
| | | FAIL | The eye diagram detection fails. |
| performanceCheck | Aiflops | PASS | The computing power test result is normal (greater than the reference value). |
| | | WARN | Processor overtemperature is triggered during the computing power test. |
| | | FAIL | |
| | Bandwidth | PASS | The bandwidth test result is normal. |
| | | FAIL | |
| | NIC diagnosis | PASS | The NPU network port connectivity is normal and the network port bandwidth reaches the baseline value. |
| | | GENERAL_WARN | |
| | | IMPORTANT_WARN | The NPU network port bandwidth does not reach the baseline value. |
| | | FAIL | |
| | | SKIP | The product or scenario does not support NIC diagnosis. |
| stressTest | AICORE diagnosis | PASS | The diagnosis result is normal. |
| | | SKIP | The product or scenario does not support AICORE diagnosis. |
| | | EMERGENCY_WARN | Emergency warning. You are advised to replace the hardware. |
| | | FAIL | |
| | On-chip memory stress test | PASS | The on-chip memory stress test is passed. |
| | | SKIP | The product or scenario does not support the on-chip memory stress test. |
| | | FAIL | |
| | BandWidthStress | PASS | The stress test is passed, and the result is normal. |
| | | SKIP | The product or scenario does not support the P2P stress test. |
| | | EMERGENCY_WARN | Emergency warning. The stress test fails. You are advised to replace the hardware. |
| | | FAIL | Failed to call the API. Contact Huawei technical support. |
| | Power consumption stress test | PASS | The power consumption stress test result is normal. |
| | | SKIP | The product or scenario does not support the power consumption stress test. |
| | | IMPORTANT_WARN | A processor alarm is generated during the stress test. Handle the alarm based on the description. If the problem persists, contact Huawei technical support. |
| | | FAIL | The power consumption stress test fails. Contact Huawei technical support. |
| | AICPU stress test | PASS | The stress test result is normal. |
| | | SKIP | The product or scenario does not support the AICPU stress test. |
| | | EMERGENCY_WARN | Emergency warning. Replace the hardware. |
| | | FAIL | The AICPU stress test fails. Contact Huawei technical support. |
| Note: | | | |
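
When many items are reported at once, an inspection script may want to surface the most serious result first. The sketch below ranks the command output values listed in the table above. The numeric ordering is an assumption made for this example (the table does not define a ranking), and any value not listed is treated as seriously as FAIL. The names `SEVERITY` and `worst_status` are hypothetical.

```python
# Illustrative ranking of the command output values in the table above.
# The ordering is an assumption of this sketch, not defined by Ascend DMI.
SEVERITY = {
    "PASS": 0, "HEALTH": 0, "SKIP": 0,
    "INFO": 1,
    "WARN": 2, "GENERAL_WARN": 2,
    "IMPORTANT_WARN": 3,
    "EMERGENCY_WARN": 4,
    "FAIL": 5,
}


def worst_status(results):
    """Return the (item, status) pair with the highest severity.

    `results` maps item names to status strings. Returns None when the
    mapping is empty. Unknown statuses rank alongside FAIL.
    """
    if not results:
        return None
    return max(results.items(), key=lambda kv: SEVERITY.get(kv[1], 5))


if __name__ == "__main__":
    report = {
        "driver": "HEALTH",
        "nic": "IMPORTANT_WARN",
        "aicore": "EMERGENCY_WARN",
        "network": "SKIP",
    }
    print(worst_status(report))  # ('aicore', 'EMERGENCY_WARN')
```

For items such as AICORE, where the table advises hardware replacement on EMERGENCY_WARN, a site may prefer to rank that value above FAIL; adjust the mapping to local policy.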
