One-Click Diagnosis

Function

Fault diagnosis currently supports multiple diagnostic items, such as computing power, bandwidth, and signal quality. Each diagnostic item requires its own parameters, and running the items one at a time is slow. In real-world scenarios such as routine inspection, however, several items must be diagnosed together to determine the health status of the product.

Therefore, Ascend DMI groups the existing diagnostic items into tiered diagnosis scenarios. You can specify a scenario and diagnose multiple items in a single run, which improves O&M efficiency. For details about the diagnosis scenarios, see Table 1.

Table 1 Diagnosis scenarios

| Scenario | Diagnostic Item | Diagnosis Duration (Atlas Inference Products) | Diagnosis Duration (Other Products) | Whether NPU Training or Inference Is Affected |
|---|---|---|---|---|
| healthCheck | CANN, driver, device, network, signal quality, on-chip memory diagnosis | ≤ 2 minutes | ≤ 2 minutes | Yes |
| performanceCheck | Bandwidth, Aiflops, NIC | 14 minutes–3 hours | 14 minutes–3 hours | Yes |
| stressTest | AICORE, full on-chip memory, P2P, power consumption, AICPU stress test | 7.5 hours–9.5 hours | 1.5 hours–3.5 hours | Yes |
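For inspection planning, the upper bounds in Table 1 can be turned into a rough time budget. The following is a minimal sketch, not part of Ascend DMI: the names MAX_MINUTES and worst_case_minutes are hypothetical, and the sketch assumes that the durations of combined scenarios simply add up, which Table 1 does not state.

```python
# Upper duration bounds in minutes, copied from Table 1.
# Value layout: (Atlas inference products, other products).
MAX_MINUTES = {
    "healthCheck": (2, 2),
    "performanceCheck": (180, 180),  # 3 hours
    "stressTest": (570, 210),        # 9.5 hours and 3.5 hours
}


def worst_case_minutes(scenes, inference_product):
    """Sum the upper bounds of the selected scenarios.

    Assumption: scenarios run back to back, so their durations add.
    Raises KeyError for a scenario name that is not in Table 1.
    """
    column = 0 if inference_product else 1
    return sum(MAX_MINUTES[name][column] for name in scenes)
```

Under that additive assumption, running all three scenarios gives a budget of 752 minutes on Atlas inference products and 392 minutes on other products.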

Parameters

Table 2 describes the parameters.

Table 2 Parameters

| Parameter | Description | Mandatory |
|---|---|---|
| [-dg, --dg, --diagnosis] | Performs a fault diagnostic test of the entire NPU. | Yes |
| [-se, --scene, --se] | Specifies a diagnosis scenario. Currently, the following scenarios are supported: healthCheck, performanceCheck, and stressTest. | Yes |
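When the command is assembled by an inspection script, the scenario names can be validated before the tool is started. The following is a minimal sketch: the function build_diagnosis_command and its parameters are hypothetical, and only the options --dg, --se, and -q and the comma-separated scenario list are taken from this section.

```python
# Scenario names accepted by --se, as listed in Table 2.
VALID_SCENES = ("healthCheck", "performanceCheck", "stressTest")


def build_diagnosis_command(scenes, quiet=False, binary="ascend-dmi"):
    """Return the argument list for a one-click diagnosis run.

    scenes: scenario names, joined with commas as in the Example section.
    quiet:  append -q, which the Example section uses to skip the prompt.
    binary: executable name or path (assumed to be on PATH by default).
    """
    if not scenes:
        raise ValueError("at least one scenario is required")
    unknown = [name for name in scenes if name not in VALID_SCENES]
    if unknown:
        raise ValueError("unsupported scenario(s): " + ", ".join(unknown))
    argv = [binary, "--dg", "--se", ",".join(scenes)]
    if quiet:
        argv.append("-q")
    return argv
```

The returned list can be passed directly to a process launcher, which avoids shell quoting issues.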

Example

The following example runs healthCheck, performanceCheck, and stressTest in a single command and uses -q to skip the foolproof (confirmation) prompt.

ascend-dmi --dg --se healthCheck,performanceCheck,stressTest -q
  • Command output for the Atlas A2 training product and Atlas A3 training product:
    [root@l****]# ascend-dmi --dg --scene healthCheck,performanceCheck,stressTest -q
    Summary:
        Arch: aarch64
        Mode: *****
        Time: 20251111-16:16:23
    Hardware:
        driver:
            HEALTH
        device:
            HEALTH
        network:
            PASS
        signalQuality:
            PASS
        hbm:
            PASS
        bandwidth:
            PASS
        aiflops:
            PASS
        hbmStress:
            PASS
        bandwidthStress:
            PASS
        aicore:
            PASS
        edp:
            PASS
        tdp:
            PASS
        aicpu:
            PASS
        nic:
            PASS
    Software:
        cann:
            PASS
  • Command output for the Atlas 300I Pro inference card, Atlas 300V video analysis card, and Atlas 300V Pro video analysis card:
    [root@l****]# ./ascend-dmi --dg --se healthCheck,performanceCheck,stressTest -q
    Summary:
        Arch: aarch64
        Mode: *****
        Time: 20251111-07:27:32
    Hardware:
        driver:
            HEALTH
        device:
            HEALTH
        network:
            SKIP
            *** The current device does not support the network health diagnosis.
        signalQuality:
            SKIP
            *** Current server does not support signal quality diagnosis.
        chipMemory:
            PASS
        bandwidth:
            PASS
        aiflops:
            PASS
        chipMemoryStress:
            PASS
        bandwidthStress:
            SKIP
            *** The current device does not support the p2p stress test.
        aicore:
            SKIP
            *** The current device does not support the Aicore diagnosis.
        edp:
            SKIP
            *** Current server does not support TDP/EDP.
        tdp:
            SKIP
            *** Current server does not support TDP/EDP.
        aicpu:
            SKIP
            *** The current device does not support the Aicpu diagnosis.
        nic:
            SKIP
            *** The current device does not support the nic diagnosis.
    Software:
        cann:
            PASS
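For automated inspection, the report above can be converted into a per-item result. The following is a minimal sketch that assumes the layout shown in the two sample outputs (a section line, an item line ending with a colon, a status line in capitals, and optional lines starting with ***). The function name parse_report and the result structure are hypothetical, and other tool versions may print a different layout.

```python
import re

# A status line consists only of capitals and underscores, e.g. PASS or GENERAL_WARN.
STATUS_RE = re.compile(r"^[A-Z_]+$")
RESULT_SECTIONS = {"Hardware", "Software"}
ALL_SECTIONS = RESULT_SECTIONS | {"Summary"}


def parse_report(text):
    """Map each diagnostic item to its section, status, and messages."""
    results = {}
    section = None
    item = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and line[:-1] in ALL_SECTIONS:
            # Section header such as "Hardware:".
            section = line[:-1]
            item = None
        elif section in RESULT_SECTIONS and line.endswith(":"):
            # Item header such as "driver:".
            item = line[:-1]
            results[item] = {"section": section, "status": None, "messages": []}
        elif item is not None and line.startswith("***"):
            # Explanatory message attached to the current item.
            results[item]["messages"].append(line.lstrip("* ").strip())
        elif item is not None and STATUS_RE.match(line):
            results[item]["status"] = line
    return results
```

Lines in the Summary section and the shell prompt line are ignored, because they are neither item headers nor status lines.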

Fault Diagnosis Check Items

Scenario

Item

Command Output

Description

healthCheck

CANN

PASS

The test on CANN is normal.

FAIL

  • The CANN package fails to be installed.
  • The driver installation is abnormal. (The compatibility between CANN and the driver does not meet the requirements.)

Driver

HEALTH

The driver and firmware are properly installed, and the driver status is healthy.

GENERAL_WARN

General warning (For details, see the displayed error information.)

IMPORTANT_WARN

Important warning (For details, see the displayed error information.)

EMERGENCY_WARN

Emergency warning (For details, see the displayed error information.)

FAIL

  • The driver or firmware is incorrectly installed.
  • Failed to read the driver health status.

Chip

HEALTH

The device is healthy.

SKIP

The current product or scenario does not support this function.

GENERAL_WARN

General warning (For details, see the displayed error information.)

IMPORTANT_WARN

Important warning (For details, see the displayed error information.)

EMERGENCY_WARN

Emergency warning (For details, see the displayed error information.)

WARN

An unknown UB device is faulty.

FAIL

The device check fails.

Network

PASS

The network is healthy.

SKIP

The current product or scenario does not support this function.

INFO

Information displayed for the network.

WARN

Alarm generated for the network.

FAIL

The network check fails.

On-chip memory diagnosis

PASS

The on-chip memory check is passed and no exception occurs.

SKIP

The product or scenario does not support on-chip memory detection.

GENERAL_WARN

There are historical isolation pages with multi-bit errors. 0x80E18401 is reported to warn of an NPU health management fault. If the number of these pages falls within the range of [16, 64), normal operation is not affected.

EMERGENCY_WARN

  • The number of historical isolation pages with multi-bit errors and lines designated for device isolation is excessive. 0x80E18402 is reported to warn of an NPU health management fault. You are advised to use spare parts.
  • If the number of isolation lines in different banks in the same stack and PC is greater than or equal to 4, the current device is at high risk, and you are advised to use spare parts.
  • If the number of isolation lines within the same stack, sharing the same SID, but in different PCs is greater than or equal to 4, the current device is at high risk, and you are advised to use spare parts.
  • If the number of isolation lines in the same stack, PC, and bank, and sharing the same SID is greater than 16, the current device is at high risk, and you are advised to use spare parts.
  • Excluding the adjacent incorrect addresses within four bits with the same stack, SID, PC, and bank, if the number of other different addresses is greater than 5, the current device is at high risk, and you are advised to use spare parts.
  • If the number of real-time isolation pages with multi-bit errors is greater than or equal to 64, the current device is at high risk, and you are advised to use spare parts.
    NOTE:

    For the Atlas 300I Pro inference card, Atlas 300V video analysis card, Atlas 300V Pro video analysis card, and Atlas 300I Duo inference card, if the NPU fault code is 0x80DF8402 or the number of real-time isolation pages with multi-bit ECC errors is greater than or equal to 64, the current device is at high risk, and you are advised to use spare parts.

FAIL

Signal quality

PASS

The check is passed, and the signal quality of the PCIe, HCCS, and RoCE communication ports on the NPU is normal.

SKIP

The product or scenario does not support eye diagram diagnosis.

IMPORTANT_WARN

Important warning. The signal quality of at least one of the PCIe, HCCS, and RoCE communication ports is abnormal. Contact Huawei technical support.

FAIL

The eye diagram detection fails.

performanceCheck

Aiflops

PASS

The computing power test result is normal (greater than the reference value).

WARN

Processor overtemperature is triggered during the computing power test.

FAIL

  • The computing power test fails.
  • The computing power test result is less than the reference value.

Bandwidth

PASS

The bandwidth test result is normal.

FAIL

  • The bandwidth test fails.
  • The bandwidth test result is less than the reference value.
  • Solution: Contact Huawei technical support or locate the fault by referring to the section "FAQs".
  • Refer to Bandwidth Test.

NIC diagnosis

PASS

The NPU network port connectivity is normal and the network port bandwidth reaches the baseline value.

GENERAL_WARN

  • The NPU network port is down.
  • The network port between NPUs is not connected.

IMPORTANT_WARN

The NPU network port bandwidth does not reach the baseline value.

FAIL

  • hccn_tool security verification failed.
  • Failed to obtain the NPU network port status.
  • Failed to obtain the NPU network port rate.
  • Failed to obtain the IP address of the NPU network port.
  • Failed to test NPU network port connectivity.
  • Failed to reset the NPU network port.
  • Failed to test NPU network port bandwidth.

SKIP

The product or scenario does not support NIC diagnosis.

stressTest

AICORE diagnosis

PASS

The diagnosis result is normal.

SKIP

The product or scenario does not support AICORE diagnosis.

EMERGENCY_WARN

Emergency warning. You are advised to replace the hardware.

FAIL

On-chip memory stress test

PASS

The on-chip memory stress test is passed.

SKIP

The product or scenario does not support the on-chip memory stress test.

FAIL

  • The on-chip memory stress test fails, and a new multi-bit isolation page is added. You can perform the on-chip memory stress test after on-chip memory diagnosis. For details, see Figure 1.
  • The software fails to be executed.

BandWidthStress

PASS

The stress test is passed, and the result is normal.

SKIP

The product or scenario does not support the P2P stress test.

EMERGENCY_WARN

Emergency warning. The stress test fails. You are advised to replace the hardware.

FAIL

Failed to call the API. Contact Huawei technical support.

Power consumption stress test

PASS

The power consumption stress test result is normal.

SKIP

The product or scenario does not support the power consumption stress test.

IMPORTANT_WARN

A processor alarm is generated during the stress test. Handle the alarm based on the description. If the problem persists, contact Huawei technical support.

FAIL

The power consumption stress test fails. Contact Huawei technical support.

AICPU stress test

PASS

The stress test result is normal.

SKIP

The product or scenario does not support the AICPU stress test.

EMERGENCY_WARN

Emergency warning. Replace the hardware.

FAIL

The AICPU stress test fails. Contact Huawei technical support.
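The command outputs listed above can be ranked so that an inspection script reports only the items that need attention. The following is a minimal sketch: the numeric ranking in SEVERITY is an assumption made for illustration (this section does not define an order between, for example, EMERGENCY_WARN and FAIL), and the function name needs_attention is hypothetical.

```python
# Assumed ranking of the command outputs used in this section.
# Higher numbers are treated as more severe; the order is illustrative only.
SEVERITY = {
    "PASS": 0,
    "HEALTH": 0,
    "SKIP": 1,
    "INFO": 1,
    "WARN": 2,
    "GENERAL_WARN": 2,
    "IMPORTANT_WARN": 3,
    "EMERGENCY_WARN": 4,
    "FAIL": 5,
}


def needs_attention(statuses, minimum="GENERAL_WARN"):
    """Return (item, status) pairs at or above the given level, worst first.

    statuses: mapping of diagnostic item name to its command output.
    Outputs that are not in SEVERITY are ranked like FAIL, so they are
    never silently ignored.
    """
    def rank(status):
        return SEVERITY.get(status, SEVERITY["FAIL"])

    floor = SEVERITY[minimum]
    flagged = [(item, status) for item, status in statuses.items()
               if rank(status) >= floor]
    return sorted(flagged, key=lambda pair: rank(pair[1]), reverse=True)
```

An empty list means that no item reached the chosen level. The descriptions in the table above still determine the action to take for each flagged item.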

Note:

  • The device IDs in this document are processors' logic IDs.
  • In the signal quality diagnosis, if the values of SNR and HEH are 0, no RoCE or HCCS link is established between the specified devices.
Figure 1 On-chip memory stress test and diagnosis
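The two page-count thresholds given for on-chip memory diagnosis can be expressed as a small check. The following is a minimal sketch that covers only those two rules: historical isolation pages in the range [16, 64) and real-time isolation pages greater than or equal to 64. The function name and parameters are hypothetical, the isolation-line rules are not modeled, and a return value of None means that these two rules alone do not decide the result.

```python
def classify_isolation_pages(historical_pages, realtime_pages):
    """Apply the two page-count rules from the on-chip memory diagnosis rows.

    historical_pages: number of historical isolation pages with multi-bit errors.
    realtime_pages:   number of real-time isolation pages with multi-bit errors.
    Returns "EMERGENCY_WARN", "GENERAL_WARN", or None when neither rule applies.
    """
    if realtime_pages >= 64:
        # High risk: the document advises using spare parts.
        return "EMERGENCY_WARN"
    if 16 <= historical_pages < 64:
        # Warning only: normal operation is not affected.
        return "GENERAL_WARN"
    # Not covered by these two rules; other criteria in the table may apply.
    return None
```

For example, 20 historical pages and no real-time pages map to GENERAL_WARN, whereas 64 real-time pages map to EMERGENCY_WARN regardless of the historical count.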