On-Chip Memory Diagnosis

Function

Diagnose the on-chip memory and output the diagnosis result.

Table 1 Diagnostic items

Item | Time Required | Whether NPU Training or Inference Is Affected | Application Scenario
On-chip memory diagnosis | 2–4 seconds | No | An on-chip memory ECC error occurs on the NPU during training or inference.

  • The on-chip memory stress test and on-chip memory diagnosis apply to different scenarios. For details, see Table 1. Perform the on-chip memory stress test or on-chip memory diagnosis as required.
  • If you want to conduct the on-chip memory diagnosis, on-chip memory stress test, and on-chip memory high-risk address stress test at the same time, refer to One-Click On-Chip Memory Stress Test.

Parameters

Table 2 lists only the parameter specific to this test. For details about the common parameters, see Common Parameters.

Table 2 Parameter description

Parameter

Description

Mandatory

[-i, --items]

Specifies the diagnosis check item.

  • Currently, only hbm and chipMemory are supported. hbm and chipMemory cannot be specified at the same time.
    • For the Atlas A2 training products, Atlas A2 inference products, Atlas A3 training products, and Atlas A3 inference products, set the value to hbm.
    • For the Atlas 300I Pro inference card, Atlas 300V video analysis card, Atlas 300V Pro video analysis card, and Atlas 300I Duo inference card, set the value to chipMemory.

Yes
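
The product-to-item rule above can be sketched as a small shell helper. This is only an illustration: the function name dmi_item_for_product and the short product labels (a2, a3, 300i-pro, and so on) are hypothetical and are not identifiers used by ascend-dmi. Only the values hbm and chipMemory come from Table 2.

```shell
# Hypothetical helper: map a product family label to the -i value.
# The labels on the left are made up for this sketch; only the values
# hbm and chipMemory come from Table 2.
dmi_item_for_product() {
    case "$1" in
        a2|a3)
            echo "hbm" ;;
        300i-pro|300v|300v-pro|300i-duo)
            echo "chipMemory" ;;
        *)
            echo "unsupported product: $1" >&2
            return 1 ;;
    esac
}

# Usage on the target host (not executed here):
#   ascend-dmi -dg -i "$(dmi_item_for_product a2)"
```

Because the helper prints exactly one value, it also guards against passing hbm and chipMemory together.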

Example

  • hbm of the Atlas A2 training product

    ascend-dmi -dg -i hbm

    [***@***]# ascend-dmi -dg -i hbm
    Summary:
        Arch: aarch64
        Mode: ******
        Time: 20250529-19:25:25
     
    Hardware:
        hbm:
            PASS
    
  • chipMemory of the Atlas 300I Duo inference card

    ascend-dmi -dg -i chipMemory

    [***@***]# ascend-dmi -dg -i chipMemory
    Summary:
        Arch: aarch64
        Mode: ******
        Time: 20250529-19:25:25
     
    Hardware:
        chipMemory:
            PASS
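
When the diagnosis is run from a script, the status line can be pulled out of saved output. The following is a minimal sketch that assumes the layout shown in the examples above (the item name followed by a colon, then the status on the next non-empty line). The function name dmi_result is hypothetical, and the parsing is not an interface provided by ascend-dmi.

```shell
# Hypothetical helper: print the status reported under a given item
# (hbm or chipMemory) in ascend-dmi -dg output read from stdin.
# Assumes the item name plus a colon appears on one line and the
# status appears on the next non-empty line, as in the examples.
# Prints nothing if the item is not found.
dmi_result() {
    awk -v item="$1" '
        found && NF { print $1; exit }        # first non-empty line after the item
        $1 == (item ":") { found = 1 }        # line such as "hbm:"
    '
}

# Usage on the target host (not executed here):
#   ascend-dmi -dg -i hbm | dmi_result hbm
```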
    

Fault Check Items

Table 3 Fault check items

Command Output

Description

PASS

The on-chip memory check is passed and no exception occurs.

SKIP

The product or scenario does not support on-chip memory diagnosis.

GENERAL_WARN

Historical isolation pages with multi-bit errors exist, and NPU health management fault 0x80E18401 is reported as a warning. If the number of these pages falls within the range [16, 64), normal operation is not affected.

NOTE:

When the diagnostic item is chipMemory, no alarm of this severity is generated.

EMERGENCY_WARN

  • The number of historical isolation pages with multi-bit errors or of lines designated for device isolation is excessive, and NPU health management fault 0x80E18402 is reported as a warning. You are advised to use spare parts.
  • If the number of isolation lines in different banks in the same stack and PC is greater than or equal to 4, the current device is at high risk, and you are advised to use spare parts.
  • If the number of isolation lines within the same stack, sharing the same SID, but in different PCs is greater than or equal to 4, the current device is at high risk, and you are advised to use spare parts.
  • If the number of isolation lines in the same stack, PC, and bank, and sharing the same SID is greater than 16, the current device is at high risk, and you are advised to use spare parts.
  • If, after excluding adjacent faulty addresses (within four bits of each other) that share the same stack, SID, PC, and bank, the number of remaining distinct faulty addresses is greater than 5, the current device is at high risk, and you are advised to use spare parts.
  • If the number of real-time isolation pages with multi-bit errors is greater than or equal to 64, the current device is at high risk, and you are advised to use spare parts.
    NOTE:

    For the Atlas 300I Pro inference card, Atlas 300V video analysis card, Atlas 300V Pro video analysis card, and Atlas 300I Duo inference card, if the NPU fault code is 0x80DF8402 or the number of real-time isolation pages with multi-bit ECC errors is greater than or equal to 64, the current device is at high risk, and you are advised to use spare parts.

FAIL

The on-chip memory check fails. Contact Huawei technical support or locate the fault by referring to the FAQs.
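
The statuses in Table 3 can be handled in a script along the following lines. This is a sketch only: the function name dmi_action and the exit codes 0 to 3 are arbitrary conventions chosen for this example and are not values returned by ascend-dmi. The messages paraphrase the descriptions in Table 3.

```shell
# Hypothetical helper: turn a Table 3 status into a short message and
# an exit code. The exit codes are an arbitrary convention for this
# sketch (0 = nothing to do, 1 = warning, 2 = action needed,
# 3 = status not listed in Table 3).
dmi_action() {
    case "$1" in
        PASS)
            echo "check passed; no exception"; return 0 ;;
        SKIP)
            echo "diagnosis not supported for this product or scenario"; return 0 ;;
        GENERAL_WARN)
            echo "historical isolation pages with multi-bit errors exist"; return 1 ;;
        EMERGENCY_WARN)
            echo "device at high risk; spare parts advised"; return 2 ;;
        FAIL)
            echo "check failed; see the FAQs or contact technical support"; return 2 ;;
        *)
            echo "unknown status: $1" >&2; return 3 ;;
    esac
}
```

A calling script could branch on the exit code, for example to raise an alert only when the code is 2.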