One-Click On-Chip Memory Stress Test

Function

Ascend DMI supports a one-click on-chip memory stress test: a single command runs the on-chip memory diagnosis, the on-chip memory stress test, and the on-chip memory high-risk address stress test, and then outputs the test results.

Table 1 Diagnostic items

Item: One-click on-chip memory stress test
Time Required: < 1.5 hours
Whether NPU Training or Inference Is Affected: Yes
Application Scenario: During a training or inference job, an on-chip memory ECC error occurs on the NPU and a new isolation page is added.

  • The on-chip memory stress test and on-chip memory diagnosis apply to different scenarios. For details, see Table 1. Perform the on-chip memory stress test or on-chip memory diagnosis as required.

  • To run the on-chip memory diagnosis, on-chip memory stress test, and on-chip memory high-risk address stress test in a single run, use the one-click on-chip memory stress test described in this section.

Parameters

Table 2 lists only test-specific parameters. For details about other common parameters, see Common Parameters.

Table 2 Parameter description

Parameter: [-at, --at, --auto-test]
Description: Performs an automatic stress test. This parameter takes effect only when [-i, --items] contains hbm and [-s, --stress] is specified.
Mandatory: Yes

Parameter: [-st, --st, --stress-time]
Description: Specifies the duration of the on-chip memory stress test. Because the combined stress test command also performs functions such as the on-chip memory diagnosis and the high-risk address stress test, the actual execution time is longer than the specified time.
  • The value ranges from 60 to 604800, in seconds.
  • This parameter must be used together with [-s, --stress] when on-chip memory check items are included.
Mandatory: No
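
The constraints in Table 2 can be checked before the command is run. The following Python sketch builds the argument list for the one-click test and rejects a stress time outside the documented range. The function name build_one_click_command is illustrative and not part of Ascend DMI, and the sketch assumes that the stress time is passed as a separate argument after -st.

```python
import shlex

STRESS_TIME_MIN = 60       # seconds, lower bound from Table 2
STRESS_TIME_MAX = 604800   # seconds, upper bound from Table 2


def build_one_click_command(stress_time=None, quiet=True):
    """Return the argument list for a one-click on-chip memory stress test.

    --auto-test takes effect only together with "-i hbm" and "-s",
    so all three are always emitted together.
    """
    args = ["ascend-dmi", "-dg", "-i", "hbm", "-s", "--auto-test"]
    if stress_time is not None:
        if isinstance(stress_time, bool) or not isinstance(stress_time, int):
            raise TypeError("stress_time must be an integer number of seconds")
        if not STRESS_TIME_MIN <= stress_time <= STRESS_TIME_MAX:
            raise ValueError(
                f"stress_time must be between {STRESS_TIME_MIN} "
                f"and {STRESS_TIME_MAX} seconds"
            )
        # Assumption: the value follows the flag as a separate argument.
        args += ["-st", str(stress_time)]
    if quiet:
        args.append("-q")
    return args


if __name__ == "__main__":
    # Print the command line for a one-hour stress test.
    print(shlex.join(build_one_click_command(stress_time=3600)))
```

The returned list can be passed directly to subprocess.run on a host where ascend-dmi is installed.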

Example

ascend-dmi -dg -i hbm -s --auto-test -q

[***@***]# ascend-dmi -dg -i hbm -s --auto-test -q
Stress test is being performed, please wait.
Summary:
    Arch: aarch64
    Mode: ******
    Time: 20250529-19:08:50

Hardware:
    hbm:
        PASS
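
When the command is run from a script, the result can be read from the output shown above. The following Python sketch extracts the result reported for each item under Hardware. It assumes that the output keeps the indented layout of the sample and that the first line under an item is its result. The names parse_hardware_results and SAMPLE_OUTPUT are illustrative and not part of Ascend DMI.

```python
SAMPLE_OUTPUT = """\
Stress test is being performed, please wait.
Summary:
    Arch: aarch64
    Mode: ******
    Time: 20250529-19:08:50

Hardware:
    hbm:
        PASS
"""


def parse_hardware_results(output):
    """Map each item listed under 'Hardware:' to its first result line."""
    results = {}
    in_hardware = False
    current_item = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not raw.startswith((" ", "\t")):
            # An unindented line starts a new top-level section.
            in_hardware = line == "Hardware:"
            current_item = None
            continue
        if not in_hardware:
            continue
        if line.endswith(":"):
            # An indented line ending in ':' names an item, for example "hbm:".
            current_item = line[:-1]
        elif current_item is not None:
            # Keep only the first line under the item as its result.
            results.setdefault(current_item, line)
    return results


if __name__ == "__main__":
    print(parse_hardware_results(SAMPLE_OUTPUT))
```

A caller can then compare the value for hbm against the results described in Table 3.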

Fault Check Items

Table 3 Parameters in the command output

Parameter: PASS
Description: The one-click on-chip memory stress test is successful and no exception occurs.

Parameter: EMERGENCY_WARN
Description:
  • The number of historical isolation pages with multi-bit errors and of lines designated for device isolation is excessive. 0x80E18402 is generated to warn of an NPU health management fault. You are advised to use spare parts.
  • If the number of isolation lines in different banks within the same stack and PC is greater than or equal to 4, the current device is at high risk, and you are advised to use spare parts.
  • If the number of isolation lines within the same stack that share the same SID but are in different PCs is greater than or equal to 4, the current device is at high risk, and you are advised to use spare parts.
  • If the number of isolation lines that share the same stack, PC, bank, and SID is greater than 16, the current device is at high risk, and you are advised to use spare parts.
  • If, after excluding adjacent faulty addresses within four bits that share the same stack, SID, PC, and bank, the number of remaining distinct addresses is greater than 5, the current device is at high risk, and you are advised to use spare parts.
  • During the stress test, the number of isolated pages increases three consecutive times.

Parameter: SKIP
Description: The product or scenario does not support the one-click on-chip memory stress test.

FAIL