On-Chip Memory Stress Test

Function

Perform a stress test on the on-chip memory and output the diagnosis result.

Table 1 Diagnostic items

  • Item: On-chip memory stress test
  • Diagnosis Duration (Atlas Inference Products): 6–7 hours
  • Diagnosis Duration (Other Products): < 1 hour
  • Whether NPU Training or Inference Is Affected: Yes
  • Application Scenario: The stress test is performed before a training or inference job is rolled out, or when an ECC error on the NPU on-chip memory is detected during job execution.

  • The on-chip memory stress test and on-chip memory diagnosis apply to different scenarios. For details, see Table 1. Perform the on-chip memory stress test or on-chip memory diagnosis as required.
  • If you want to conduct the on-chip memory diagnosis, on-chip memory stress test, and on-chip memory high-risk address stress test at the same time, refer to One-Click On-Chip Memory Stress Test.

Parameters

Table 2 lists only test-specific parameters. For details about other common parameters, see Common Parameters.

Table 2 Parameter description

Parameter: [-i, --items]

Mandatory: Yes

Description: Specifies the diagnosis check item.

  • Currently, only hbm and chipMemory are supported. hbm and chipMemory cannot be specified at the same time.
    • For the Atlas A2 training products, Atlas A2 inference products, Atlas A3 training products, and Atlas A3 inference products, set the value to hbm.
    • For the Atlas 300I Pro inference card, Atlas 300V video analysis card, Atlas 300V Pro video analysis card, and Atlas 300I Duo inference card, set the value to chipMemory.

Parameter: [-st, --st, --stress-time]

Mandatory: No

Description: Specifies the duration of the on-chip memory stress test.

  • The value ranges from 60 to 604800, in seconds.
  • This parameter must be used together with [-s, --stress] when on-chip memory check items are included.
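
The constraints in Table 2 can be checked before the tool is started. The following is a minimal sketch, assuming a hypothetical helper named build_stress_command (it is not part of ascend-dmi). It uses only the options that appear in this section (-dg, -i, -s, -st, -q) and only builds the argument list; it does not run the tool.

```python
# Hypothetical helper, not part of ascend-dmi: checks the Table 2 constraints
# and builds the argument list for an on-chip memory stress test.
VALID_ITEMS = ("hbm", "chipMemory")
MIN_STRESS_SECONDS = 60
MAX_STRESS_SECONDS = 604800  # 7 days, expressed in seconds


def build_stress_command(item, stress_seconds=None, add_q_flag=True):
    """Return the ascend-dmi argument list for one check item."""
    # Only one item is accepted, because hbm and chipMemory
    # cannot be specified at the same time.
    if item not in VALID_ITEMS:
        raise ValueError(f"item must be one of {VALID_ITEMS}, got {item!r}")

    # -st is only valid together with -s, so -s is always added first.
    args = ["ascend-dmi", "-dg", "-i", item, "-s"]

    if stress_seconds is not None:
        if not MIN_STRESS_SECONDS <= stress_seconds <= MAX_STRESS_SECONDS:
            raise ValueError(
                f"stress time must be between {MIN_STRESS_SECONDS} and "
                f"{MAX_STRESS_SECONDS} seconds, got {stress_seconds}"
            )
        args += ["-st", str(stress_seconds)]

    # -q is used in this section's examples; see Common Parameters for its meaning.
    if add_q_flag:
        args.append("-q")
    return args
```

Under these assumptions, build_stress_command("hbm", 60) produces the same arguments as the hbm command in the Example section, and a duration below 60 or above 604800 is rejected before anything is executed.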

Example

  • Example of an hbm test on the Atlas 800I A2 inference server with the test duration set to 60 seconds:

    ascend-dmi -dg -i hbm -s -st 60 -q

    [***@***]# ascend-dmi -dg -i hbm -s -st 60 -q
    Stress test is being performed, please wait.
    Summary:
        Arch: aarch64
        Mode: ******
        Time: 20250529-19:36:47
     
    Hardware:
        hbm:
            PASS
    
  • Example of a chipMemory test on the Atlas 300I Duo inference card with the test duration set to 60 seconds:

    ascend-dmi -dg -i chipMemory -s -st 60 -q

    [***@***]# ascend-dmi -dg -i chipMemory -s -st 60 -q
    Stress test is being performed, please wait.
    Summary:
        Arch: aarch64
        Mode: ******
        Time: 20250529-19:25:25
     
    Hardware:
        chipMemory:
            PASS
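
When the stress test is run from a script, the result under Hardware can be read from the captured output. The following is a minimal sketch, assuming the output keeps the layout shown in the two examples above (an item name ending in a colon, followed by PASS, SKIP, or FAIL on a later line). The function name parse_hardware_results is hypothetical, and a FAIL report may contain extra detail lines that are not shown in this section; the sketch ignores any line it does not recognize.

```python
KNOWN_RESULTS = ("PASS", "SKIP", "FAIL")


def parse_hardware_results(output):
    """Map each item listed under 'Hardware:' to its reported result."""
    results = {}
    in_hardware = False
    current_item = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "Hardware:":
            # Everything before this header (the Summary block) is skipped.
            in_hardware = True
            continue
        if not in_hardware:
            continue
        if line.endswith(":"):
            # An item header such as "hbm:" or "chipMemory:".
            current_item = line[:-1]
        elif current_item is not None and line in KNOWN_RESULTS:
            results[current_item] = line
            current_item = None
    return results


# Sample text copied from the hbm example in this section.
SAMPLE_OUTPUT = """Stress test is being performed, please wait.
Summary:
    Arch: aarch64
    Mode: ******
    Time: 20250529-19:36:47

Hardware:
    hbm:
        PASS
"""

print(parse_hardware_results(SAMPLE_OUTPUT))  # {'hbm': 'PASS'}
```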
    

Fault Check Items

Table 3 Fault check items

Command Output: PASS

Description: The on-chip memory stress test passed.

Command Output: SKIP

Description: The product or scenario does not support the on-chip memory stress test.

Command Output: FAIL

Description:

  • The on-chip memory stress test fails, and a new multi-bit isolation page is added. You can perform the on-chip memory stress test after on-chip memory diagnosis. For details, see Figure 1.
  • The software fails to be executed.

Refer to On-Chip Memory Stress Test Fails Due to Insufficient Device Memory.

Figure 1 On-chip memory stress test and diagnosis
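
The outcomes in Table 3 can be collapsed into a single status for automated checks. The following is a minimal sketch, assuming a hypothetical helper named overall_result that receives a mapping from check item to result. The precedence FAIL, then PASS, then SKIP is an assumption made for illustration and is not defined by the tool.

```python
# Descriptions summarized from Table 3.
RESULT_DESCRIPTIONS = {
    "PASS": "The on-chip memory stress test passed.",
    "SKIP": "The product or scenario does not support the on-chip memory stress test.",
    "FAIL": "The stress test failed (new multi-bit isolation page) "
            "or the software failed to be executed.",
}


def overall_result(results):
    """Collapse per-item results into one status (assumed order: FAIL > PASS > SKIP)."""
    if not results:
        raise ValueError("no results to evaluate")
    values = set(results.values())
    unknown = values - set(RESULT_DESCRIPTIONS)
    if unknown:
        raise ValueError(f"unexpected result values: {sorted(unknown)}")
    if "FAIL" in values:
        return "FAIL"
    if "PASS" in values:
        return "PASS"
    return "SKIP"


status = overall_result({"hbm": "PASS"})
print(status, "-", RESULT_DESCRIPTIONS[status])
```

A script could, for example, exit with a nonzero code when the returned status is FAIL and treat SKIP as a signal that the product does not support the test.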