NPU Environment Restoration

Function

This function resets the Ascend AI Processor through the standard PCIe hot reset process. NPU environment restoration is required in the following scenarios:

  • After the AICORE stress test and diagnosis are complete, the AICORE and bus voltages are abnormal.
  • An NPU is disconnected during AICORE stress testing and diagnosis. That is, the NPU cannot be detected when you run the npu-smi info command to query basic device information. In this case, power off and restart the device, and then restore the NPU environment after the restart.
  • An NPU is disconnected during AICPU stress testing. That is, the NPU cannot be detected when you run the npu-smi info command to query basic device information. In this case, power off and restart the device, and then restore the NPU environment after the restart.

Preparations

Before running the NPU reset command, stop all NPU-related services. You can use fuser to query the processes that occupy the NPU. For details, see Querying NPU Service Processes.
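
The check described above can be sketched as a small shell helper. This is only an illustration: npu_is_idle is a hypothetical name, and the /dev/davinci* device paths in the usage comment are an assumption that should be verified against Querying NPU Service Processes.

```shell
# Hypothetical helper: return 0 if no process has the given device files open,
# 1 if at least one process does, and 2 if fuser is unavailable.
npu_is_idle() {
    if ! command -v fuser >/dev/null 2>&1; then
        echo "fuser not found (it is usually provided by the psmisc package)" >&2
        return 2
    fi
    # fuser exits with 0 when at least one of the named files is being accessed.
    # A non-zero exit also covers fuser errors such as a missing device file,
    # so treat this check as a heuristic rather than a guarantee.
    if fuser "$@" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Usage, assuming the NPU device nodes are /dev/davinci* (verify the paths on your system):
#   npu_is_idle /dev/davinci* || echo "Stop the NPU services before running ascend-dmi -r"
```

The helper only reports whether the devices are in use. Stopping the services themselves is left to the operator.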

Parameters

You can run either of the following commands to view the parameters of the NPU restoration command:

ascend-dmi -r -h

ascend-dmi --reset --help

Table 1 lists only the parameter that is specific to this function. For details about other common parameters, see Common Parameters.

Table 1 Parameter description

Parameter        Description        Mandatory
[-r, --reset]    Resets the NPU.    Yes

Note:

  • For the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, Atlas 900 A2 PoDc cluster basic unit, Atlas 200T A2 Box16/Atlas 200I A2 Box16 heterogeneous subrack, Atlas 800I A2 inference server, and A200I A2 Box heterogeneous component, only the first device in the device list is hot reset. If the hot reset is successful, all NPUs are successfully reset. If the hot reset fails, all NPUs fail to be reset.

Example

ascend-dmi -r -d

[***@***]# ascend-dmi -r -d 0,1,2 -q
Status           : PASS
Message          : Reset server successfully.
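
For scripted use, the Status field can be extracted from output of the form shown above. The following is a minimal sketch: reset_status is a hypothetical helper name, and it assumes that the output keeps the "Status : VALUE" layout of this example.

```shell
# Hypothetical helper: read ascend-dmi reset output on stdin and print
# the value of the "Status" field (for example, PASS).
reset_status() {
    awk -F':' '
        /^[ \t]*Status[ \t]*:/ {
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding whitespace
            print value
            exit
        }
    '
}

# Usage (device IDs 0,1,2 are taken from the example above):
#   status=$(ascend-dmi -r -d 0,1,2 -q | reset_status)
```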

Fault Check Items

Table 2 Parameters in the command output

Parameter    Command Output    Description
Status       PASS              The environment is restored successfully.
Status       SKIP              The product or scenario does not support NPU environment restoration.
Status       FAIL              Failed to restore the environment. The failure causes are as follows:
                                 • Other NPU processes occupy the NPU.
                                 • The device is abnormal (for example, NPU disconnection).
                               NOTE: NPU disconnection: When the npu-smi info command is used to query basic device information, the NPU cannot be detected.
Message      -                 Lists NPU environment restoration details.
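
The status values in Table 2 could be handled in a script roughly as follows. This is a sketch under stated assumptions: handle_reset_status is a hypothetical helper, and its return codes are this example's own convention, not something defined by ascend-dmi.

```shell
# Hypothetical helper: map a Status value from Table 2 to a follow-up hint.
# Return codes (0 = nothing to do, 1 = failed, 2 = unknown) are illustrative only.
handle_reset_status() {
    case "$1" in
        PASS)
            echo "NPU environment restored."
            return 0 ;;
        SKIP)
            echo "Restoration is not supported for this product or scenario."
            return 0 ;;
        FAIL)
            echo "Restoration failed: stop processes that occupy the NPU and check that npu-smi info detects it." >&2
            return 1 ;;
        *)
            echo "Unrecognized status: $1" >&2
            return 2 ;;
    esac
}

# Usage:
#   handle_reset_status FAIL
```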