Overview

The Ascend Device Management Interface (Ascend DMI) runs its checks through the underlying Device Control and Management Interface (DCMI) and the Ascend Computing Language (AscendCL). It also retrieves system information by calling the general-purpose libraries provided by the system. Ascend DMI provides compatibility checks, bandwidth tests, computing power tests, power consumption tests, and diagnosis or stress tests for Atlas hardware products. For details, see Table 1.

Table 1 Functions

| Function | Description | Affects NPU Training or Inference |
| --- | --- | --- |
| Help information viewing | Displays the help information of Ascend DMI. | No |
| Version information viewing | Displays the version information of Ascend DMI. | No |
| Bandwidth test | Measures the bus bandwidth, memory bandwidth, and total time consumption. | Yes |
| P2P bandwidth test on a SuperPoD | Measures the network transmission rate and total time consumption between nodes. | Yes |
| Computing power test | Measures the AI Core computing power of the entire NPU, chip, or server, and the real-time power in full computing power mode. | Yes |
| Power consumption test | Measures the power consumption of the entire NPU. | Yes |
| Real-time device status query | Queries the status of a running device. | No |
| Fault diagnosis | Performs diagnosis or stress tests on software and hardware, and outputs the diagnosis or stress test results. Fault diagnosis covers software (CANN-driver compatibility and driver health diagnosis) and hardware (chip, network health, bandwidth, computing power, on-chip memory, eye diagram, AICORE, NIC, and PRBS stream diagnosis). Stress testing covers hardware (on-chip memory, high-risk address of the on-chip memory, AICORE, P2P, power consumption, AICPU, and DSA stress testing). | Yes for on-chip memory, high-risk address of the on-chip memory, AICORE, power consumption, AICPU, P2P, and DSA stress testing, and for on-chip memory, AICORE, bandwidth, computing power, PRBS stream, and NIC diagnosis. |
| Eye diagram test | Queries the current signal quality. | No |
| Stream test | Checks the communication signal quality of hardware links by sending PRBS streams to and receiving them from the RoCE network port of the NPU. | Yes |
| NPU environment restoration | Resets the Ascend AI Processor through the standard PCIe hot reset process. | Yes |
| Software and hardware compatibility test | Checks hardware and software compatibility based on the obtained hardware information, architecture, driver version, firmware version, and software version. | No |
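The last column of Table 1 can be treated as a lookup when deciding what is safe to run while training or inference jobs are active. The following is a minimal sketch of that idea, not part of Ascend DMI: the dictionary keys are labels chosen for this example (they are not command names or options), and fault diagnosis is conservatively marked as affecting jobs because its impact depends on the selected check item.

```python
# Hypothetical encoding of the last column of Table 1.
# Keys are illustrative labels, not Ascend DMI commands or options.
AFFECTS_JOBS = {
    "help information viewing": False,
    "version information viewing": False,
    "bandwidth test": True,
    "p2p bandwidth test on a superpod": True,
    "computing power test": True,
    "power consumption test": True,
    "real-time device status query": False,
    # Depends on the check item; treated as affecting jobs to stay safe.
    "fault diagnosis": True,
    "eye diagram test": False,
    "stream test": True,
    "npu environment restoration": True,
    "software and hardware compatibility test": False,
}


def affects_jobs(function_name):
    """Return True if the function is listed as affecting NPU jobs.

    Unknown names are also reported as affecting jobs, so a typo never
    results in a disruptive test being treated as safe.
    """
    return AFFECTS_JOBS.get(function_name.strip().lower(), True)


def safe_while_jobs_run(function_names):
    """Keep only the functions that Table 1 lists as not affecting jobs."""
    return [name for name in function_names if not affects_jobs(name)]


if __name__ == "__main__":
    planned = ["Bandwidth test", "Eye diagram test", "Version information viewing"]
    print(safe_while_jobs_run(planned))
```

Defaulting unknown names to "affects jobs" is a design choice of this sketch: it prefers postponing a harmless query over running a disruptive test by mistake.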

  • If an error is reported when a function in Table 1 is used, an error code is written to the corresponding log. To look up the error code, see "Data Types and Operation APIs" > "aclError" in CANN Application Development APIs and DCMI API Return Codes.
  • When using the functions in Table 1, you are advised to wait until the process is complete before performing the next step, and not to terminate the process while it is running.
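The advice to let each process finish before starting the next step can be sketched in a wrapper script. The example below is an illustration under stated assumptions, not an Ascend DMI interface: `run_to_completion` and `run_steps` are hypothetical helper names, and the placeholder command stands in for a real tool invocation, whose options are not covered in this section.

```python
import subprocess
import sys


def run_to_completion(command):
    """Run a command and block until it exits on its own.

    No timeout is passed, so this helper never terminates the process
    early. Returns the exit code and the captured standard output.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout


def run_steps(commands):
    """Run commands strictly one after another.

    The next step starts only after the previous one has finished.
    A non-zero exit code stops the sequence so the log can be checked.
    """
    outputs = []
    for command in commands:
        code, out = run_to_completion(command)
        if code != 0:
            raise RuntimeError(f"step {command!r} exited with code {code}")
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    # Placeholder step; replace it with the actual tool invocation.
    steps = [[sys.executable, "-c", "print('step done')"]]
    print(run_steps(steps))
```

Stopping at the first non-zero exit code leaves room to look up the error code in the log before any further test is started.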