Overview
The Ascend Device Management Interface (Ascend DMI) performs detection through the underlying Device Control and Management Interface (DCMI) and Ascend Computing Language (AscendCL), and retrieves system information by calling general-purpose libraries provided by the system. Ascend DMI provides compatibility checks, bandwidth tests, computing power tests, power consumption tests, and diagnosis or stress tests for Atlas hardware products. For details, see Table 1.
| Function | Description | Whether NPU Training or Inference Is Affected |
|---|---|---|
| Help information viewing | Displays the help information of Ascend DMI. | No |
| Version information viewing | Displays the version information of Ascend DMI. | No |
| Bandwidth test | Measures the bus bandwidth, memory bandwidth, and total time consumption. | Yes |
| P2P bandwidth test on a SuperPoD | Measures the network transmission rate and total time consumption between nodes. | Yes |
| Computing power test | Measures the computing power of the AI Core in the entire NPU, chip, or server and the real-time power in full computing power mode. | Yes |
| Power consumption test | Measures the power consumption of the entire NPU. | Yes |
| Real-time device status query | Queries the status of the device while it is running. | No |
| Fault diagnosis | Performs diagnosis or stress tests on software and hardware, and outputs the diagnosis or stress test results. The check items are grouped into fault diagnosis items and stress testing items. | NPU training or inference jobs are affected by the following items. Stress testing: on-chip memory, high-risk address of on-chip memory, AICORE, power consumption, AICPU, P2P, and DSA. Diagnosis: on-chip memory, AICORE, bandwidth, computing power, PRBS stream, and NIC. |
| Eye diagram test | Queries the current signal quality. | No |
| Stream test | Checks the communication signal quality of hardware links by sending and receiving PRBS streams to and from the RoCE network port of the NPU. | Yes |
| NPU environment restoration | Resets the Ascend AI Processor through the standard PCIe hot reset process. | Yes |
| Software and hardware compatibility test | Checks hardware and software compatibility based on the obtained hardware information, architecture, driver version, firmware version, and software version. | No |
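
The impact column of Table 1 can be encoded as a lookup so that a scheduling script does not start a disruptive test while training or inference jobs are running. The following is a minimal sketch and not part of Ascend DMI: the dictionary keys are lowercase copies of the function names in the table, `can_run_now` is a hypothetical helper, and fault diagnosis is conservatively treated as affecting jobs because the table lists several of its check items as disruptive.

```python
# Hypothetical lookup built from Table 1. The keys are descriptive labels
# taken from the table, not Ascend DMI command names or options.
AFFECTS_NPU_JOBS = {
    "help information viewing": False,
    "version information viewing": False,
    "bandwidth test": True,
    "p2p bandwidth test on a superpod": True,
    "computing power test": True,
    "power consumption test": True,
    "real-time device status query": False,
    "fault diagnosis": True,  # assumption: treat every check item as disruptive
    "eye diagram test": False,
    "stream test": True,
    "npu environment restoration": True,
    "software and hardware compatibility test": False,
}


def can_run_now(function: str, jobs_running: bool) -> bool:
    """Return True if the function can start without disturbing NPU jobs."""
    key = function.strip().lower()
    if key not in AFFECTS_NPU_JOBS:
        raise ValueError(f"unknown function: {function!r}")
    # A function is blocked only when it affects jobs and jobs are running.
    return not (AFFECTS_NPU_JOBS[key] and jobs_running)
```

For example, `can_run_now("Eye diagram test", jobs_running=True)` returns `True`, while `can_run_now("Stream test", jobs_running=True)` returns `False`.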
- If an error is reported when any of the preceding functions is used, an error code is written to the corresponding log. To look up the error code, see "Data Types and Operation APIs" > "aclError" in CANN Application Development APIs and DCMI API Return Codes.
- When using the preceding functions, you are advised to wait until the current process is complete before performing the next step. Terminating the process during execution is not advised.
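
A wrapper script can follow the last note by starting each command only after the previous one has exited and by never killing a running process. The sketch below is illustrative only: `run_in_order` is a hypothetical helper, and the commands are supplied by the caller, so no specific Ascend DMI options are implied. The demo uses the Python interpreter as a placeholder command.

```python
import subprocess
import sys
from typing import List, Sequence


def run_in_order(commands: Sequence[Sequence[str]]) -> List[int]:
    """Run commands one at a time, waiting for each to finish.

    No timeout is set, so this helper never terminates a running process.
    Execution stops at the first non-zero exit code so that the error code
    in the corresponding log can be inspected before continuing.
    """
    exit_codes = []
    for cmd in commands:
        # subprocess.run blocks until the child process exits.
        result = subprocess.run(list(cmd), check=False)
        exit_codes.append(result.returncode)
        if result.returncode != 0:
            break
    return exit_codes


if __name__ == "__main__":
    # Placeholder commands; replace them with the real tool invocations.
    demo = [[sys.executable, "-c", "pass"], [sys.executable, "-c", "pass"]]
    print(run_in_order(demo))
```

Stopping at the first failure is a design choice for this sketch; a script that should run every step regardless can drop the `break`.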