On-Chip Memory Stress Test Fails Due to Insufficient Device Memory

Symptom

Ascend DMI fails to run the on-chip memory stress test and displays the message "Error occurred in HBM stress test on device 0". The log file /var/log/ascend-dmi/ascend-dmi.log also records the error message "aclrtMalloc failed, error code: 207001".
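
To confirm this symptom without reading the log by hand, the log can be searched for the allocation error. The following is a minimal sketch, not part of Ascend DMI: the function name has_alloc_failure is hypothetical, and it assumes the error text appears in the log exactly as quoted above.

```shell
#!/bin/sh
# Log path taken from the Symptom section; pass another path as the first
# argument to override it.
LOG_FILE_DEFAULT="/var/log/ascend-dmi/ascend-dmi.log"

# has_alloc_failure [log_file]   (hypothetical helper, not an Ascend DMI command)
# Returns 0 if the log contains the allocation error, nonzero otherwise
# (including when the file is missing or unreadable).
has_alloc_failure() {
    log_file="${1:-$LOG_FILE_DEFAULT}"
    # -F matches the text literally; -q suppresses output and sets only the status.
    grep -Fq "aclrtMalloc failed, error code: 207001" "$log_file" 2>/dev/null
}
```

For example, has_alloc_failure && echo "memory allocation failed" prints the message only when the error is present in the default log file.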

Possible Causes

The device memory is insufficient or already occupied, so the memory allocation (aclrtMalloc) required by the stress test fails.

Solution

  1. Run the npu-smi info command to check whether the memory is used up. If the following information is displayed, the memory is used up.

  2. Wait for the memory to be released, or run the following command to reset the processor and release the memory:
    npu-smi set -t reset -i $i -c 0                # Replace $i with the ID of the device to reset.
    Figure 1 Command example
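
Because a reset interrupts anything still running on the processor, it can help to wrap the command from step 2 so that the device ID is validated and the command is previewed first. The following is a minimal sketch under stated assumptions: reset_npu and the DRY_RUN variable are hypothetical names introduced here, and only the npu-smi command line itself comes from the step above.

```shell
#!/bin/sh
# reset_npu <device_id>   (hypothetical wrapper, not part of npu-smi)
# Prints the reset command from step 2 by default; set DRY_RUN=0 to execute it.
reset_npu() {
    dev_id="$1"
    # Accept only a non-empty string of digits as the device ID.
    case "$dev_id" in
        ''|*[!0-9]*)
            echo "usage: reset_npu <device_id>" >&2
            return 2
            ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Preview only: show the command that would be run.
        echo "npu-smi set -t reset -i $dev_id -c 0"
    else
        npu-smi set -t reset -i "$dev_id" -c 0
    fi
}
```

Calling reset_npu 0 prints the command for device 0, and DRY_RUN=0 reset_npu 0 runs it. Defaulting to a preview is a deliberate choice so that a mistyped device ID does not reset the wrong processor.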