[D2D Bandwidth] Performance Degradation Caused by Enabled ECC

Symptom

During fault diagnosis, Ascend DMI reports an error indicating that the detected bandwidth is lower than the reference value.

Possible Causes

The ECC function of the driver is enabled in the current environment. With ECC enabled, the measured bandwidth falls below the expected value.

Solution

Query the ECC function status of the current driver and, if the status is True, disable the ECC function. To locate and rectify the fault, perform the following steps.

  1. Run the following command to check the current status of the ECC function of the driver.
    npu-smi info -t ecc-enable -i 0

    The -i parameter specifies the ID of the processor to be queried.

  2. If the ECC function status is True, run the following command to disable it.
    npu-smi set -t ecc-enable -i 0 -d 0

    Repeat step 1 to confirm that the status of the ECC function is now False.

  3. Run the following command to perform fault diagnosis again. The command output should indicate that the detected bandwidth is normal.
    ascend-dmi --dg
    Figure 1 Fault diagnostics result
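
The check-and-disable steps above can be combined into a small shell helper. The following is a minimal sketch, not an official tool: it uses only the npu-smi commands shown in the steps, and it assumes that the query output contains the word "True" when ECC is enabled (the exact output format is not shown here and may differ between versions). The function names and the NPU_SMI variable are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch: check the driver ECC status on one processor and disable it if needed.
# ASSUMPTION: the query output contains the word "True" when ECC is enabled.

# Hypothetical override so the helper can be pointed at a stub for a dry run.
NPU_SMI="${NPU_SMI:-npu-smi}"

# Returns 0 if ECC appears to be enabled on the processor with ID $1.
ecc_enabled() {
    "$NPU_SMI" info -t ecc-enable -i "$1" | grep -q "True"
}

# Disables ECC on processor $1 if it is enabled, then queries the status again.
disable_ecc_if_enabled() {
    if ecc_enabled "$1"; then
        "$NPU_SMI" set -t ecc-enable -i "$1" -d 0 || return 1
    fi
    if ecc_enabled "$1"; then
        echo "ECC still enabled on processor $1"
        return 1
    fi
    echo "ECC disabled on processor $1"
}
```

After sourcing these definitions, running `disable_ecc_if_enabled 0 && ascend-dmi --dg` would cover steps 1 to 3 for processor 0. Matching on "True" is deliberately loose; adjust the pattern to the actual query output in your environment.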