Training Job Hangs on an Atlas 800 Training Server with Error int_process_hwts_sdma_timeout

Symptom

A training job running on an Atlas 800 training server hangs. Run the following command to collect the driver logs:

msnpureport -f

The error int_process_hwts_sdma_timeout is reported in the log, as shown in the following figure.
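To confirm the symptom without reading every log file by hand, you can search the collected logs for the error string. The following is a minimal sketch, not part of any Ascend tool: the function name find_sdma_timeout is hypothetical, and the directory you pass in is assumed to be wherever msnpureport wrote its output on your system.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Ascend tool): list the collected log files
# that contain the SDMA timeout error.
# Usage: find_sdma_timeout <log_dir>
find_sdma_timeout() {
    local log_dir="$1"
    if [ ! -d "$log_dir" ]; then
        echo "not a directory: $log_dir" >&2
        return 2
    fi
    # -r: search recursively, -l: print only names of matching files,
    # -F: treat the pattern as a fixed string rather than a regex.
    grep -rlF -- "int_process_hwts_sdma_timeout" "$log_dir"
}
```

The function prints one matching file path per line. It returns 0 if at least one file matches, 1 if none does, and 2 if the argument is not a directory.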

Cause Analysis

The NPU chip is working in AMP mode, which HCCL does not support on the Atlas 800 training server.

Solution

Switch the NPU working mode to SMP, and then run the training job again. To switch the mode, run the following commands on the iBMC:

# Power off the server.
ipmcset -d powerstate -v 2
# Query the NPU working mode.
ipmcget -d npuworkmode
# Switch to the SMP mode.
ipmcset -d npuworkmode -v 1
# Power on the server.
ipmcset -d powerstate -v 1

If you do not switch the NPU working mode to SMP but want to keep using the NPU chip, reset the chip or restart the node.

# Reset the chip. Replace id with the device ID and chip_id with the chip ID.
npu-smi set -t reset -i id -c chip_id
# (Optional) Restart the node.
reboot