Graceful Fault Tolerance Mode
This section describes how to check the status of a training process that handles faults in graceful fault tolerance mode. In this mode, when a processor fault occurs, the training process exits and graceful fault tolerance processing is then carried out. Once the fault has been rectified, the training process restarts.
Log Description
Training logs of the restarted training process are stored in training_script_path/newlog. The details are as follows:
- QWEN3 (PyTorch) training log: /data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code/alllogs
- QWEN3 (MindSpore) training log: /data/atlas_dls/public/code/QWEN3_for_MS_code/alllogs
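To inspect the latest log after a restart, it can help to locate the most recently modified file in one of the directories above. The following is a minimal sketch, not part of the product: the helper name `newest_log` is hypothetical, and because this section does not describe how files inside `alllogs` are named or nested, the sketch simply searches the directory recursively.

```python
from pathlib import Path
from typing import Optional

# Log directories quoted from the list above.
LOG_DIRS = {
    "pytorch": Path("/data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code/alllogs"),
    "mindspore": Path("/data/atlas_dls/public/code/QWEN3_for_MS_code/alllogs"),
}


def newest_log(log_dir: Path) -> Optional[Path]:
    """Return the most recently modified regular file under log_dir.

    Returns None if the directory does not exist or contains no files.
    The layout inside the directory is an assumption, so every file
    found by a recursive search is considered.
    """
    if not log_dir.is_dir():
        return None
    files = [p for p in log_dir.rglob("*") if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


# Example call (run on a node where the directories are mounted):
#   print(newest_log(LOG_DIRS["pytorch"]))
```

Sorting by modification time rather than by file name avoids depending on a naming scheme that this section does not specify.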
Procedure
- Log in to the management node and run the following command to check the processor status:
npu-smi info
If the command output shows the training process occupying on-chip memory, the training process is running properly.

- After a fault occurs, run the following command to check the processor status:
npu-smi info
If the command output no longer shows the training process and the on-chip memory it occupied has been freed, the training process has exited.

- After the fault is rectified, run the following command to check the processor status:
npu-smi info
If the command output shows the training process occupying on-chip memory again, the training process has been restarted and is running properly.
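The three checks above can also be scripted so that the exit and restart are reported as they happen. The following is a rough sketch under stated assumptions, not a supported tool: it assumes that the text printed by `npu-smi info` contains the name of the training process while that process holds on-chip memory, and it detects the process by a plain substring match. The function names and the process name passed to `watch` are hypothetical.

```python
import subprocess
import time


def process_listed(npu_smi_output: str, process_name: str) -> bool:
    """Return True if any line of the output mentions process_name.

    Assumption: a running training process appears by name somewhere in
    the `npu-smi info` output. No particular table layout is relied on.
    """
    return any(process_name in line for line in npu_smi_output.splitlines())


def describe(previous: bool, current: bool) -> str:
    """Describe the change between two consecutive checks."""
    if previous and not current:
        return "training process exited"
    if current and not previous:
        return "training process started or restarted"
    return "no change"


def read_npu_smi() -> str:
    """Run `npu-smi info` and return its standard output as text."""
    result = subprocess.run(
        ["npu-smi", "info"], capture_output=True, text=True, check=True
    )
    return result.stdout


def watch(process_name: str, interval_s: float = 10.0) -> None:
    """Poll `npu-smi info` and print a message whenever the state changes."""
    previous = process_listed(read_npu_smi(), process_name)
    print("running" if previous else "not running")
    while True:
        time.sleep(interval_s)
        current = process_listed(read_npu_smi(), process_name)
        message = describe(previous, current)
        if message != "no change":
            print(message)
        previous = current
```

Calling `watch("python")` on a node where `npu-smi` is installed would print one line for the initial state and one line for each later transition. Here `"python"` is only a placeholder for whatever name the training process shows in the output. Substring matching is deliberately loose and can produce false positives if another process has a similar name.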
