(Optional) Graceful Fault Tolerance
This feature is deprecated. It will not be supported in PyTorch versions later than 7.2.RC1 or MindSpore versions later than 7.1.RC1.
You can enable graceful fault tolerance if no backup resources are available for a training job or if you expect a faulty device to recover automatically. When a processor becomes faulty during training, the system attempts to recover it automatically. If the processor is recovered, the system restarts the job and continues training while the pod keeps running. If the fault persists, the system falls back to rescheduling mode.
Graceful fault tolerance can automatically recover the faulty device without rescheduling resources. However, it does not shorten the training initialization phase of recovery. In general, recovery through graceful fault tolerance takes longer than recovery through process-level rescheduling or process-level online recovery.
For details about the key configuration steps of graceful fault tolerance, see Configuring Graceful Fault Tolerance.
Restrictions
- Currently, graceful fault tolerance can be used only for processor faults.
- Graceful fault tolerance cannot be enabled together with process-level rescheduling or process-level online recovery. If they are enabled at the same time, resumable training is performed through job-level rescheduling instead.
Supported Products and AI Frameworks
| Product Type | Product | Training Framework |
|---|---|---|
| Atlas training products | | |
| Atlas A2 training products | | |
| Atlas A3 training products | | |
Principles of Graceful Fault Tolerance
If rescheduling is triggered during node or processor fault handling, O&M personnel must manually restore the faulty device. If the device is not restored promptly, a large number of scattered faults may build up in the training cluster, reducing the utilization of cluster computing power. Graceful fault tolerance was therefore added to resumable training to improve the fault tolerance of NPUs for certain faults.
These NPU faults can be rectified by exiting the training processes and performing hot resets on the NPUs. The graceful fault tolerance mode is designed to handle such faults and does not require job rescheduling.
Ascend Device Plugin reports faults and recovers devices. The management process (Elastic Agent for PyTorch, TaskD for MindSpore) stops and restarts training processes based on the information reported by Ascend Device Plugin to complete fault recovery. If a fault cannot be recovered, the system falls back to rescheduling mode. To integrate graceful fault tolerance, add a management process to the service container. The management process must be able to detect faults and to stop and restart training jobs.
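The stop, wait, and restart sequence described above can be sketched as a small control loop. This is only an illustration, not the Elastic Agent or TaskD implementation: the function name, the three callbacks, and the timeout and polling values are all hypothetical and would be supplied by your own management process.

```python
import time


def recover_gracefully(stop_training, device_recovered, restart_training,
                       timeout_s=300.0, poll_s=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Attempt in-place recovery; return "resumed" or "reschedule".

    All three callbacks are hypothetical hooks:
      stop_training()     stops the training processes in the container
      device_recovered()  returns True once the device is reported healthy
      restart_training()  starts the training processes again
    """
    stop_training()
    deadline = clock() + timeout_s  # assumed upper bound on the wait
    while clock() < deadline:
        if device_recovered():
            restart_training()
            return "resumed"
        sleep(poll_s)
    # The fault persisted past the deadline: hand over to rescheduling.
    return "reschedule"
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise without real delays.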
In graceful fault tolerance mode, a fault is reported directly to the management process in the service container, usually through a mounted file. The management process reads this fault file to obtain the specific fault information. Figure 1 shows the process of obtaining the fault information.
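As a rough illustration of how a management process might read such a mounted file, the sketch below parses a report into a list of fault records. The format is not specified here, so the JSON layout, the `faults` key, and both function names are assumptions made for this example only; check the actual file that Ascend Device Plugin provides before relying on any of them.

```python
import json
from pathlib import Path


def parse_fault_report(text):
    """Parse report text into a list of fault records (dicts).

    Assumes a hypothetical JSON layout: {"faults": [ {...}, ... ]}.
    """
    if not text.strip():
        return []  # an empty file is treated as "no faults"
    data = json.loads(text)
    return list(data.get("faults", []))


def read_fault_report(path):
    """Read the mounted fault file; a missing file means no faults."""
    file = Path(path)
    if not file.exists():
        return []
    return parse_fault_report(file.read_text(encoding="utf-8"))
```

Separating parsing from file access lets the parsing logic be checked with plain strings, independent of where the file is mounted.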
Faults are classified into four types in graceful fault tolerance mode: no handling required, service reexecution required, processor reset required, and rescheduling required. Figure 2 describes the handling process for each type.
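The four fault types can be modeled as an ordered enumeration that maps each type to a list of actions. The sketch below is one possible shape, not the product's logic: the identifier names, the action strings, and the rule that the most severe type wins when several faults are present are all assumptions for illustration.

```python
from enum import IntEnum


class FaultType(IntEnum):
    # Ordered from least to most disruptive (ordering is an assumption).
    NO_HANDLING = 0
    SERVICE_REEXECUTION = 1
    PROCESSOR_RESET = 2
    RESCHEDULING = 3


# Hypothetical action names for each fault type.
ACTIONS = {
    FaultType.NO_HANDLING: [],
    FaultType.SERVICE_REEXECUTION: ["stop_training", "restart_training"],
    FaultType.PROCESSOR_RESET: [
        "stop_training", "wait_for_processor_reset", "restart_training",
    ],
    FaultType.RESCHEDULING: ["stop_training", "fall_back_to_rescheduling"],
}


def plan_actions(fault_types):
    """Return the actions for the most severe fault type present."""
    if not fault_types:
        return []
    return list(ACTIONS[max(fault_types)])
```

For example, `plan_actions([FaultType.SERVICE_REEXECUTION, FaultType.PROCESSOR_RESET])` returns the processor reset sequence, because that type ranks higher in this sketch.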

