(Optional) Graceful Fault Tolerance

This feature is deprecated. It will not be supported in PyTorch versions later than 7.2.RC1 or MindSpore versions later than 7.1.RC1.

You can enable graceful fault tolerance if no backup resources are available for training jobs or if you expect a faulty device to recover automatically. In this mode, if a processor becomes faulty during training, the system attempts to recover it automatically. If the processor can be recovered, the system restarts the job and training continues while the pod keeps running. If the fault persists, the system falls back to rescheduling mode.

Graceful fault tolerance can automatically recover a faulty device without resource scheduling. However, it does not reduce the time spent on training initialization during recovery. Recovery with graceful fault tolerance generally takes longer than recovery with process-level rescheduling or process-level online recovery.

For details about the key configuration steps of graceful fault tolerance, see Configuring Graceful Fault Tolerance.

Restrictions

Currently, graceful fault tolerance can be used only for processor faults.
Graceful fault tolerance cannot be enabled together with process-level rescheduling or process-level online recovery. If they are enabled at the same time, resumable training is performed through job-level rescheduling.

Supported Products and AI Frameworks

**Table 1** Products and frameworks supported by graceful fault tolerance
Product Type	Product	Training Framework
Atlas training products	Atlas 800 training server (model 9000), Atlas 800 training server (model 9010)	MindSpore, PyTorch
Atlas A2 training products	Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit	MindSpore, PyTorch
Atlas A3 training products	Atlas 900 A3 SuperPoD, Atlas 800T A3 SuperPoD Server	MindSpore, PyTorch

Principles of Graceful Fault Tolerance

If rescheduling is triggered during node or processor fault handling, O&M personnel need to manually restore the faulty device. If devices are not restored in a timely manner, a large number of scattered faults may accumulate in a training cluster, reducing the cluster's computing power utilization. Graceful fault tolerance was therefore added to resumable training to improve how NPUs tolerate certain faults.

Some NPU faults can be rectified by exiting the training processes and performing hot resets on the NPUs. Graceful fault tolerance mode is designed to handle such faults and does not require job rescheduling.

Ascend Device Plugin reports faults and recovers devices. The management process (Elastic Agent for PyTorch, TaskD for MindSpore) stops and restarts training processes based on the information reported by Ascend Device Plugin to complete fault recovery. If a fault cannot be recovered, the system falls back to rescheduling mode. To integrate graceful fault tolerance mode, add a management process to the service container. The management process must be able to detect faults, stop training jobs, and restart training jobs.
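The stop, wait, and restart behavior required of the management process can be sketched as follows. This is a minimal illustration, not the Elastic Agent or TaskD implementation: the function name, the four callbacks, and the timeout values are all hypothetical, and a real management process would obtain device health from the information reported by Ascend Device Plugin.

```python
import time
from typing import Callable


def recover_or_fall_back(
    stop_training: Callable[[], None],
    device_healthy: Callable[[], bool],
    restart_training: Callable[[], None],
    request_rescheduling: Callable[[], None],
    timeout_s: float = 300.0,      # assumed recovery budget, not a documented value
    poll_interval_s: float = 5.0,  # assumed polling period, not a documented value
) -> str:
    """Stop training, wait for the device to recover, then restart or fall back."""
    # The training processes must exit before the device can be recovered.
    stop_training()

    deadline = time.monotonic() + timeout_s
    while True:
        if device_healthy():
            # Device recovered: resume training inside the same, still-running pod.
            restart_training()
            return "restarted"
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_interval_s)

    # The fault persisted past the budget: hand the job back to rescheduling.
    request_rescheduling()
    return "rescheduled"
```

Passing the four operations in as callbacks keeps the control flow independent of how a given framework actually stops processes or queries device state.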

In graceful fault tolerance mode, a fault is reported directly to the management process in the service container, usually through a file mounted into the container. The management process reads this fault file to obtain the specific fault information. Figure 1 shows the process of obtaining the fault information.

Figure 1 Obtaining fault information
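As a rough illustration of the reading side, the sketch below parses a mounted fault file. The file format is not described in this section, so the JSON layout used here (a top-level `faults` list of objects) and the field names in the test data are assumptions made purely for the example, as is the function name `read_fault_file`.

```python
import json
from pathlib import Path
from typing import Any


def read_fault_file(path: Path) -> list[dict[str, Any]]:
    """Return the fault entries found in a mounted fault file.

    Assumed (hypothetical) layout: {"faults": [{...}, {...}]}.
    A missing or partially written file is treated as "no faults yet".
    """
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        # The file may not exist yet, or the writer may be mid-update.
        return []

    faults = data.get("faults") if isinstance(data, dict) else None
    if not isinstance(faults, list):
        return []
    # Keep only well-formed entries so callers can rely on dict access.
    return [entry for entry in faults if isinstance(entry, dict)]
```

Treating an unreadable file as empty lets a polling loop simply retry on the next cycle instead of failing while the file is being rewritten.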

Faults are classified into four types in graceful fault tolerance mode: no handling required, service re-execution required, processor reset required, and rescheduling required. Figure 2 shows how each type is handled.

Figure 2 Fault handling process in graceful fault tolerance mode
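The four fault types can be modeled as an ordered enumeration so that, when several faults are active at once, the most demanding one determines the response. This is only a sketch: the enum names, the severity ordering, and the action strings are assumptions based on the descriptions in this section, not the product's actual fault codes or the exact steps in Figure 2.

```python
from enum import IntEnum


class FaultHandling(IntEnum):
    """The four fault types, ordered by assumed severity."""
    NOT_HANDLED = 0      # no handling required
    RESTART_SERVICE = 1  # service re-execution required
    RESET_PROCESSOR = 2  # processor reset required
    RESCHEDULE = 3       # rescheduling required


# Illustrative action plans; the wording of each step is hypothetical.
ACTIONS: dict[FaultHandling, list[str]] = {
    FaultHandling.NOT_HANDLED: ["continue training"],
    FaultHandling.RESTART_SERVICE: [
        "stop training processes",
        "restart training processes",
    ],
    FaultHandling.RESET_PROCESSOR: [
        "stop training processes",
        "wait for processor hot reset",
        "restart training processes",
    ],
    FaultHandling.RESCHEDULE: [
        "stop training processes",
        "fall back to rescheduling",
    ],
}


def plan_actions(faults: list[FaultHandling]) -> list[str]:
    """Pick the plan for the most severe active fault (none means keep training)."""
    worst = max(faults, default=FaultHandling.NOT_HANDLED)
    return ACTIONS[worst]
```

For example, `plan_actions([FaultHandling.RESTART_SERVICE, FaultHandling.RESET_PROCESSOR])` selects the processor reset plan, because under the assumed ordering a reset subsumes a plain service restart.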
