Fault Handling Description

Once fault detection is complete, resumable training can restore the training service through fault handling or tolerance mechanisms across different fault modes, including job-level rescheduling, pod-level rescheduling, process-level rescheduling, elastic training, operator-level online recovery, and process-level online recovery. You can choose the appropriate sub-feature based on your requirements.

Figure 1 Fault handling description

In the figure, Mean Time to Repair (MTTR) represents the duration from fault occurrence to recovery. Success rate measures the effectiveness of fault recovery after an issue arises. Usability evaluates the cost of implementing or integrating a fault policy.

Job-level rescheduling, pod-level rescheduling, and process-level rescheduling support all fault modes supported by resumable training, but depend on backup redundant compute server resources. If there is an unrecoverable hardware fault and no backup redundant compute server, you can configure elastic training to perform scale-in training. Process-level online recovery is applicable to on-chip memory faults and network faults. Operator-level online recovery supports processor network faults and UnifiedBus network faults.
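
The selection rules above can be sketched as a small lookup. This is only an illustration, not a MindCluster API: the `Fault` enum, the function name, and the returned strings are hypothetical. The fault categories are treated as disjoint labels copied from the text, which is a simplification.

```python
from enum import Enum
from typing import List


class Fault(Enum):
    # Hypothetical labels taken from the wording of this section.
    ON_CHIP_MEMORY = "on-chip memory fault"
    NETWORK = "network fault"
    PROCESSOR_NETWORK = "processor network fault"
    UNIFIEDBUS_NETWORK = "UnifiedBus network fault"
    UNRECOVERABLE_HARDWARE = "unrecoverable hardware fault"


RESCHEDULING_LEVELS = [
    "process-level rescheduling",
    "pod-level rescheduling",
    "job-level rescheduling",
]


def applicable_mechanisms(fault: Fault, has_spare_servers: bool) -> List[str]:
    """List the sub-features that this section says can handle the fault."""
    options: List[str] = []
    if fault in (Fault.ON_CHIP_MEMORY, Fault.NETWORK):
        options.append("process-level online recovery")
    if fault in (Fault.PROCESSOR_NETWORK, Fault.UNIFIEDBUS_NETWORK):
        options.append("operator-level online recovery")
    if has_spare_servers:
        # Rescheduling covers all fault modes but needs backup servers.
        options.extend(RESCHEDULING_LEVELS)
    elif fault is Fault.UNRECOVERABLE_HARDWARE:
        # No backup server: continue on fewer resources.
        options.append("elastic training (scale-in)")
    return options


print(applicable_mechanisms(Fault.UNRECOVERABLE_HARDWARE, has_spare_servers=False))
```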

The multi-layer fault handling system of resumable training supports layer-by-layer fallback based on recovery granularity. As shown in Figure 2, if recovery at a higher layer fails, fault handling falls back to the next lower layer.

Figure 2 Recovery failure description
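
The fallback behavior can be sketched as a loop over recovery layers. This is a minimal sketch, not the actual implementation. The layer order used here (finest to coarsest granularity) is an assumption, because the exact order is defined only in Figure 2. The `recover_with_fallback` function and the stub callables are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A layer is a name plus a callable that returns True if recovery succeeded.
Layer = Tuple[str, Callable[[], bool]]


def recover_with_fallback(layers: Sequence[Layer]) -> Optional[str]:
    """Try each layer in order and return the name of the first that succeeds."""
    for name, attempt in layers:
        if attempt():
            return name
        # Recovery at this layer failed: fall back to the next layer.
    return None


# Assumed order; the stubs simulate two failed online recovery attempts.
layers = [
    ("operator-level online recovery", lambda: False),
    ("process-level online recovery", lambda: False),
    ("process-level rescheduling", lambda: True),
    ("pod-level rescheduling", lambda: True),
    ("job-level rescheduling", lambda: True),
]

recovered_by = recover_with_fallback(layers)
print(recovered_by)  # process-level rescheduling
```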

Rescheduling Mode

In rescheduling mode, the job is rescheduled to a healthy processor and the faulty processor is isolated.

By default, job-level rescheduling is used, where all pods are stopped upon each fault. However, in large-scale jobs, the cost of rescheduling after stopping all pods is high, leading to prolonged fault recovery. Therefore, resumable training also supports pod-level rescheduling. When a fault occurs, you can stop only the faulty pods and reschedule a few pods based on the job scale, to quickly rectify the fault. To accelerate recovery and minimize fault impact, resumable training also supports process-level rescheduling and process-level online recovery.

**Table 1** Differences between rescheduling levels
Rescheduling Level	Time Required for Resuming Training	Configuration Procedure	Description
Job-level rescheduling	Job-level rescheduling is time-consuming, with recovery time increasing exponentially as task scale grows.	MindCluster users only need to enable the corresponding configuration item. For details about the key procedure, see Configuring Job-Level Rescheduling.	To further reduce the resource scheduling time during recovery, you can enable pod-level rescheduling together with job-level rescheduling.
Pod-level rescheduling	Pod-level rescheduling reduces resource scheduling time regardless of job scale. However, it does not optimize the time overhead associated with training initialization. As job scale expands, overall recovery time continues to increase superlinearly.	Pod-level rescheduling requires the training container to support process management. MindCluster users can access this feature once they acquire the necessary process management capability. For details about the key procedure, see Configuring Pod-Level Rescheduling.	To further reduce the recovery time during training initialization, you can enable process-level rescheduling together with pod-level rescheduling.
Process-level rescheduling (Process-level recovery)	Process-level rescheduling reduces training initialization time and shortens the overall recovery time, which is independent of, or only weakly related to, the job scale.	Compared to pod-level rescheduling, process-level rescheduling requires integrating high-availability training capabilities into the training framework. MindCluster users must modify the training script and enable the relevant configuration item to use this feature. For details about the key procedure, see Configuring Process-Level Rescheduling.	To address the short mean time between failures (MTBF) in large-scale scenarios and further reduce overall recovery time, you can enable process-level online recovery together with process-level rescheduling.
Process-level online recovery	Process-level online recovery takes less time than process-level rescheduling.	Users need to enable the corresponding configuration item before using this feature. For details about the key procedure, see Configuring Process-Level Online Recovery.	Process-level online recovery is applicable to on-chip memory faults and network faults. In other fault scenarios, use other processing methods.
Operator-level online recovery	-	For details about the key procedure, see Configuring Operator-Level Online Recovery.	-

The rescheduling mode has the following policies:

Direct rescheduling: If a hardware fault that can be detected by the cluster scheduling components occurs during training, the system isolates the faulty node or processor and directly reschedules the job.
Unconditional retry: If a fault that cannot be detected by the cluster scheduling components occurs during training and the job container exits abnormally, the system unconditionally reschedules the job.

**Table 2** Rescheduling policy description
Rescheduling Policy	Description	Supported Fault Type
Direct rescheduling	The system isolates the faulty node or processor and then directly reschedules the corresponding job.	Known node faults or processor faults whose handling level is rescheduling.
Unconditional retry	The system reschedules the job up to the configured number of unconditional retries. Each successful rescheduling decreases the remaining retry count by 1. When the count reaches 0, rescheduling can no longer be triggered. NOTE: To enable unconditional retry, set fault-retry-times in the YAML file. For details, see YAML Parameters.	Faults that cause the job to exit abnormally and the pod status to become Failed, caused by parameter plane network faults or training software faults.
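
The retry counting for the unconditional retry policy can be sketched as follows. This is a simplified model under stated assumptions, not the scheduler's code. The `UnconditionalRetry` class and the `reschedule` callable are hypothetical. Only the `fault-retry-times` parameter name comes from the text above.

```python
from typing import Callable


class UnconditionalRetry:
    """Hypothetical model of the retry budget set by fault-retry-times."""

    def __init__(self, fault_retry_times: int) -> None:
        if fault_retry_times < 0:
            raise ValueError("fault_retry_times must be non-negative")
        self.remaining = fault_retry_times

    def on_abnormal_exit(self, reschedule: Callable[[], bool]) -> bool:
        """Return True if the job was rescheduled after an abnormal exit."""
        if self.remaining == 0:
            # Budget exhausted: rescheduling can no longer be triggered.
            return False
        if reschedule():
            # The count drops by 1 only after a successful rescheduling.
            self.remaining -= 1
            return True
        return False


# Three abnormal exits against a budget of two retries.
budget = UnconditionalRetry(fault_retry_times=2)
results = [budget.on_abnormal_exit(lambda: True) for _ in range(3)]
print(results)  # [True, True, False]
```

In this model, a rescheduling attempt that fails leaves the budget unchanged, which follows the statement that the count decreases after successful rescheduling.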
