Job-Level Rescheduling
When this mode is enabled, all pods of the training job are stopped each time a fault occurs. After the pods are re-created and rescheduled, the training job is restarted. This is the default mode.
For details about the key configuration steps of job-level rescheduling, see Configuring Job-Level Rescheduling.
Restrictions
- This function is available only in version 6.0.RC2 or later.
- In a large-scale Kubernetes cluster, the ConfigMap mapping delay is unpredictable. You are advised to use shared storage for the RankTable instead.
Supported Products and AI Frameworks
Rescheduling Principles
If a software or hardware fault occurs during training, the training status becomes abnormal. Job-level rescheduling first destroys all training containers, isolates the faulty device, and then re-creates and reschedules the training containers. After the containers restart, the training process is launched again in the same way as an initial training launch.
Figure 1 Principles


The steps in the figure are described as follows:
- After a fault is detected, delete all pods and containers of the current job.
- Isolate the device where the fault occurs to prevent the device from being used again.
- Recreate and schedule training pods and containers.
- Restart containers and training processes to resume training.
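The four steps above can be sketched as a small, self-contained simulation. This is only an illustrative model, not the product's implementation: the `Pod` and `Cluster` classes, the `reschedule_job` function, and the device names such as `npu-0` are all hypothetical, and no real Kubernetes API is called.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    device: str           # hypothetical identifier of the device the pod uses
    running: bool = True


@dataclass
class Cluster:
    devices: set
    isolated: set = field(default_factory=set)

    def healthy_devices(self):
        # Devices that have not been isolated, in a stable order.
        return sorted(self.devices - self.isolated)


def reschedule_job(cluster, pods, faulty_device):
    """Simulate one round of job-level rescheduling (illustrative only)."""
    # Step 1: delete all pods of the job, not only the one hit by the fault.
    for pod in pods:
        pod.running = False
    pod_names = [pod.name for pod in pods]

    # Step 2: isolate the faulty device so it is not used again.
    cluster.isolated.add(faulty_device)

    # Step 3: re-create the pods and schedule them onto healthy devices.
    available = cluster.healthy_devices()
    if len(available) < len(pod_names):
        raise RuntimeError("not enough healthy devices to reschedule the job")
    new_pods = [Pod(name, dev) for name, dev in zip(pod_names, available)]

    # Step 4: the new pods start running; training is launched again.
    # This is modelled simply by running=True on each new Pod.
    return new_pods


cluster = Cluster(devices={"npu-0", "npu-1", "npu-2", "npu-3"})
job = [Pod("worker-0", "npu-0"), Pod("worker-1", "npu-1")]
new_job = reschedule_job(cluster, job, faulty_device="npu-1")
print([(p.name, p.device) for p in new_job])
```

In this model, even the healthy `worker-0` pod is stopped and re-created, which is what distinguishes job-level rescheduling from approaches that replace only the faulty pod.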