Job-Level Rescheduling

In this mode, all pods of the job are stopped each time a fault occurs. After the pods are re-created and rescheduled, the training job is restarted. This mode is used by default.

For details about the key configuration steps of job-level rescheduling, see Configuring Job-Level Rescheduling.

Restrictions

This function is available only in 6.0.RC2 or later.
In a large-scale Kubernetes cluster, the ConfigMap mapping delay cannot be controlled. You are advised to use shared storage for the RankTable instead.

Supported Products and AI Frameworks

**Table 1** Products and frameworks that support job-level rescheduling
Product Type	Hardware Form	Training Framework
Atlas training products	Atlas 800 training server (model 9000); Atlas 800 training server (model 9010)	MindSpore; TensorFlow; PyTorch
Atlas A2 training products	Atlas 800T A2 training server; Atlas 200T A2 Box16 heterogeneous subrack; Atlas 900 A2 PoD cluster basic unit	MindSpore; TensorFlow; PyTorch
Atlas A3 training products	Atlas 900 A3 SuperPoD; Atlas 800T A3 SuperPoD Server	MindSpore; TensorFlow; PyTorch
A200T A3 Box8 SuperPoD Server	A200T A3 Box8 SuperPoD Server	MindSpore; TensorFlow; PyTorch

NOTE: If the processor working mode of the Atlas 800 training server is SMP and each pod is allocated one or two NPUs, rescheduling is not supported. For details about how to query and set the working mode of an NPU, see Querying and Setting the Working Mode of an NPU (npuworkmode) in Atlas 800 Training Server iBMC User Guide (Model 9000).

Rescheduling Principles

If a software or hardware fault occurs during training, the training status becomes abnormal. Job-level rescheduling first destroys all training containers, isolates the faulty device, and then re-creates and schedules the training containers. After the restart, the training process starts from the beginning, in the same way as an initial training launch.

Figure 1 Principles

The steps in the figure are described as follows:

1. After a fault is detected, delete all pods and containers of the current job.
2. Isolate the device where the fault occurs to prevent the device from being used again.
3. Re-create and schedule training pods and containers.
4. Restart containers and training processes to resume training.
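
The four steps above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the Cluster class, the reschedule_job function, and the device and pod names are hypothetical and do not correspond to any real Kubernetes or product API. It shows that every pod of the job is removed, not only the pod on the faulty device, and that the faulty device is excluded when pods are scheduled again.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory model of a cluster (not a real scheduler API)."""
    devices: set[str]                                    # all devices in the cluster
    isolated: set[str] = field(default_factory=set)      # devices excluded from scheduling
    pods: dict[str, str] = field(default_factory=dict)   # pod name -> device

    def schedule(self, pod_names):
        """Place each pod on a free device that is not isolated."""
        free = sorted(self.devices - self.isolated - set(self.pods.values()))
        if len(free) < len(pod_names):
            raise RuntimeError("not enough healthy devices to schedule the job")
        for name, device in zip(pod_names, free):
            self.pods[name] = device


def reschedule_job(cluster, faulty_device):
    """Apply the four job-level rescheduling steps to the simulated cluster."""
    pod_names = sorted(cluster.pods)
    # Step 1: delete all pods of the job, not only the one on the faulty device.
    cluster.pods.clear()
    # Step 2: isolate the faulty device so that it is not used again.
    cluster.isolated.add(faulty_device)
    # Step 3: re-create and schedule the pods on the remaining healthy devices.
    cluster.schedule(pod_names)
    # Step 4: the training processes would be restarted here (omitted in this simulation).
    return cluster.pods


cluster = Cluster(devices={"npu-0", "npu-1", "npu-2"})
cluster.schedule(["worker-0", "worker-1"])
print(reschedule_job(cluster, faulty_device="npu-0"))
```

In this simulation, both workers are removed and placed again even though only one of them ran on the faulty device, which is the defining property of job-level rescheduling.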
