Job-Level Rescheduling

In this mode, all pods of the job are stopped each time a fault occurs. After the pods are re-created and rescheduled, the training job is restarted. This mode is used by default.

For details about the key configuration steps of job-level rescheduling, see Configuring Job-Level Rescheduling.

Restrictions

  • This function is available only in version 6.0.RC2 or later.
  • In a large-scale Kubernetes cluster, the delay before a ConfigMap is mapped into pods is unpredictable. You are advised to use shared storage for the RankTable instead.
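
The shared-storage recommendation above can be illustrated with a minimal Python sketch in which a training launcher waits for the RankTable file to appear on a shared mount before starting. The file path, the JSON layout, and the top-level "status" field with the value "completed" are assumptions made for illustration only; check the actual RankTable format in your deployment before relying on them.

```python
import json
import time
from pathlib import Path


def wait_for_ranktable(path, timeout_s=60.0, poll_s=1.0):
    """Poll a shared-storage path until a complete RankTable file is readable.

    Assumption (not taken from this document): the file is JSON and has a
    top-level "status" field that reads "completed" once fully written.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            table = json.loads(Path(path).read_text(encoding="utf-8"))
            if isinstance(table, dict) and table.get("status") == "completed":
                return table
        except (FileNotFoundError, json.JSONDecodeError):
            pass  # file is missing or only partially written; retry
        if time.monotonic() >= deadline:
            raise TimeoutError(f"RankTable at {path} not ready after {timeout_s}s")
        time.sleep(poll_s)
```

A launcher script could call wait_for_ranktable() with a hypothetical path on the shared mount and start training only after it returns. Polling with a deadline keeps a slow or missing file from blocking the job indefinitely.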

Supported Products and AI Frameworks

Table 1 Products and frameworks that support job-level rescheduling

Product type: Atlas training products
Hardware form:
  • Atlas 800 training server (model 9000)
  • Atlas 800 training server (model 9010)
    NOTE:
    If the processor working mode of the Atlas 800 training server is SMP and each pod is allocated one or two NPUs, rescheduling is not supported. For details about how to query and set the working mode of an NPU, see Querying and Setting the Working Mode of an NPU (npuworkmode) in Atlas 800 Training Server iBMC User Guide (Model 9000).
Training framework:
  • MindSpore
  • TensorFlow
  • PyTorch

Product type: Atlas A2 training products
Hardware form:
  • Atlas 800T A2 training server
  • Atlas 200T A2 Box16 heterogeneous subrack
  • Atlas 900 A2 PoD cluster basic unit
Training framework:
  • MindSpore
  • TensorFlow
  • PyTorch

Product type: Atlas A3 training products
Hardware form:
  • Atlas 900 A3 SuperPoD
  • Atlas 800T A3 SuperPoD Server
Training framework:
  • MindSpore
  • TensorFlow
  • PyTorch

Product type: A200T A3 Box8 SuperPoD Server
Hardware form:
  • A200T A3 Box8 SuperPoD Server
Training framework:
  • MindSpore
  • TensorFlow
  • PyTorch

Rescheduling Principles

If a software or hardware fault occurs during training, the training status becomes abnormal. Job-level rescheduling destroys all training containers of the job, isolates the faulty device, and then re-creates and reschedules the training containers. After the restart, the training process starts again from the beginning, as in an initial training launch.

Figure 1 Principles

The steps in the figure are described as follows:

  1. After a fault is detected, delete all pods and containers of the current job.
  2. Isolate the device where the fault occurs to prevent the device from being used again.
  3. Re-create and schedule training pods and containers.
  4. Restart containers and training processes to resume training.
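
The four steps above can be sketched as a small in-memory simulation in Python. This is only a toy model of the control flow, not the actual scheduler implementation: the Cluster and Job classes, the reschedule_job function, and the device and pod names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical model: device IDs that may still be scheduled.
    healthy_devices: set
    isolated_devices: set = field(default_factory=set)


@dataclass
class Job:
    name: str
    pods: dict  # pod name -> device ID the pod runs on
    restarts: int = 0


def reschedule_job(job, cluster, faulty_device):
    """Toy version of job-level rescheduling after a fault is detected."""
    pod_names = sorted(job.pods)

    # Step 1: delete all pods (and their containers) of the current job.
    job.pods.clear()

    # Step 2: isolate the faulty device so that it is not used again.
    cluster.healthy_devices.discard(faulty_device)
    cluster.isolated_devices.add(faulty_device)

    # Step 3: re-create the pods and schedule them onto healthy devices.
    free = sorted(cluster.healthy_devices)
    if len(free) < len(pod_names):
        raise RuntimeError("not enough healthy devices to reschedule the job")
    job.pods = dict(zip(pod_names, free))

    # Step 4: restart the containers and training processes.
    job.restarts += 1
    return job
```

In this model, a fault on one device causes every pod of the job to be deleted and placed again, including pods that ran on healthy devices. That is what distinguishes job-level rescheduling from finer-grained recovery.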