Pod-Level Rescheduling

When this mode is enabled, only the faulty pods are stopped when a fault occurs. After the faulty pods are re-created and rescheduled, the training job is restarted. If the fault cannot be rectified in this way, job-level rescheduling is triggered. Compared with job-level rescheduling, pod-level rescheduling reduces the time spent on resource scheduling and pod creation.
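
The escalation from pod level to job level can be pictured with a minimal sketch. It is an illustration only, not the product's implementation: the recreate_pod callback and the returned level names are hypothetical and stand in for whatever the scheduler actually does.

```python
from typing import Callable, Iterable


def reschedule(faulty_pods: Iterable[str],
               recreate_pod: Callable[[str], bool]) -> str:
    """Try pod-level rescheduling first and escalate to job level on failure.

    recreate_pod is a hypothetical callback that returns True when a faulty
    pod was successfully re-created and rescheduled.
    """
    for pod in faulty_pods:
        if not recreate_pod(pod):
            # The fault could not be rectified at pod level.
            return "job-level"
    # Every faulty pod was replaced, so only those pods were touched.
    return "pod-level"
```

In this sketch the healthy pods are never passed to the function, which mirrors the point that only faulty pods are stopped unless escalation is needed.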

For details about the key configuration steps of pod-level rescheduling, see Configuring Pod-Level Rescheduling.

Restrictions

  • When pod-level rescheduling is used for a training job in a large cluster, you are advised to set the open files parameter (maximum number of files that can be opened) to a large value. If the value is too small, pod rescheduling may be abnormal. For example, run the ulimit -n 100000 command to set open files to 100000.
  • If the faulty pod is the one whose hccl/rankIndex annotation is 0, neither pod-level rescheduling nor process-level rescheduling is triggered. Job-level rescheduling is triggered instead.
  • Do not use a ConfigMap to mount the RankTable file. Otherwise, job rescheduling may fail.
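
Two of the restrictions above can be checked programmatically. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the 100000 threshold is taken from the ulimit example above, and the annotation value is assumed to be the string "0" because Kubernetes annotation values are strings. The resource module is part of the Python standard library and is available on Unix-like systems only.

```python
import resource
from typing import Dict

# Annotation key named in the restrictions above.
RANK_INDEX_ANNOTATION = "hccl/rankIndex"


def open_files_limit_ok(minimum: int = 100000) -> bool:
    """Return True if the soft open-files limit of this process is at least `minimum`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means the limit is unlimited.
    return soft == resource.RLIM_INFINITY or soft >= minimum


def pod_level_allowed(faulty_pod_annotations: Dict[str, str]) -> bool:
    """Return False when the faulty pod carries rank index 0.

    Assumption: a pod without the annotation is treated as eligible.
    """
    return faulty_pod_annotations.get(RANK_INDEX_ANNOTATION) != "0"
```

The first function reports the limit of the process that runs it, so it reflects the same value that ulimit -n shows in the shell that launched that process.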

Supported Products and AI Frameworks

Table 1 Products and frameworks supported by the rescheduling mode

Product Type: Atlas training products
Hardware Form:
  • Atlas 800 training server (model 9000)
  • Atlas 800 training server (model 9010)
    NOTE:
    If the processor working mode of the Atlas 800 training server is SMP and each pod is allocated one or two NPUs, rescheduling is not supported. For details about how to query and set the working mode of an NPU, see Querying and Setting the Working Mode of an NPU (npuworkmode) in Atlas 800 Training Server iBMC User Guide (Model 9000).
Training Framework:
  • MindSpore
  • TensorFlow
  • PyTorch

Product Type: Atlas A2 training products
Hardware Form:
  • Atlas 800T A2 training server
  • Atlas 200T A2 Box16 heterogeneous subrack
  • Atlas 900 A2 PoD cluster basic unit
Training Framework:
  • MindSpore
  • TensorFlow
  • PyTorch

Product Type: Atlas A3 training products
Hardware Form:
  • Atlas 900 A3 SuperPoD
  • Atlas 800T A3 SuperPoD Server
Training Framework:
  • MindSpore
  • TensorFlow
  • PyTorch

Product Type: A200T A3 Box8 SuperPoD Server
Hardware Form:
  • A200T A3 Box8 SuperPoD Server
Training Framework:
  • MindSpore
  • TensorFlow
  • PyTorch

Rescheduling Principles

If a software or hardware fault occurs during training, the training status becomes abnormal. Pod-level rescheduling destroys the faulty pods and their training containers, instructs the management processes in the other training containers to destroy all training processes, isolates the faulty device, and then reschedules and restarts the training containers. Once the containers have restarted, the management processes in all containers are notified to restart their training processes and resume training.

  1. After a fault is detected, delete faulty pods and containers in the job and destroy all training processes.
  2. Isolate the device where the fault occurs to prevent the device from being used again.
  3. Recreate and schedule training pods and containers.
  4. Restart containers and training processes to resume training.
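
The four steps above can be traced with a small in-memory simulation. This is a sketch for illustration only and does not call any real scheduler API: the Job class, the pod and device names, and the candidate_devices parameter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Job:
    """Toy model of a training job (hypothetical, for illustration only)."""
    pods: Dict[str, str]  # pod name -> ID of the device it runs on
    isolated_devices: Set[str] = field(default_factory=set)
    training_running: bool = True


def pod_level_reschedule(job: Job, faulty_device: str,
                         candidate_devices: List[str]) -> List[str]:
    """Apply the four steps to `job` and return a log with one entry per step."""
    steps: List[str] = []

    # Step 1: delete the faulty pods and destroy all training processes.
    faulty_pods = sorted(p for p, d in job.pods.items() if d == faulty_device)
    for pod in faulty_pods:
        del job.pods[pod]
    job.training_running = False
    steps.append("deleted faulty pods and stopped training processes")

    # Step 2: isolate the faulty device so that it is not used again.
    job.isolated_devices.add(faulty_device)
    steps.append("isolated " + faulty_device)

    # Step 3: re-create the pods on devices that are healthy and free.
    in_use = set(job.pods.values())
    healthy = [d for d in candidate_devices
               if d not in job.isolated_devices and d not in in_use]
    if len(healthy) < len(faulty_pods):
        # In this sketch, a lack of healthy devices means pod-level recovery fails.
        raise RuntimeError("not enough healthy devices for pod-level rescheduling")
    for pod, device in zip(faulty_pods, healthy):
        job.pods[pod] = device
    steps.append("re-created and scheduled pods")

    # Step 4: restart the training processes to resume training.
    job.training_running = True
    steps.append("restarted training")
    return steps
```

For example, with pods worker-0 on dev-a and worker-1 on dev-b, a fault on dev-b, and candidates dev-b and dev-c, the sketch moves worker-1 to dev-c and leaves worker-0 where it was.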