Pod-Level Rescheduling
If this mode is enabled, only the faulty pods are stopped when a fault occurs. After the faulty pods are re-created and rescheduled, the training job is restarted. If the fault cannot be rectified this way, job-level rescheduling is triggered. Compared with job-level rescheduling, pod-level rescheduling shortens resource scheduling and pod creation because only the faulty pods are rebuilt.
For details about the key configuration steps of pod-level rescheduling, see Configuring Pod-Level Rescheduling.
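The escalation rule described above can be sketched as a small decision function. This is only an illustration, not the scheduler's actual implementation: the names `Level`, `choose_level`, and `pod_recovery_failed` are hypothetical, and the rank-0 check reflects the restriction on the pod whose hccl/rankIndex is 0 listed under Restrictions.

```python
from enum import Enum


class Level(Enum):
    POD = "pod-level"
    JOB = "job-level"


def choose_level(faulty_rank_indexes, pod_recovery_failed=False):
    """Pick a rescheduling level for a set of faulty pods (illustrative only)."""
    # A fault on the pod with rank index 0 goes straight to job-level rescheduling.
    if 0 in faulty_rank_indexes:
        return Level.JOB
    # If pod-level rescheduling could not rectify the fault, escalate.
    if pod_recovery_failed:
        return Level.JOB
    # Otherwise only the faulty pods are stopped and re-created.
    return Level.POD
```

For example, `choose_level({3})` selects pod-level rescheduling, while `choose_level({0, 3})` or `choose_level({3}, pod_recovery_failed=True)` selects job-level rescheduling.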
Restrictions
- When pod-level rescheduling is used for a training job in a large cluster, you are advised to set the open files limit (the maximum number of files that can be opened) to a large value. If the value is too small, pod rescheduling may be abnormal. For example, run the ulimit -n 100000 command to set the limit to 100000.
- If the pod whose hccl/rankIndex annotation is 0 in a training job is faulty, neither pod-level rescheduling nor process-level rescheduling is triggered. Job-level rescheduling is triggered instead.
- Do not use a ConfigMap to mount the RankTable file. Otherwise, job rescheduling may fail.
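As a rough pre-check for the open files restriction, the current limit can be read from inside a process with Python's standard `resource` module (available on Unix-like systems only). The threshold of 100000 is the example value from the restriction above, not a required minimum, and the function names are hypothetical.

```python
import resource

RECOMMENDED_NOFILE = 100000  # example value from the text, not a hard requirement


def limit_is_sufficient(soft_limit, minimum=RECOMMENDED_NOFILE):
    """Return True if a soft open-files limit meets the given minimum."""
    # RLIM_INFINITY means the limit is unlimited, which always suffices.
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= minimum


def current_soft_nofile():
    """Read the soft open-files limit of the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    soft = current_soft_nofile()
    status = "OK" if limit_is_sufficient(soft) else "too low"
    print(f"open files soft limit: {soft} ({status})")
```

The check reports only the limit of the process that runs it. The limit must be set in the environment where the training pods actually run.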
Supported Products and AI Frameworks
Rescheduling Principles
If a software or hardware fault occurs during training, the training status becomes abnormal. Pod-level rescheduling destroys the faulty pods and training containers in the job, instructs the management processes in the other training containers to destroy all training processes, isolates the faulty device, and then reschedules and restarts the training containers. After the containers are restarted, the management processes in all containers are notified to restart the training processes and resume training.
- After a fault is detected, delete faulty pods and containers in the job and destroy all training processes.
- Isolate the device where the fault occurs to prevent the device from being used again.
- Recreate and schedule training pods and containers.
- Restart containers and training processes to resume training.
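The four steps above can be modelled with a toy in-memory simulation. It is a sketch under stated assumptions and does not call any real scheduler or Kubernetes API: `Pod`, `Job`, `pod_level_reschedule`, and the device names are hypothetical, and the faulty device is assumed to be the one the faulty pod was running on.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    device: str
    training_running: bool = True


@dataclass
class Job:
    pods: dict  # pod name -> Pod
    isolated_devices: set = field(default_factory=set)


def pod_level_reschedule(job, faulty_pod_names, free_devices):
    """Simulate pod-level rescheduling on a toy job model (illustrative only)."""
    # Step 1: delete the faulty pods and stop training in the surviving pods.
    removed = [job.pods.pop(name) for name in faulty_pod_names]
    for pod in job.pods.values():
        pod.training_running = False

    # Step 2: isolate the devices of the faulty pods so they are not reused.
    for pod in removed:
        job.isolated_devices.add(pod.device)

    # Step 3: re-create the removed pods on devices that are not isolated.
    candidates = [d for d in free_devices if d not in job.isolated_devices]
    if len(candidates) < len(removed):
        # Assumption: this is where a real system would escalate.
        raise RuntimeError("not enough healthy devices; job-level rescheduling needed")
    for pod, device in zip(removed, candidates):
        job.pods[pod.name] = Pod(pod.name, device, training_running=False)

    # Step 4: restart the training processes in all containers.
    for pod in job.pods.values():
        pod.training_running = True
    return job
```

For instance, with a job of two pods on `npu-0` and `npu-1`, rescheduling the faulty pod on `npu-1` with free devices `["npu-1", "npu-2"]` places the re-created pod on `npu-2`, leaves `npu-1` isolated, and leaves the healthy pod on `npu-0` in place.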