Process-Level Rescheduling

Each time a fault occurs, this mode stops only the processes on the faulty node and decides, based on the configured policy, whether to move the workload off that node.

  • recover policy: Containers on the faulty node are migrated to a healthy node.
  • recover-in-place policy: Only the faulty processes are restarted; containers on the faulty node are not migrated. This policy can handle only the following two types of faults:
    • Service process exceptions
    • Processor faults at the RestartRequest and RestartBusiness levels.
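
The difference between the two policies can be sketched as a small decision function. This is an illustration only: `Policy`, `FaultKind`, and `plan_action` are hypothetical names, not part of any MindCluster or MindIO API. Only the fault categories are taken from the list above.

```python
from enum import Enum


class Policy(Enum):
    RECOVER = "recover"
    RECOVER_IN_PLACE = "recover-in-place"


class FaultKind(Enum):
    SERVICE_PROCESS_EXCEPTION = "service process exception"
    RESTART_REQUEST = "processor fault, RestartRequest level"
    RESTART_BUSINESS = "processor fault, RestartBusiness level"
    OTHER = "any other fault"


# Fault kinds that the recover-in-place policy can handle (from the list above).
IN_PLACE_FAULTS = {
    FaultKind.SERVICE_PROCESS_EXCEPTION,
    FaultKind.RESTART_REQUEST,
    FaultKind.RESTART_BUSINESS,
}


def plan_action(policy: Policy, fault: FaultKind) -> str:
    """Describe what happens to the faulty node's processes (hypothetical helper)."""
    if policy is Policy.RECOVER:
        return "stop faulty processes and migrate containers to a healthy node"
    if fault in IN_PLACE_FAULTS:
        return "restart faulty processes on the same node"
    return "not handled in place"
```

For example, `plan_action(Policy.RECOVER_IN_PLACE, FaultKind.OTHER)` returns `"not handled in place"`, which reflects that recover-in-place covers only the two listed fault types.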

If process-level rescheduling cannot rectify the fault, pod-level or job-level rescheduling is used. Compared with pod-level rescheduling, this mode reschedules only the faulty processes, which reduces the time spent waiting for inter-process synchronization. In addition, the new HCCL link setup solution significantly shortens link setup time. Checkpoint data is transferred peer-to-peer (P2P) between NPUs over the high-speed network, which eliminates the overhead of saving checkpoints to storage and loading them back.

For details about the key configuration steps of process-level rescheduling, see Configuring Process-Level Rescheduling.

  • Checkpoint transmission over the parameter plane relies on the presence of optimizer replicas on the faulty NPU. If no replica is available, parameters are restored by loading the checkpoint file from storage.
  • Because optimizer replicas consume additional device memory, you can switch to local loading mode when device memory is insufficient. In this mode, parameters are restored directly from the checkpoint file in storage, regardless of whether optimizer replicas exist.
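
The two notes above reduce to a simple selection rule for where parameters are restored from. The following is a minimal sketch of that rule. The function name and its boolean parameters are assumptions made for illustration and do not correspond to real configuration keys.

```python
def choose_restore_source(replica_available: bool, local_loading: bool) -> str:
    """Pick where parameters are restored from (hypothetical helper).

    replica_available: an optimizer replica exists for the faulty NPU's state.
    local_loading:     local loading mode is enabled (for example, because
                       device memory is too small to hold replicas).
    """
    if local_loading:
        # Local loading mode ignores replicas and always reads the checkpoint file.
        return "checkpoint file in storage"
    if replica_available:
        # Replica present: transfer over the parameter plane, no file I/O.
        return "parameter plane"
    # No replica: fall back to the checkpoint file.
    return "checkpoint file in storage"
```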

Restrictions

  • The following version mapping requirements must be met to trigger process-level rescheduling.
    • PyTorch 2.7.1
    • MindSpeed-LLM 2.3.0
  • If the faulty pod is the one whose hccl/rankIndex annotation is 0 in the training job, neither pod-level nor process-level rescheduling is triggered. Job-level rescheduling is triggered instead.
  • This function cannot be enabled together with graceful fault tolerance. If they are enabled at the same time, resumable training will be conducted through job-level rescheduling.
  • In the MindSpore scenario, to ensure the normal use of this mode, install MindSpore and MindIO in the same path.
  • Do not use the ConfigMap to mount the RankTable file. Otherwise, job rescheduling may fail.
  • PyTorch supports only single-operator mode, Megatron-based models, and training jobs of the acjob type.
  • Only single-container migration is supported. Affinity-based migration is not supported.
  • Multimodal models are not supported.
  • The watchdog function is not supported.
  • On Atlas A3 training products, process-level rescheduling may fail if an NPU or the OS is disconnected.
  • If faults occur in the HCCL link setup phase, process-level rescheduling fails. If HCCL link setup is required in other training phases in addition to the training initialization phase, you can set up the link in advance by referring to Configuring HCCL Link Setup to avoid faults during the setup process.
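
Some of these restrictions can be checked before relying on process-level rescheduling. The sketch below covers three of them for the PyTorch scenario: the version mapping, the pod with hccl/rankIndex 0, and graceful fault tolerance. `JobInfo` and `process_level_blockers` are hypothetical names used only for illustration. How the installed versions and the faulty rank index are actually obtained is outside the scope of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Version mapping required to trigger process-level rescheduling (from the list above).
REQUIRED_VERSIONS = {"PyTorch": "2.7.1", "MindSpeed-LLM": "2.3.0"}


@dataclass
class JobInfo:
    """Hypothetical summary of a training job at the moment a fault is reported."""
    installed_versions: Dict[str, str] = field(default_factory=dict)
    faulty_rank_index: int = 1
    graceful_fault_tolerance: bool = False


def process_level_blockers(job: JobInfo) -> List[str]:
    """Return the reasons why process-level rescheduling would not be triggered."""
    blockers = []
    for name, required in REQUIRED_VERSIONS.items():
        if job.installed_versions.get(name) != required:
            blockers.append(f"{name} must be {required}")
    if job.faulty_rank_index == 0:
        blockers.append("pod with hccl/rankIndex 0 is faulty: job-level rescheduling is triggered")
    if job.graceful_fault_tolerance:
        blockers.append("graceful fault tolerance is enabled: job-level rescheduling is used")
    return blockers
```

An empty list means none of the three checked restrictions applies. The remaining restrictions (for example, multimodal models or the watchdog function) still have to be verified separately.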

Supported Products and AI Frameworks

Table 1 Products and frameworks supported by the rescheduling mode

Product Type | Hardware Form | Training Framework
Atlas A2 training products | Atlas 800T A2 training server; Atlas 200T A2 Box16 heterogeneous subrack; Atlas 900 A2 PoD cluster basic unit | MindSpore; PyTorch
Atlas A3 training products | Atlas 900 A3 SuperPoD; Atlas 800T A3 SuperPoD Server | MindSpore; PyTorch

Rescheduling Principles

If a software or hardware fault occurs during training, the training status will become abnormal. Process-level rescheduling destroys the faulty training process or container based on the configured policy, instructs the training processes in other training containers to suspend the current training job, isolates the faulty device, and reschedules and restarts the training container. Once restarted, the training processes in all containers are notified to rebuild collective communication links. After the links are rebuilt, the checkpoint is transferred to the restarted training processes via the parameter plane for parameter restoration. Following this, all processes repeat the current training step to resume training.

Figure 1 Process-level rescheduling

The steps in the figure are described as follows:

  1. After a hardware fault occurs on a device, the detection component of MindCluster installed on the server reports the fault information to ClusterD. A software fault is instead detected by MindIO Controller in the container, which reports it to ClusterD.
  2. ClusterD destroys the job container on the faulty server and reschedules it to the standby server.
  3. ClusterD notifies MindIO Controller on the master node to perform fault tolerance. The process includes stopping training, reporting the global fault, and notifying the recovery policy.
  4. MindIO Controller notifies MindIO Processor in each training process, and MindIO Processor calls PTA to forcibly stop the training process. MindIO Processor clears resources of the normal node, destroys the communicator, and waits for the new process to join after the clearing.
  5. After the management process on the standby server starts the training process, a new MindIO Processor is created. Then, MindIO Controller notifies MindIO Processor in each training process to resume training.
  6. Each process establishes links for collective communication.
  7. NPUs of the normal server transfer the checkpoint data to the standby server through the parameter plane. After the parameter status is restored, the training continues.
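
The ordering of the seven steps can be made explicit with a small tracker that refuses to skip a step. This is only a model of the sequence described above: `Phase` and `RecoveryTracker` are hypothetical names, and no real ClusterD or MindIO call is involved.

```python
from enum import IntEnum


class Phase(IntEnum):
    FAULT_REPORTED = 1              # step 1: fault reported to ClusterD
    CONTAINER_RESCHEDULED = 2       # step 2: job container moved to the standby server
    FAULT_TOLERANCE_NOTIFIED = 3    # step 3: MindIO Controller told to perform fault tolerance
    TRAINING_STOPPED = 4            # step 4: processes stopped, resources cleared
    NEW_PROCESS_JOINED = 5          # step 5: new MindIO Processor created, resume notified
    LINKS_REBUILT = 6               # step 6: collective communication links established
    PARAMETERS_RESTORED = 7         # step 7: checkpoint transferred over the parameter plane


class RecoveryTracker:
    """Records recovery phases and enforces the order given in the steps above."""

    def __init__(self) -> None:
        self.completed = []

    def advance(self, phase: Phase) -> None:
        expected = len(self.completed) + 1
        if int(phase) != expected:
            raise ValueError(
                f"expected step {expected}, got step {int(phase)} ({phase.name})"
            )
        self.completed.append(phase)

    @property
    def training_can_resume(self) -> bool:
        # Training continues only after all seven steps have completed.
        return len(self.completed) == len(Phase)
```

Calling `advance` once for each member of `Phase`, in order, leaves `training_can_resume` set to `True`. Calling it out of order raises `ValueError`.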

Function Adaptation Points

In process-level rescheduling, the cluster brain determines the recovery policy from the global fault information and delivers it to MindIO. The scheduler must be able to schedule only the faulty pods instead of rescheduling the entire job. Recovery policies fall back in sequence: if process-level rescheduling fails, pod-level or job-level rescheduling is triggered.

In the training container, the framework initializes the MindIO service. After the service starts, the optimizer reports its update status to MindIO. A DP replica group and optimizer replicas are then created to keep a redundant backup of the model parameters. Exceptions are captured by a decorator. During recovery, operator resources are cleared, and communication is re-established after the affected node restarts. Recovery itself is implemented through online repair over the parameter plane and status rollback.
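
The sequential fallback between recovery policies can be sketched as a loop over rescheduling levels. The function below is a hypothetical illustration and not the cluster brain's actual logic. The order process, pod, job is an assumption based on the statement that pod-level or job-level rescheduling is triggered when process-level rescheduling fails.

```python
from typing import Callable, Sequence, Tuple


def recover_with_fallback(levels: Sequence[Tuple[str, Callable[[], bool]]]) -> str:
    """Try each rescheduling level in order and return the first that succeeds.

    Each entry pairs a level name with a callable that returns True on success.
    """
    for name, attempt in levels:
        if attempt():
            return name
    raise RuntimeError("all rescheduling levels failed")


# Example: process-level rescheduling fails, so pod-level rescheduling takes over.
chosen = recover_with_fallback([
    ("process-level", lambda: False),
    ("pod-level", lambda: True),
    ("job-level", lambda: True),
])
```

Here `chosen` is `"pod-level"`, because the job-level attempt is never reached once an earlier level succeeds.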

Users who do not use MindSpeed-LLM or MindCluster need to adapt the functions listed in Table 2.

Table 2 Functions adapted for process-level rescheduling

Function | Description | Adapted Component | Reference Link
Startup during initialization | The MindIO service is started when the training framework is initialized. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Optimizer update status reporting | Before and after the optimizer update, the start and end of the update are reported. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
DP replica group creation | The creation logic of dp_cp/dp_ep replica groups and gloo groups is added. The replica groups are created after the native Megatron distributed parallel groups are created. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Optimizer replica | The functions of the native Megatron optimizer are inherited, with MindIO optimizer replica management logic embedded. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Exception capture decorator | A decorator wraps the train function to capture fault modes. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Operator resource clearing | A callback function is used to clear operator resources. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Node restart and communication re-establishment | A rebuild callback function is registered to rebuild the communicator between the healthy node and the faulty node. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Online parameter plane repair | A callback function is used to restore replica and recovery ranks. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Status rollback | A callback function is used to rebuild data iterators and reset framework variables. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework
Recovery policy decision | A recovery policy is determined based on the global fault information and delivered to MindIO. Recovery policy rollback is supported. If process-level rescheduling fails, pod- or job-level rescheduling is triggered. | AI platform | See here.
Scheduling of faulty pods | Faulty pods are scheduled, and scheduling policy rollback is supported. | AI platform | See here.
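
To show how the framework-side adaptation points fit together, the sketch below combines an exception capture decorator with callbacks for clearing, rebuilding, repairing, and rolling back. Every name here (`CALLBACKS`, `register`, `capture_faults`, the stage names) is hypothetical and does not reflect the MindIO interface. The single in-process retry merely stands in for repeating the current training step after recovery.

```python
import functools
from typing import Callable, Dict, List

# Recovery stages, in the order they run in this sketch.
STAGES = ("clean", "rebuild", "repair", "rollback")

# Hypothetical registry: stage name -> callbacks registered by the framework.
CALLBACKS: Dict[str, List[Callable[[], None]]] = {stage: [] for stage in STAGES}


def register(stage: str) -> Callable:
    """Register a callback for one recovery stage."""
    def decorator(fn: Callable[[], None]) -> Callable[[], None]:
        CALLBACKS[stage].append(fn)
        return fn
    return decorator


def capture_faults(train_fn: Callable) -> Callable:
    """Wrap the train function: on a fault, run all callbacks, then retry once."""
    @functools.wraps(train_fn)
    def wrapper(*args, **kwargs):
        try:
            return train_fn(*args, **kwargs)
        except RuntimeError:
            # clean:    clear operator resources
            # rebuild:  rebuild the communicator
            # repair:   restore parameters over the parameter plane
            # rollback: rebuild data iterators, reset framework variables
            for stage in STAGES:
                for callback in CALLBACKS[stage]:
                    callback()
            return train_fn(*args, **kwargs)
    return wrapper
```

A framework would decorate its train function with `@capture_faults` and attach its own handlers with `@register("clean")`, `@register("rebuild")`, and so on. In a real deployment the retry would be coordinated by MindIO Controller rather than performed locally.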