Hot Switching

After the hotSwitch policy is configured for a training job, a subhealth fault triggers the following sequence: a backup node is started, the training process is paused, and the training job is then resumed on the backup node.

Restrictions

  • For PyTorch, this function must be used with MindSpeed-LLM 2.3.0. For details about the version mapping, see MindSpeed-LLM.
  • For MindSpore, this function must be used with MindFormers master. For details about the version mapping, see MindSpore MindFormers.
  • Only the PyTorch single-operator mode, Megatron-based models, and training jobs of the acjob type are supported.
  • For MindSpore, MindSpore and MindIO must be installed in the same path for this function to work properly.
  • Multimodal models are not supported.
  • Watchdog cannot be enabled.
  • If hot switching is triggered before the current training iteration ends, MindIO may be blocked, which triggers job-level rescheduling.
  • Hot switching is not supported if the pod annotated with hccl/rankIndex=0 in a training job is subhealthy.
  • If any of the following exceptions occurs, job-level rescheduling is triggered, and the subhealth node handling policy is downgraded to ignore, meaning that subhealth faults are not handled.
    • After the backup pod is started, training fails to be paused.
    • After the backup pod is started, MindCluster times out after waiting 15 minutes for the training suspension status report.
    • The backup pod fails to run.
    • After the original pod is deleted, training fails to be resumed.
    • After the original pod is deleted, MindCluster times out after waiting 15 minutes for the training recovery status report.
  • After the hotSwitch policy is configured, the process-level recovery option is automatically added. If a non-subhealth fault occurs, process-level recovery is triggered.
  • If no backup node is available, hot switching cannot be completed. In this case, the subhealth fault handling policy is downgraded to ignore, and subhealth faults are not handled.
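
The fallback rules in the restrictions above can be summarized as a small decision function. This is a minimal sketch for illustration only, not MindCluster code: the `SwitchAttempt` record, its field names, the `Outcome` values, and the order in which the conditions are checked are all assumptions. Only the conditions themselves and the 15-minute timeout come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    HOT_SWITCH = "hot switching completes"
    NOT_SUPPORTED = "hot switching is not supported"
    DOWNGRADE_TO_IGNORE = "policy downgraded to ignore"
    RESCHEDULE_AND_IGNORE = "job-level rescheduling, policy downgraded to ignore"


@dataclass
class SwitchAttempt:
    """Hypothetical record of one hot-switch attempt (field names are illustrative)."""
    faulty_rank_index: int               # hccl/rankIndex annotation of the subhealthy pod
    backup_node_available: bool = True
    backup_pod_running: bool = True
    pause_succeeded: bool = True
    pause_report_wait_min: float = 0.0   # minutes spent waiting for the suspension report
    resume_succeeded: bool = True
    resume_report_wait_min: float = 0.0  # minutes spent waiting for the recovery report


REPORT_TIMEOUT_MIN = 15  # timeout stated in the restrictions


def classify(attempt: SwitchAttempt) -> Outcome:
    """Map an attempt to the outcome described in the restrictions.

    The check order is an assumption; the source does not specify it.
    """
    if attempt.faulty_rank_index == 0:
        return Outcome.NOT_SUPPORTED
    if not attempt.backup_node_available:
        return Outcome.DOWNGRADE_TO_IGNORE
    failed = (
        not attempt.backup_pod_running
        or not attempt.pause_succeeded
        or attempt.pause_report_wait_min >= REPORT_TIMEOUT_MIN
        or not attempt.resume_succeeded
        or attempt.resume_report_wait_min >= REPORT_TIMEOUT_MIN
    )
    return Outcome.RESCHEDULE_AND_IGNORE if failed else Outcome.HOT_SWITCH
```

For example, `classify(SwitchAttempt(faulty_rank_index=3, backup_node_available=False))` yields `Outcome.DOWNGRADE_TO_IGNORE`, which matches the last restriction.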

Supported Products and AI Frameworks

Table 1 Products and frameworks supported for hot switching

| Product Type | Hardware Form | Training Framework |
|---|---|---|
| Atlas A2 training product | Atlas 800T A2 training server | MindSpore, PyTorch |
| Atlas A3 training product | Atlas 800T A3 SuperPoD Server | MindSpore, PyTorch |

Hot Switching Principles

Figure 1 Schematic diagram

The details of each step are as follows:

  1. ClusterD detects a subhealth fault through Ascend Device Plugin.
  2. ClusterD determines whether to perform hot switching based on the configured policy.
  3. ClusterD instructs Ascend Operator to start the backup pod.
  4. Volcano schedules the backup pod.
  5. A new MindIO Processor is created in the backup pod and registers with MindIO Controller.
  6. MindIO Controller delivers a training suspension notification.
  7. MindIO Controller notifies ClusterD of training suspension.
  8. ClusterD instructs Volcano to delete the faulty pod.
  9. ClusterD instructs MindIO to resume training.
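
The ordering of the steps above can be made explicit with a short simulation. This is only a sketch of the sequence, not the actual interaction between the components: `run_hot_switch` and its `step_ok` hook are hypothetical names, and stopping at the first failed step is an assumption made to keep the example small.

```python
from typing import Callable, List, Tuple

# (actor, action) pairs in the order given by the numbered steps.
HOT_SWITCH_STEPS: List[Tuple[str, str]] = [
    ("ClusterD", "detect the subhealth fault through Ascend Device Plugin"),
    ("ClusterD", "decide whether to hot switch based on the configured policy"),
    ("ClusterD", "instruct Ascend Operator to start the backup pod"),
    ("Volcano", "schedule the backup pod"),
    ("MindIO Processor", "start in the backup pod and register with MindIO Controller"),
    ("MindIO Controller", "deliver the training suspension notification"),
    ("MindIO Controller", "notify ClusterD of training suspension"),
    ("ClusterD", "instruct Volcano to delete the faulty pod"),
    ("ClusterD", "instruct MindIO to resume training"),
]


def run_hot_switch(step_ok: Callable[[int], bool]) -> List[str]:
    """Walk the steps in order and stop at the first one that fails.

    step_ok is a hypothetical hook: it receives the 1-based step number
    and returns True if that step succeeded.
    """
    log: List[str] = []
    for number, (actor, action) in enumerate(HOT_SWITCH_STEPS, start=1):
        if not step_ok(number):
            log.append(f"{number}. {actor}: FAILED to {action}")
            break
        log.append(f"{number}. {actor}: {action}")
    return log


if __name__ == "__main__":
    # Happy path: every step succeeds.
    for line in run_hot_switch(lambda number: True):
        print(line)
```

Passing a hook such as `lambda number: number != 4` simulates a scheduling failure at step 4, in which case the log ends at that step.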

Function Adaptation Points

During hot switching, the cluster brain sets annotations on the faulty pod based on the subhealth fault information, starts and schedules the backup pod, and notifies MindIO of the hotSwitch policy. Training resumes after the job is switched to the backup pod.

In the training container, the framework initializes the MindIO service. After the service is started, the optimizer reports its update status to MindIO. When an exception occurs, a decorator captures the fault mode. After the new node is started, training on the healthy nodes is paused. The communicator is then rebuilt and the parameter plane of the new node is restored. Node hot switching is complete once training resumes.
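
Pausing training on the healthy nodes only at an iteration boundary can be illustrated with a gate that the training loop checks after each step. This is a minimal sketch under stated assumptions: `PauseGate` and the `on_iteration_end` hook are hypothetical stand-ins, and the real suspension function provided by MindIO is not shown here.

```python
import threading
from typing import Callable, List, Optional


class PauseGate:
    """Hypothetical stand-in for a suspension hook; not the MindIO API."""

    def __init__(self) -> None:
        self._may_run = threading.Event()
        self._may_run.set()  # training may proceed by default

    def request_pause(self) -> None:
        """Ask the training loop to stop at the next iteration boundary."""
        self._may_run.clear()

    def resume(self) -> None:
        """Let a paused training loop continue."""
        self._may_run.set()

    def wait_if_paused(self, timeout: Optional[float] = None) -> bool:
        """Block while paused; return True once training may proceed."""
        return self._may_run.wait(timeout)


def train(
    num_iterations: int,
    gate: PauseGate,
    on_iteration_end: Optional[Callable[[int, PauseGate], None]] = None,
) -> List[int]:
    """Run a placeholder training loop that pauses only between iterations."""
    completed: List[int] = []
    for step in range(num_iterations):
        completed.append(step)  # placeholder for forward, backward, optimizer update
        if on_iteration_end is not None:
            on_iteration_end(step, gate)
        # The iteration is finished, so it is safe to suspend here.
        gate.wait_if_paused()
    return completed
```

Because the gate is checked only after an iteration finishes, a pause request that arrives mid-iteration takes effect at the next boundary rather than interrupting the step in progress.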

Users who do not use MindSpeed-LLM or MindCluster must adapt the functions listed in Table 2.

Table 2 Functions adapted for hot switching

| Function | Description | Adapted Component | Reference Link |
|---|---|---|---|
| Startup during initialization | The MindIO service is started when the training framework is initialized. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework |
| Optimizer update status reporting | The start and end of each optimizer update are reported. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework |
| DP replica group creation | The creation logic of dp_cp/dp_ep replica groups and gloo groups is added. The replica groups are created after the native Megatron distributed parallel groups are created. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework |
| Optimizer replica | The functions of the native Megatron optimizer are inherited, with MindIO optimizer replica management logic embedded. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework |
| Exception capture decorator | A decorator is applied to the train function to capture fault modes. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework |
| Node restart and communication re-establishment | A rebuild callback function is registered to rebuild the communicator between the healthy node and the faulty node. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework |
| Online parameter plane repair | A callback function is used to restore replica and recovery ranks. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework |
| Status rollback | A callback function is used to rebuild data iterators and reset framework variables. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework |
| Graceful suspension | The MindIO function is called at the end of the training iteration to implement active suspension. | Distributed training framework | Adapting to non-MindSpeed-LLM Framework |
| Hot switching control | Annotations are set to manage backup and faulty pods, thereby managing the hot switching process. | AI platform | See here. |
| Pod creation and deletion | Pods are created or deleted by identifying specific annotations. | AI platform | See here. |
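
The decorator and callback rows of Table 2 can be pictured as follows. This is a hedged sketch, not the MindIO interface: `register`, `capture_faults`, `TrainingFault`, and the stage names are hypothetical, the stage order simply follows the table rows, and the single retry after recovery is a simplification.

```python
import functools
from typing import Callable, Dict, List

# Hypothetical stage names mirroring three rows of Table 2:
# communication re-establishment, parameter plane repair, status rollback.
RECOVERY_STAGES = ("rebuild", "repair", "rollback")
_callbacks: Dict[str, List[Callable[[], None]]] = {s: [] for s in RECOVERY_STAGES}


class TrainingFault(RuntimeError):
    """Illustrative exception type standing in for a captured fault mode."""


def register(stage: str) -> Callable[[Callable[[], None]], Callable[[], None]]:
    """Register a recovery callback for one stage (hypothetical helper)."""
    if stage not in _callbacks:
        raise ValueError(f"unknown recovery stage: {stage}")

    def decorator(fn: Callable[[], None]) -> Callable[[], None]:
        _callbacks[stage].append(fn)
        return fn

    return decorator


def capture_faults(train_fn: Callable) -> Callable:
    """Wrap a train function: on a fault, run the callbacks, then retry once."""

    @functools.wraps(train_fn)
    def wrapper(*args, **kwargs):
        try:
            return train_fn(*args, **kwargs)
        except TrainingFault:
            # Run the stages in table order (an assumption of this sketch).
            for stage in RECOVERY_STAGES:
                for callback in _callbacks[stage]:
                    callback()
            return train_fn(*args, **kwargs)

    return wrapper
```

A framework would decorate its own train function with `capture_faults` and register one callback per stage, for example a function that rebuilds the data iterators under `@register("rollback")`.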