Elastic Training

If a hardware fault occurs and the Kubernetes cluster has no backup resources, MindCluster scales in at the data parallel (DP) domain level, removing some nodes so that training can continue. When idle resources become available in the cluster, scale-out is triggered to restore the original training scale. Unlike process-level rescheduling, this function keeps training running even when the cluster has no backup resources.

Restrictions

In the PyTorch scenario, this function must be used together with MindSpeed-LLM 2.3.0. For details about the version mapping, see MindSpeed-LLM.
Only training jobs of the acjob type support this function.
This function depends on the optimizer replicas of MindIO and requires full optimizer replicas. MindIO and TaskD must be installed and used together.
This function cannot be enabled together with graceful fault tolerance.
Elastic training cannot be triggered if the faulty pod is the one whose hccl/rankIndex field in the training job annotation is 0.
Multimodal models are not supported.
Watchdog cannot be enabled.
Elastic training creates additional communication groups, which may increase the on-chip memory usage.
The maximum increase can be estimated with the formula "Maximum increased memory size (MB) = HCCL_BUFFSIZE × 2 × 9". The default value of HCCL_BUFFSIZE is 200 MB. For details about HCCL_BUFFSIZE, see "HCCL_BUFFSIZE" in CANN Environment Variable Reference.
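As a quick check of the formula above, the following minimal sketch evaluates it for the default buffer size and for a larger one. The function name is hypothetical and only illustrates the arithmetic; it is not part of MindCluster or CANN.

```python
def max_extra_memory_mb(hccl_buffsize_mb: int = 200) -> int:
    """Upper bound on the extra on-chip memory (MB) used by elastic training.

    Implements: Maximum increased memory size (MB) = HCCL_BUFFSIZE x 2 x 9.
    The default of 200 MB is the documented default value of HCCL_BUFFSIZE.
    """
    if hccl_buffsize_mb <= 0:
        raise ValueError("HCCL_BUFFSIZE must be a positive number of MB")
    return hccl_buffsize_mb * 2 * 9


if __name__ == "__main__":
    # Default HCCL_BUFFSIZE (200 MB): 200 * 2 * 9 = 3600 MB
    print(max_extra_memory_mb())
    # Hypothetical larger setting (400 MB): 400 * 2 * 9 = 7200 MB
    print(max_extra_memory_mb(400))
```

With the default HCCL_BUFFSIZE, reserve up to 3600 MB of additional on-chip memory per device when planning the job.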

For more usage restrictions, see MindSpeed-LLM Elastic Training Restrictions.

Supported Products and AI Frameworks

**Table 1** Products and frameworks supported for elastic training
Product Type	Hardware Form	Training Framework
Atlas A2 training products	Atlas 800T A2 training server	PyTorch
Atlas A3 training products	Atlas 900 A3 SuperPoD	PyTorch

Elastic Training Principles

Figure 1 Schematic diagram

In the figure, only one DP domain is scaled in. In actual elastic training, multiple DP domains may be scaled in at a time. Each square in the figure represents a rank.

Distributed training runs normally using tensor parallelism (TP), pipeline parallelism (PP), and data parallelism (DP) strategies.
During training, if a rank becomes faulty and the cluster has no idle resources that can be scheduled for resumable training, a DP domain is scaled in: the pods corresponding to that DP domain (possibly more than one) are removed, and training continues.
During scale-in training, if idle resources become available in the cluster, the removed pods are rescheduled and the job is scaled out to its original scale to continue training.
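The scale-in step can be illustrated with a small sketch that finds which ranks leave the job when one rank fails. It rests on two assumptions that this document does not state: that a DP domain corresponds to one full model replica (all TP and PP ranks sharing a DP index), and that ranks are laid out with TP varying fastest, then DP, then PP. The function names are hypothetical, and the real rank-to-domain mapping is determined by the training framework.

```python
def dp_domain_ranks(rank: int, tp: int, pp: int, dp: int) -> list[int]:
    """Return all ranks that share a DP index with `rank`.

    Assumed layout (an illustration, not taken from the documentation):
        rank = pp_idx * (dp * tp) + dp_idx * tp + tp_idx
    """
    world_size = tp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside the world size {world_size}")
    dp_idx = (rank % (dp * tp)) // tp
    return [p * dp * tp + dp_idx * tp + t for p in range(pp) for t in range(tp)]


def surviving_ranks(faulty_rank: int, tp: int, pp: int, dp: int) -> list[int]:
    """Ranks that keep training after the faulty rank's DP domain is removed."""
    removed = set(dp_domain_ranks(faulty_rank, tp, pp, dp))
    return [r for r in range(tp * pp * dp) if r not in removed]


if __name__ == "__main__":
    # 8 ranks with TP=2, PP=2, DP=2; rank 5 becomes faulty.
    print(dp_domain_ranks(5, tp=2, pp=2, dp=2))   # ranks removed with rank 5
    print(surviving_ranks(5, tp=2, pp=2, dp=2))   # ranks that continue training
```

Under these assumptions, a single faulty rank removes TP × PP ranks, which is why more than one pod may be deleted during scale-in.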

Figure 2 Elastic training flowchart

The details of each step are as follows:

When a hardware fault occurs on a device, the MindCluster detection component installed on the server reports the fault information to ClusterD. Software faults are detected by MindIO Controller in the container, which also reports them to ClusterD.
ClusterD destroys the job container on the faulty server.
If no backup node is available on which to schedule the new container, ClusterD instructs MindIO Controller on the master node to perform scale-in training.
MindIO Controller notifies MindIO Processor in each training process, and MindIO Processor calls PTA to stop the training processes and clear the resources on the normal nodes.
MindIO Controller notifies MindIO Processor in each normal training process, and MindIO Processor calls PTA to perform the scale-in operations, such as rebuilding communication groups, and continue training.
The pods deleted during scale-in are successfully rescheduled.
ClusterD notifies MindIO Controller through TaskD Manager to perform scale-out operations.
MindIO Controller notifies MindIO Processor in each training process, and MindIO Processor calls PTA to stop the training processes and clear the resources on the normal nodes.
Each process establishes links for collective communication.
NPUs of the normal server transfer the checkpoint data to the standby server through the parameter plane. After the parameter status is restored, the training continues.
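The decision points in the steps above can be summarized as a small function. This is a simplified sketch of the flow described in this section, not ClusterD's actual logic; the `Action` enum, the function name, and its parameters are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue training at the current scale"
    RESCHEDULE = "reschedule the container onto a backup node"
    SCALE_IN = "scale in and continue training"
    SCALE_OUT = "scale out to the original scale"


def next_action(fault_detected: bool,
                backup_node_available: bool,
                currently_scaled_in: bool,
                idle_resources_available: bool) -> Action:
    """Pick the next step of the flow (hypothetical helper for illustration)."""
    if fault_detected:
        # A backup node allows ordinary rescheduling; without one,
        # elastic training scales in instead of stopping the job.
        return Action.RESCHEDULE if backup_node_available else Action.SCALE_IN
    if currently_scaled_in and idle_resources_available:
        # The deleted pods can be rescheduled, so the original scale is restored.
        return Action.SCALE_OUT
    return Action.CONTINUE


if __name__ == "__main__":
    # Fault with no backup node: the job scales in.
    print(next_action(True, False, False, False))
    # Later, idle resources appear while the job is scaled in: the job scales out.
    print(next_action(False, False, True, True))
```

The sketch omits the restrictions listed earlier, such as the case in which the faulty pod has hccl/rankIndex 0 and elastic training cannot be triggered.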
