Process-Level Online Recovery

Process-level online recovery (also referred to as step-level recomputation recovery) is used to rectify the following faults:

Network faults:
- If BGP switches links after an HCCS L1-L2 port or link fault and operator-level online recovery fails, step-level recomputation is triggered to rectify the fault quickly without exiting the processes. If operator-level online recovery is disabled, step-level recomputation is performed on the training processes directly, again without interrupting them.
- If operator-level online recovery fails after a RoCE upper-level port or link fault, the training processes retry at the step level to rectify the fault quickly without exiting.

On-chip memory faults:
- If an uncorrectable error (such as error 0x80E01801) occurs in the on-chip memory, the faulty memory space is isolated, and step-level recomputation is performed on the training processes to rectify the fault quickly without exiting the processes.

If faults cannot be rectified in the preceding two scenarios, rescheduling mode is then triggered.
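The escalation order described above for network faults can be sketched as follows. This is a minimal illustration only: the function names and the boolean inputs are hypothetical stand-ins, not part of MindCluster or MindIO.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a recovery level is a name plus a callable that
# returns True when that level rectified the fault.
Strategy = Tuple[str, Callable[[], bool]]


def build_chain(operator_level_enabled: bool,
                operator_level_ok: bool,
                step_level_ok: bool) -> List[Strategy]:
    """Order the recovery levels from least to most disruptive."""
    chain: List[Strategy] = []
    if operator_level_enabled:
        chain.append(("operator-level online recovery", lambda: operator_level_ok))
    chain.append(("step-level recomputation", lambda: step_level_ok))
    # Rescheduling is the final fallback; it is assumed to succeed here.
    chain.append(("rescheduling", lambda: True))
    return chain


def recover(chain: Sequence[Strategy]) -> str:
    """Return the name of the first level that rectifies the fault."""
    for name, attempt in chain:
        if attempt():
            return name
    raise RuntimeError("all recovery levels failed")


# Operator-level recovery is enabled but fails, so step-level recomputation runs.
handled_by = recover(build_chain(True, False, True))
```

Because the chain is an ordered list, disabling operator-level recovery simply removes the first entry, which matches the behavior described for that case.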

Compared with process-level rescheduling, process-level online recovery does not reschedule faulty processes, which reduces the time spent waiting for inter-process synchronization. In addition, checkpoint data is transferred peer-to-peer (P2P) between NPUs over the high-speed network, which eliminates the overhead of saving and loading checkpoint files.

This fault handling mode is disabled by default. To enable it, see (Optional) Configuring Components.

For details about the key configuration steps of process-level online recovery, see Configuring Process-Level Online Recovery.

Checkpoint transmission over the parameter plane relies on an optimizer replica being present on a healthy NPU. If no replica is available, parameters are restored by loading the checkpoint file from storage.
Because optimizer replicas consume additional device memory, you can switch to local loading mode when device memory is insufficient. In this mode, parameters are restored directly from the checkpoint file in storage, regardless of whether optimizer replicas exist.
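The choice between the two restore paths can be expressed as a small decision function. This is a sketch under stated assumptions: `RestoreSource` and `choose_restore_source` are hypothetical names used for illustration, not MindIO APIs.

```python
from enum import Enum


class RestoreSource(Enum):
    REPLICA_P2P = "optimizer replica over the parameter plane"
    CHECKPOINT_FILE = "checkpoint file in storage"


def choose_restore_source(replica_available: bool,
                          local_loading: bool) -> RestoreSource:
    """Pick where parameters are restored from after a fault."""
    # Local loading mode ignores replicas entirely.
    if local_loading:
        return RestoreSource.CHECKPOINT_FILE
    # Otherwise prefer the replica and fall back to storage when none exists.
    if replica_available:
        return RestoreSource.REPLICA_P2P
    return RestoreSource.CHECKPOINT_FILE


source = choose_restore_source(replica_available=True, local_loading=False)
```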

Restrictions

The following version requirements must be met for process-level online recovery:
- PyTorch 2.7.1
- MindSpeed-LLM 2.3.0

This function depends on the memory management mechanism of PyTorch and can be used only when PYTORCH_NO_NPU_MEMORY_CACHING is not configured.
This function does not address certain on-chip memory faults. For instance, memory address faults in HCCL collective communication must be rectified using process-level rescheduling or an upper-layer fault tolerance solution.
For details about how to handle faults in global variables defined in models or training scripts (for example, in MindSpeed-LLM and MindSpeed), see FAQs.
This function cannot be enabled together with graceful fault tolerance. If they are enabled at the same time, resumable training will be conducted through job-level rescheduling.
In the MindSpore scenario, to ensure the normal use of this mode, install MindSpore and MindIO in the same path.
In the MindSpore scenario, set the TASKD_PROCESS_ENABLE environment variable to on (using export) before starting TaskD Manager.
Do not use the ConfigMap to mount the RankTable file. Otherwise, job rescheduling may fail.
Multimodal models are not supported.
MC2 cannot be enabled.
Watchdog cannot be enabled.
If faults occur in the HCCL link setup phase, process-level online recovery fails. If HCCL link setup is required in other training phases in addition to the training initialization phase, you can set up the link in advance by referring to Configuring HCCL Link Setup to avoid faults during the setup process.
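Two of the restrictions above concern environment variables and can be checked before a job starts. The following is a minimal pre-flight sketch, not an official tool: the function name and the `framework` argument are hypothetical, and applying the PYTORCH_NO_NPU_MEMORY_CACHING check only to PyTorch jobs is an assumption.

```python
import os
from typing import List, Mapping, Optional


def preflight_issues(framework: str,
                     env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable problems found in the environment variables."""
    env = os.environ if env is None else env
    issues: List[str] = []
    # Assumption: this restriction is checked for PyTorch jobs only.
    if framework == "pytorch" and "PYTORCH_NO_NPU_MEMORY_CACHING" in env:
        issues.append("PYTORCH_NO_NPU_MEMORY_CACHING must not be configured")
    # In the MindSpore scenario, TASKD_PROCESS_ENABLE must be set to on.
    if framework == "mindspore" and env.get("TASKD_PROCESS_ENABLE") != "on":
        issues.append("TASKD_PROCESS_ENABLE must be set to on")
    return issues


# Example with an explicit environment mapping instead of os.environ.
issues = preflight_issues("pytorch", {"PYTORCH_NO_NPU_MEMORY_CACHING": "1"})
```

Passing the environment as a mapping keeps the check easy to exercise without modifying the real process environment.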

Supported Products and AI Frameworks

**Table 1** Products and frameworks supported by process-level online recovery for network faults
Product Type	Product	Training Framework
Atlas A3 training products	Atlas 900 A3 SuperPoD, Atlas 800T A3 SuperPoD Server	MindSpore, PyTorch

**Table 2** Products and frameworks supported by process-level online recovery for on-chip memory faults
Product Type	Product	Training Framework
Atlas A2 training products	Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, Atlas 900 A2 PoDc cluster basic unit	MindSpore, PyTorch
Atlas A3 training products	Atlas 900 A3 SuperPoD, Atlas 800T A3 SuperPoD Server	MindSpore, PyTorch

Process-Level Online Recovery Principles

If an on-chip memory or network fault occurs during training, the training status will become abnormal. Process-level online recovery notifies all training processes to stop, retains the current training information, and rectifies the fault. Once recovery is complete, all training processes revert to the status at the end of the previous step. The healthy server transfers the checkpoint data to the affected server via the parameter plane to restore parameters. Training then resumes by re-executing the current step.

Figure 1 Process-level online recovery principles

The steps in the figure are described as follows:

1. After an on-chip memory or network fault occurs on the device, the detection component of MindCluster on the server reports the fault information to ClusterD.
2. CANN detects the on-chip memory or network fault and reports the fault to MindIO Processor and MindIO Controller through the training framework.
3. MindIO Controller requests the cluster brain to determine whether to perform step-level recomputation recovery. The cluster brain makes a decision based on the health status of other nodes in the cluster.
4. MindIO Controller notifies MindIO Processor in each training process to call the training framework to stop the job, rectify the fault, and retain the communicator information.
5. NPUs of the healthy server transfer checkpoint data to the repaired server through the parameter plane. Once the parameter status is restored, training resumes and the computation of the current step restarts.
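The rollback-and-recompute behavior in the steps above can be illustrated with a toy simulation. It is a sketch only: the `Rank` class, the in-memory snapshot, and the deep copy that stands in for the parameter-plane transfer are all hypothetical simplifications, not how MindIO stores or moves data.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Rank:
    rank_id: int
    step: int = 0
    params: List[float] = field(default_factory=lambda: [0.0, 0.0])
    # State retained at the end of the previous step.
    snapshot: Dict = field(default_factory=dict)

    def train_step(self) -> None:
        self.params = [p + 1.0 for p in self.params]
        self.step += 1

    def commit(self) -> None:
        self.snapshot = {"step": self.step, "params": list(self.params)}

    def rollback(self) -> None:
        self.step = self.snapshot["step"]
        self.params = list(self.snapshot["params"])


def recover(healthy: Rank, faulty: Rank) -> None:
    """Revert both ranks to the state at the end of the previous step."""
    healthy.rollback()
    # Stand-in for the transfer from the healthy NPU over the parameter plane.
    faulty.snapshot = copy.deepcopy(healthy.snapshot)
    faulty.rollback()


ranks = [Rank(0), Rank(1)]
for _ in range(3):
    for r in ranks:
        r.train_step()
        r.commit()

# During step 4, rank 0 has already progressed when rank 1 loses its state.
ranks[0].train_step()
ranks[1].params = [float("nan"), float("nan")]
ranks[1].snapshot = {}

recover(healthy=ranks[0], faulty=ranks[1])

# Training resumes by re-executing the interrupted step on every rank.
for r in ranks:
    r.train_step()
    r.commit()
```

After recovery both ranks hold identical state again, and the interrupted step is computed exactly once in the final result.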

Function Adaptation Points

During process-level online recovery, the cluster brain identifies network faults and on-chip memory faults based on fault information, and delivers the corresponding recovery policy, with support for recovery policy rollback. In the training container, the framework initializes the MindIO service. After the service is started, the optimizer updates the corresponding status to MindIO. Then, a DP replica group and optimizer replicas are created to ensure redundant backup of model parameters. When an exception occurs, the decorator captures the related fault mode. During process-level online recovery, operator resources are cleared, the UCE model optimizer is rebuilt, the parameter plane is repaired online, and the status is rolled back.

Users who do not use MindSpeed-LLM or MindCluster need to adapt the following functions in their framework.

**Table 3** Functions adapted for process-level online recovery for network faults
Function	Description	Adapted Component	Reference Link
Startup during initialization	The MindIO service is started when the training framework is initialized.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Optimizer update status reporting	The start and end of each optimizer update are reported.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Exception capture decorator	A decorator is applied to the train function to capture fault modes.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Operator resource clearing	A callback function is used to clear operator resources.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Status rollback	A callback function is used to rebuild data iterators and reset framework variables.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Recovery policy decision	Fault information is used to determine whether a network fault or an on-chip memory fault has occurred and to deliver the corresponding recovery policy, with support for recovery policy rollback.	AI platform	See here.
Scheduling of faulty pods	Faulty pods are scheduled, and scheduling policy rollback is supported.	AI platform	See here.
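Optimizer update status reporting, as listed in the table, can be wrapped around the update call. The sketch below uses a context manager with a hypothetical `report` callback and hypothetical event names; the real MindIO reporting interface is not shown in this document.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def optimizer_update_reporting(report: Callable[[str], None]) -> Iterator[None]:
    """Report the start of an optimizer update and, on success, its end."""
    report("optimizer_update_start")
    # If the update raises, the end event is deliberately not reported,
    # so an interrupted update is distinguishable from a completed one.
    yield
    report("optimizer_update_end")


events: List[str] = []

with optimizer_update_reporting(events.append):
    pass  # a real training loop would call optimizer.step() here

interrupted: List[str] = []
try:
    with optimizer_update_reporting(interrupted.append):
        raise RuntimeError("simulated fault during the update")
except RuntimeError:
    pass
```

Leaving the end event unreported on failure is a design choice of this sketch; it lets a recovery controller tell whether a fault landed in the middle of an update.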

**Table 4** Functions adapted for process-level online recovery for on-chip memory faults
Function	Description	Adapted Component	Reference Link
Startup during initialization	The MindIO service is started when the training framework is initialized.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Optimizer update status reporting	The start and end of each optimizer update are reported.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
DP replica group creation	The creation logic of dp_cp/dp_ep replica groups and gloo groups is added. The replica groups are created after the native Megatron distributed parallel groups are created.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Optimizer replica	The functions of the native Megatron optimizer are inherited, with MindIO optimizer replica management logic embedded.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Exception capture decorator	A decorator is applied to the train function to capture fault modes.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Operator resource clearing	A callback function is used to clear operator resources.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
UCE model optimizer rebuilding	A callback function is used to clear and rebuild the model optimizer object on the faulty rank.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Online parameter plane repair	A callback function is used to restore replica and recovery ranks.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Status rollback	A callback function is used to rebuild data iterators and reset framework variables.	Distributed training framework	Adapting to non-MindSpeed-LLM Framework
Recovery policy decision	Fault information is used to determine whether a network fault or an on-chip memory fault has occurred and to deliver the corresponding recovery policy, with support for recovery policy rollback.	AI platform	See here.
Scheduling of faulty pods	Faulty pods are scheduled, and scheduling policy rollback is supported.	AI platform	See here.
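The exception capture decorator and the recovery callbacks listed in Tables 3 and 4 fit together roughly as sketched below. All names here (`TrainingFault`, `capture_faults`, the callback keys, and the fault mode strings) are hypothetical; the order of callbacks follows the order of the table rows, which is an assumption rather than a documented execution order.

```python
import functools
from typing import Callable, Dict, List


class TrainingFault(Exception):
    """Hypothetical exception carrying the captured fault mode."""

    def __init__(self, mode: str) -> None:
        super().__init__(mode)
        self.mode = mode


# Callbacks to run per fault mode, following the rows of Tables 3 and 4.
RECOVERY_STEPS: Dict[str, List[str]] = {
    "network": ["clear_operator_resources", "rollback_status"],
    "on_chip_memory": [
        "clear_operator_resources",
        "rebuild_model_optimizer",
        "repair_parameter_plane",
        "rollback_status",
    ],
}


def capture_faults(callbacks: Dict[str, Callable[[], None]], max_retries: int = 1):
    """Decorate a train function so that captured faults trigger recovery."""
    def decorator(train: Callable):
        @functools.wraps(train)
        def wrapper(*args, **kwargs):
            attempts = 0
            while True:
                try:
                    return train(*args, **kwargs)
                except TrainingFault as fault:
                    if attempts >= max_retries:
                        raise  # give up and let a higher level handle it
                    attempts += 1
                    for name in RECOVERY_STEPS[fault.mode]:
                        callbacks[name]()
        return wrapper
    return decorator


log: List[str] = []
callbacks = {name: (lambda n=name: log.append(n))
             for name in RECOVERY_STEPS["on_chip_memory"]}
state = {"calls": 0}


@capture_faults(callbacks)
def train() -> str:
    state["calls"] += 1
    if state["calls"] == 1:
        raise TrainingFault("on_chip_memory")  # simulated fault on first run
    return "done"


result = train()
```

Once the retry budget is exhausted the fault is re-raised, which mirrors the fallback to rescheduling when online recovery cannot rectify the fault.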
