Product Description

Overview

MindCluster MindIO Training Fault Tolerance (MindIO TFT for short) provides dying gasp checkpoint saving, process-level online recovery, and process-level rescheduling. These functions correspond to the following three features:

The MindCluster MindIO Try To Persist (MindIO TTP for short) speeds up fault recovery during foundation model training. When a fault occurs during training, MindIO TTP checks the integrity and consistency of intermediate data and creates dying gasp checkpoint data, which can be used to resume the training process. This helps to minimize the loss of training iterations caused by the fault.
The MindCluster MindIO Uncorrectable Memory Error (MindIO UCE for short) detects UCEs in the on-chip memory during foundation model training, rectifies the errors online, and implements step-level recomputation.
The MindCluster MindIO Air Refuelling (MindIO ARF for short) restarts or replaces only the node where an exception occurs during training, instead of restarting the entire cluster, so that the fault is rectified and training continues. For some faults, restarting a single process is sufficient.

Benefit

Large language models (LLMs) are a focal point of competition in the global science and technology industry. LLM training typically takes days or even months, so checkpoint data is crucial for resuming training after an interruption. Training jobs in the cluster are paused while a checkpoint is being saved, so to keep cluster utilization high the checkpointing interval is usually set to be relatively long, sometimes several hours. As a result, if a training job fails shortly before the next checkpoint is generated, it can only be resumed from the last checkpoint, and all training iterations between that checkpoint and the failure must be recomputed, which causes significant loss. With MindIO TTP, checkpoint data can be generated promptly after a fault occurs. Once the fault is rectified, training can be resumed from the state just before the fault occurred, thereby minimizing iteration loss.
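
To make this cost concrete, the following sketch estimates how many completed iterations are repeated when training resumes from the last periodic checkpoint. The function name and all figures (a 3-hour interval, 20 seconds per iteration) are illustrative assumptions, not measurements and not part of MindIO TFT.

```python
def lost_iterations(seconds_since_checkpoint: int, step_time_s: int) -> int:
    """Completed iterations lost when resuming from the last periodic checkpoint."""
    if seconds_since_checkpoint < 0 or step_time_s <= 0:
        raise ValueError("elapsed time must be >= 0 and step time must be > 0")
    return seconds_since_checkpoint // step_time_s


# Illustrative numbers only: a checkpoint every 3 hours, 20 s per iteration,
# and a fault 10 minutes before the next checkpoint would have been written.
interval_s = 3 * 3600
fault_offset_s = interval_s - 10 * 60
print(lost_iterations(fault_offset_s, 20))  # 510
```

Under these assumed numbers, 510 iterations (2 hours 50 minutes of cluster time) would be repeated. A checkpoint generated at the time of the fault removes most of that recomputation.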

In addition, saving and loading checkpoint data during LLM training takes a significant amount of time on every occasion, comparable to the time consumed by a periodic checkpoint operation. With MindIO UCE online repair, when a UCE occurs on a neural processing unit (NPU), training can be restored to the state before the fault through fault clearing, fault rectification, and data rollback, which saves the time otherwise required to stop and restart the training. If the repair fails, MindIO TTP is used as a fallback.

MindIO TFT Architecture

The functions of MindIO TFT are integrated into a .whl package. This package can be adapted to foundation model frameworks such as MindSpeed-LLM, allowing you to use the desired functions through module import.

Key points of MindIO TFT:

MindIO TTP
- The Controller and Processor modules monitor the model training state: each Processor periodically reports its state to the Controller module through heartbeat messages. Once a fault is detected, the dying gasp checkpoint is saved.
- In foundation model training, common industry practice is to save checkpoints periodically at long intervals. If a fault occurs between the last and next scheduled save times, the required retraining can consume substantial time and resources. MindIO TTP minimizes this loss by resuming training from the point of the fault rather than from the last periodic checkpoint.
MindIO UCE
- Once a UCE is detected, online repair starts.
- In foundation model training, retraining consumes substantial resources, whether checkpoints are periodically saved or the dying gasp checkpoint of MindIO TFT is used. MindIO UCE enables step-level recomputation for a training framework, avoiding the need to restart the process and reducing iteration loss in continuous training. If UCE repair fails, the TTP process will be initiated.
MindIO ARF
- MindIO ARF allows you to restart or replace a faulty node to rectify a fault and resume model training, without restarting the entire cluster.
- For service process exceptions or processor faults at the RestartRequest and RestartBusiness levels, you can trigger process-level recovery to rectify faults and resume model training.
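
The heartbeat-based detection described for MindIO TTP can be pictured with the following minimal sketch. It is not the MindIO TFT implementation: the class HeartbeatMonitor, its methods, and the timeout value are hypothetical names chosen for illustration. Each Processor reports a timestamped heartbeat, and the Controller treats any rank whose last heartbeat is older than the timeout as suspect.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Controller-side bookkeeping of Processor heartbeats (hypothetical)."""
    timeout_s: float
    last_seen: Dict[int, float] = field(default_factory=dict)

    def report(self, rank: int, now: float) -> None:
        # Called whenever a heartbeat from the Processor of `rank` arrives.
        self.last_seen[rank] = now

    def stale_ranks(self, now: float) -> List[int]:
        # Ranks whose most recent heartbeat is older than the timeout.
        return sorted(
            rank for rank, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=5.0)
monitor.report(rank=0, now=0.0)
monitor.report(rank=1, now=0.0)
monitor.report(rank=0, now=4.0)      # rank 1 stops reporting after t = 0
print(monitor.stale_ranks(now=6.0))  # [1]
```

Timestamps are passed in explicitly so that the sketch is deterministic; a real monitor would read a clock such as time.monotonic().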

Logical Model

Controller module: It coordinates distributed tasks, maintains the state machine (supporting process control in various scenarios), and collects the training state of each process in real-time. When an exception occurs during training, it triggers the state machine based on the exception type and sends the corresponding action to the Processor module for execution.
Processor module: It interacts with a training framework to obtain the training state of the process, reports this state to the Controller module, and executes the corresponding actions delivered by the Controller module.
Adaptor module: It adapts MindIO TTP, MindIO UCE, and MindIO ARF to a training framework. Currently, MindIO TFT has been adapted to the MindSpeed-LLM training framework. For other training frameworks, you need to perform the adaptation yourself as required.
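
The division of labor between the Controller and Processor modules can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual MindIO TFT state machine: the Fault and Action enumerations, choose_action, dispatch, and the Processor class are hypothetical. Only the fallback from UCE repair to the dying gasp checkpoint is taken from this document; applying the same fallback to other fault types is an assumption.

```python
from enum import Enum, auto
from typing import List


class Fault(Enum):
    UCE = auto()                # uncorrectable memory error on an NPU
    PROCESS_EXCEPTION = auto()  # fault recoverable at process level
    NODE_FAULT = auto()         # node must be restarted or replaced


class Action(Enum):
    UCE_ONLINE_REPAIR = auto()
    PROCESS_RECOVERY = auto()
    NODE_RESTART_OR_REPLACE = auto()
    SAVE_DYING_GASP_CHECKPOINT = auto()


def choose_action(fault: Fault, repair_failed: bool = False) -> Action:
    """Controller side: map an exception type to an action."""
    if repair_failed:
        # Documented for UCE; assumed here for the other fault types.
        return Action.SAVE_DYING_GASP_CHECKPOINT
    if fault is Fault.UCE:
        return Action.UCE_ONLINE_REPAIR
    if fault is Fault.PROCESS_EXCEPTION:
        return Action.PROCESS_RECOVERY
    return Action.NODE_RESTART_OR_REPLACE


class Processor:
    """Processor side: records the actions it was asked to execute."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        self.executed: List[Action] = []

    def execute(self, action: Action) -> None:
        self.executed.append(action)


def dispatch(fault: Fault, processors: List[Processor],
             repair_failed: bool = False) -> Action:
    # The Controller picks one action and delivers it to every Processor.
    action = choose_action(fault, repair_failed)
    for processor in processors:
        processor.execute(action)
    return action


workers = [Processor(rank) for rank in range(2)]
dispatch(Fault.UCE, workers)                      # first attempt: online repair
dispatch(Fault.UCE, workers, repair_failed=True)  # fallback: dying gasp checkpoint
print([a.name for a in workers[0].executed])
```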

Deployment Mode

Controller module: In a training cluster, only one Active Controller is supported. It is recommended that it be deployed on node 0 in the cluster. A maximum of two Backup Controllers can be automatically started.
Processor module: In a training cluster, each training process must start its own Processor.
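
The deployment constraints above can be expressed as a small consistency check. The function check_deployment and its parameters are hypothetical and only restate the rules in this section (exactly one Active Controller, at most two Backup Controllers, one Processor per training process); MindIO TFT is not known to expose such a function.

```python
from typing import List

MAX_BACKUP_CONTROLLERS = 2  # limit stated in this section


def check_deployment(active_controllers: int, backup_controllers: int,
                     training_processes: int, processors: int) -> List[str]:
    """Return a list of violated deployment rules (empty if none)."""
    problems = []
    if active_controllers != 1:
        problems.append("exactly one Active Controller is required")
    if backup_controllers > MAX_BACKUP_CONTROLLERS:
        problems.append("at most two Backup Controllers are supported")
    if processors != training_processes:
        problems.append("each training process must start a Processor")
    return problems


print(check_deployment(1, 2, 8, 8))  # []
print(check_deployment(2, 3, 8, 7))  # three violated rules
```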
