MindIO TFT

Application Scenario

During LLM training, saving and loading periodic checkpoints, and reloading data to continue iterative training, can be time-consuming. With MindIO TFT, checkpoint data is generated immediately after a fault occurs. Once the fault is rectified, training resumes from the state just before the fault occurred, minimizing iteration loss. Depending on the fault type, MindIO UCE and MindIO ARF either repair the fault online or restart only the faulty node, reducing the overall cluster restart time.

Component Function

MindIO TFT provides dying gasp checkpoint saving, process-level online recovery, and graceful fault tolerance:

  • MindIO TTP verifies the integrity and consistency of intermediate state data after a fault occurs during foundation model training, creates a dying gasp checkpoint, and uses that checkpoint to restore training, minimizing the iteration loss caused by the fault.
  • MindIO UCE detects UCE faults in the on-chip memory during foundation model training and completes online repairs to implement step-level recomputation.
  • MindIO ARF restarts or replaces the faulty node after an exception occurs during training, allowing training to continue without restarting the entire cluster.
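
The dying gasp checkpoint flow described above can be illustrated with a minimal, self-contained Python sketch. It is not the MindIO TFT API: every name here (SimulatedFault, digest, train_step, dying_gasp_checkpoint, restore, run) is hypothetical, the "model" is a toy dictionary, and the consistency check simply assumes that data-parallel replicas hold identical state. The sketch only shows the general idea of verifying state after a fault, saving it, and resuming from the last completed step.

```python
import hashlib
import json


class SimulatedFault(RuntimeError):
    """Hypothetical stand-in for a hardware or process fault."""


def digest(state):
    """Return a SHA-256 digest of the canonically serialized state."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def train_step(state, fail_at=None):
    """Advance the toy state by one step, or raise a fault at step fail_at."""
    next_step = state["step"] + 1
    if next_step == fail_at:
        raise SimulatedFault(f"fault during step {next_step}")
    return {"step": next_step, "weights": [w + 1 for w in state["weights"]]}


def dying_gasp_checkpoint(replica_states):
    """Check that all replicas agree, then snapshot the last good state."""
    digests = {digest(s) for s in replica_states}
    if len(digests) != 1:
        raise ValueError("replica states are inconsistent; cannot checkpoint")
    snapshot = json.loads(json.dumps(replica_states[0]))  # deep copy
    return {"state": snapshot, "sha256": digests.pop()}


def restore(checkpoint):
    """Verify checkpoint integrity and return a fresh copy of its state."""
    if digest(checkpoint["state"]) != checkpoint["sha256"]:
        raise ValueError("checkpoint is corrupted")
    return json.loads(json.dumps(checkpoint["state"]))


def run(total_steps, fail_at):
    """Train two toy replicas, surviving one simulated fault at fail_at."""
    replicas = [{"step": 0, "weights": [0, 0]} for _ in range(2)]
    pending_fault = fail_at
    while replicas[0]["step"] < total_steps:
        try:
            # If any replica faults, the assignment never happens, so
            # `replicas` still holds the state of the last completed step.
            replicas = [train_step(s, pending_fault) for s in replicas]
        except SimulatedFault:
            pending_fault = None  # assume the fault has been rectified
            checkpoint = dying_gasp_checkpoint(replicas)
            replicas = [restore(checkpoint) for _ in replicas]
    return replicas[0]


if __name__ == "__main__":
    print(run(total_steps=5, fail_at=3))
```

In this sketch the fault strikes while step 3 is being computed, so the checkpoint captures the state after step 2 and only step 3 is recomputed. With purely periodic checkpoints, all steps since the last saved checkpoint would have to be repeated.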

Upstream and Downstream Dependencies

Figure 1 MindIO TFT