Checkpoint Transmission and Recovery over the Parameter Plane

The dying gasp checkpoint mechanism reduces the training progress lost on rollback to a single step. However, checkpoints must still be flushed to storage drives during a fault and then reloaded from storage when training resumes after fault handling, which prolongs the total recovery time. To address this, MindCluster introduces checkpoint transmission and recovery over the parameter plane.

When a fault occurs, the parameter state is retained on the device. Once fault handling is complete, the parameter state on a healthy card is transmitted over the parameter plane network to the card being recovered, so that card restores its parameters quickly without reading a checkpoint from storage. Currently, this capability must be used together with process-level rescheduling and process-level online recovery.
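
The flow above can be illustrated with a minimal, self-contained sketch. It is not MindCluster code: the Card class and the restore_over_parameter_plane function are hypothetical names, the in-memory copy merely stands in for a device-to-device transfer over the parameter plane network, and the sketch assumes that healthy cards in the same group hold an identical copy of the parameter state.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Card:
    """Hypothetical stand-in for one device in a replica group."""
    rank: int
    healthy: bool = True
    step: int = 0
    params: dict = field(default_factory=dict)


def restore_over_parameter_plane(group, recovered_rank):
    """Copy parameter state from a healthy peer to the recovered card.

    The deepcopy stands in for a transfer over the parameter plane
    network (assumption); no storage read is involved.
    """
    donors = [c for c in group if c.healthy and c.rank != recovered_rank]
    if not donors:
        raise RuntimeError("no healthy card holds the parameter state")
    donor = donors[0]
    target = next(c for c in group if c.rank == recovered_rank)
    target.params = deepcopy(donor.params)  # peer-to-peer copy, not a disk load
    target.step = donor.step                # resume from the donor's step
    target.healthy = True
    return donor.rank


# Four cards that all hold the same parameters at step 120 (made-up values).
group = [Card(rank=r, step=120, params={"w": [0.1, 0.2], "b": [0.0]})
         for r in range(4)]

# Simulate a fault on rank 2: its state is lost.
group[2].healthy = False
group[2].params = {}
group[2].step = 0

donor_rank = restore_over_parameter_plane(group, recovered_rank=2)
print(donor_rank, group[2].step, group[2].params)
```

The sketch raises an error when no healthy card is available, which corresponds to the case where recovery would have to fall back to a checkpoint on storage.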

For details about how to configure checkpoint transmission over the parameter plane, see Restoring Parameter Passing on the Parameter Plane.