Saving Checkpoints Periodically

Large-scale cluster training currently relies on checkpoints, which store the training state (such as model parameters). When the service platform detects a fault, it can terminate the current training job and reload the most recent checkpoint to resume training from the point at which that checkpoint was saved, avoiding a complete restart.

Periodic checkpoint saving consists of two parts: asynchronous checkpoint saving and memory checkpoint loading.

Asynchronous checkpoint saving
MindIO ACP provides the capability of saving checkpoints asynchronously at a fixed interval. Without MindIO ACP, the parameters to be saved must be copied from the device to the host and then flushed to storage, which takes several minutes and interrupts training for that whole time. With MindIO ACP, flushing is asynchronous: after the parameters are copied from the device to the host, they are written to storage in the background, so training continues uninterrupted during the flushing phase.
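
The pattern described above can be illustrated with a minimal, generic sketch. It does not use or represent the MindIO ACP API: the `AsyncCheckpointSaver` class, the file naming, and the use of `copy.deepcopy` as a stand-in for the device-to-host copy are all assumptions made for illustration only.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointSaver:
    """Hypothetical helper: snapshot in the foreground, flush in the background."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = None  # background writer thread, if one is running

    def save(self, step, state):
        # Blocking part: take a private copy of the state. This stands in for
        # the device-to-host copy, so training may keep mutating the original.
        snapshot = copy.deepcopy(state)
        # Allow only one flush in flight: wait for the previous one first.
        self.wait()
        self._pending = threading.Thread(target=self._flush, args=(step, snapshot))
        self._pending.start()

    def _flush(self, step, snapshot):
        final_path = os.path.join(self.directory, f"ckpt_{step}.pkl")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Rename only after the write completes, so a half-written file
        # never appears under the final checkpoint name.
        os.replace(tmp_path, final_path)

    def wait(self):
        if self._pending is not None:
            self._pending.join()
            self._pending = None


def demo_async_save():
    """Run a toy training loop that saves every second step."""
    with tempfile.TemporaryDirectory() as d:
        saver = AsyncCheckpointSaver(d)
        state = {"step": 0, "weights": [0.0, 0.0]}
        for step in range(1, 5):
            state["step"] = step
            state["weights"] = [w + 1.0 for w in state["weights"]]  # "training"
            if step % 2 == 0:
                saver.save(step, state)  # returns once the snapshot is taken
        saver.wait()  # make sure the last flush has reached storage
        with open(os.path.join(d, "ckpt_4.pkl"), "rb") as f:
            return pickle.load(f)
```

In this sketch only the snapshot blocks the training loop; the write to storage overlaps with the following training steps. A real system would also have to bound the host memory held by pending snapshots, which the sketch handles simply by allowing one flush at a time.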

Memory checkpoint loading
MindIO ACP provides the capability of loading periodic checkpoints from memory. During training recovery, a previously saved periodic checkpoint must be loaded to restore the training state and resume training. Loading a large model's checkpoint from storage typically takes several minutes because of the data volume and storage performance constraints. To accelerate this process, MindIO ACP introduces a memory-based loading mechanism for periodic checkpoints: in the event of a fault, checkpoints are loaded directly from memory, significantly reducing recovery time.
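
The idea of preferring a memory copy over storage during recovery can be sketched as follows. This is a generic illustration, not the MindIO ACP implementation: the `CheckpointStore` class and its methods are hypothetical, and the sketch keeps the memory copy inside the same process for simplicity, whereas this document does not describe where MindIO ACP holds its in-memory checkpoints.

```python
import copy
import os
import pickle
import tempfile


class CheckpointStore:
    """Hypothetical store: latest checkpoint in host memory, all on storage."""

    def __init__(self, directory):
        self.directory = directory
        self._memory = None  # (step, state) of the most recent checkpoint

    def _path(self, step):
        return os.path.join(self.directory, f"ckpt_{step}.pkl")

    def save(self, step, state):
        # Keep a private in-memory copy and also persist to storage.
        self._memory = (step, copy.deepcopy(state))
        with open(self._path(step), "wb") as f:
            pickle.dump(state, f)

    def drop_memory(self):
        # Simulates the case where the in-memory copy is no longer available.
        self._memory = None

    def load_latest(self):
        """Return (step, state, source), preferring the memory copy."""
        if self._memory is not None:
            step, state = self._memory
            return step, copy.deepcopy(state), "memory"
        # Fallback: find the newest checkpoint file on storage.
        steps = [
            int(name[len("ckpt_"):-len(".pkl")])
            for name in os.listdir(self.directory)
            if name.startswith("ckpt_") and name.endswith(".pkl")
        ]
        if not steps:
            raise FileNotFoundError("no checkpoint available")
        step = max(steps)
        with open(self._path(step), "rb") as f:
            return step, pickle.load(f), "storage"


def demo_memory_load():
    with tempfile.TemporaryDirectory() as d:
        store = CheckpointStore(d)
        store.save(10, {"weights": [1.0]})
        store.save(20, {"weights": [2.0]})
        fast = store.load_latest()   # served from memory
        store.drop_memory()
        slow = store.load_latest()   # falls back to storage
        return fast, slow
```

Both paths restore the same state; the difference is that the memory path skips reading and deserializing the file, which is where the several-minute cost arises for large models.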

Recommended Configuration

When using the checkpoint saving capability of the rescheduling upon faults feature, select a frequency for periodically saving checkpoints based on your actual requirements. Figure 1 illustrates recommended frequencies.

Figure 1 Recommended frequencies for periodic checkpoint saving

When periodic checkpoint recovery is enabled, any training progress made between the last saved checkpoint and the fault is lost upon recovery. To minimize this loss, you can shorten the interval between checkpoint saves. However, each saving operation interrupts training while checkpoints are flushed from the device to storage, which wastes training time. As a result, shorter intervals waste more training time on saving, whereas longer intervals lose more training progress when a fault occurs. Therefore, as long as the time to save a checkpoint remains constant, a trade-off must be made between the training time spent on saving and the training progress lost to faults.
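
This trade-off can be made concrete with a simple first-order model. The model is an assumption introduced here, not a formula from this document or from Figure 1: if each save blocks training for `blocking_save_s` seconds, saves occur every `interval_s` seconds, and faults occur on average every `mtbf_s` seconds, then roughly `blocking_save_s / interval_s` of the time is spent saving and, on average, half an interval of progress is lost per fault. The function names and all numbers below are hypothetical.

```python
import math


def overhead_fraction(interval_s, blocking_save_s, mtbf_s):
    """Approximate fraction of training time lost to saving plus rework."""
    save_cost = blocking_save_s / interval_s    # time blocked on saves
    rework_cost = interval_s / (2.0 * mtbf_s)   # half an interval lost per fault
    return save_cost + rework_cost


def best_interval(blocking_save_s, mtbf_s):
    """Interval that minimizes overhead_fraction in this simplified model."""
    return math.sqrt(2.0 * blocking_save_s * mtbf_s)


# Hypothetical numbers: one fault per day on average.
MTBF_S = 86_400.0

# A save that blocks training for 5 minutes.
slow_interval = best_interval(300.0, MTBF_S)
slow_overhead = overhead_fraction(slow_interval, 300.0, MTBF_S)

# A save that blocks training for only 3 seconds.
fast_interval = best_interval(3.0, MTBF_S)
fast_overhead = overhead_fraction(fast_interval, 3.0, MTBF_S)
```

Under these assumed numbers, the 5-minute blocking save gives a best interval of about 7200 seconds with roughly 8.3% of training time lost, while the 3-second blocking save gives a best interval of about 720 seconds with roughly 0.83% lost. The model ignores recovery time and assumes faults are rare relative to the interval, so it only indicates the direction of the effect: the less time a save blocks training, the more often checkpoints can be saved and the less progress a fault costs.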

To ease this trade-off, the time taken by a single save must be reduced. However, this duration is largely determined by the volume of data being saved and the performance of the storage system, both of which are typically difficult to optimize. MindIO ACP is therefore introduced to reduce the loss incurred by periodic checkpoint recovery.
