Performance Description
Resumable training allows training to resume after a fault occurs, minimizing the training time lost to the fault. The fault recovery time of resumable training can be divided into two parts: training rollback time and training startup time, as shown in Figure 1.
Training rollback time
If a training fault occurs, the in-progress training state is lost, and training must be restored from the most recently saved checkpoint file. During foundation model training, saving the checkpoint file after every training step reduces training efficiency. Therefore, the checkpoint file is saved once an hour. When a fault occurs, the training progress made between the last saved checkpoint and the fault is lost. Training rollback time is the interval between the last saved checkpoint and the fault occurrence. Assume that the average training rollback time is T0 and the checkpoint saving period is Gf. If a fault is equally likely at any point within a saving period, the rollback time ranges from 0 to Gf and averages half a period, so T0 = Gf/2.
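The relationship T0 = Gf/2 can be checked with a minimal sketch. It assumes that checkpoints complete instantly at exact multiples of Gf from the start of training and that a fault is equally likely at any moment. The function names are illustrative and are not part of any product API.

```python
def rollback_seconds(fault_time_s: float, checkpoint_period_s: float) -> float:
    """Time lost for one fault: the time elapsed since the last checkpoint.

    Assumption: checkpoints are taken at t = 0, Gf, 2*Gf, ... and saving is instantaneous.
    """
    if checkpoint_period_s <= 0:
        raise ValueError("checkpoint period must be positive")
    return fault_time_s % checkpoint_period_s


def average_rollback_seconds(checkpoint_period_s: float) -> float:
    """Average rollback time T0 = Gf / 2 under uniformly distributed fault times."""
    if checkpoint_period_s <= 0:
        raise ValueError("checkpoint period must be positive")
    return checkpoint_period_s / 2


GF = 3600  # one-hour checkpoint saving period, in seconds

# Spread faults evenly across one period (one per second, at the midpoint of each second).
samples = [rollback_seconds(t + 0.5, GF) for t in range(GF)]

print(sum(samples) / len(samples))   # 1800.0
print(average_rollback_seconds(GF))  # 1800.0
```

With a one-hour saving period, the average rollback is therefore 30 minutes. Halving Gf halves T0, at the cost of saving checkpoints twice as often.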
Training startup time
To resume training after a fault, you need to restart the training job, restore the training container and training process, reschedule resources, initialize collective communication, load the checkpoint, and compile. Training can resume only after this startup period completes, so a long startup time wastes resources. Assume that the resource rescheduling time is T1, the collective communication initialization time is T2, the checkpoint loading time is T3, and the compilation time is T4. The training startup time is T1 + T2 + T3 + T4.
The total training time lost to a single fault is T = T0 + T1 + T2 + T3 + T4. For details, see Reference for Training Recovery Duration.
The time required at each phase is influenced by the parameter scale and cluster size. Additionally, network and storage performance impact the overall training loss time.
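The breakdown above can be expressed as a short sketch that keeps each component separate, which makes it easier to see which term dominates. The class and field names are hypothetical, and the sample values are placeholders chosen only to exercise the formula. They are not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryBreakdown:
    """Per-fault time components, all in seconds (hypothetical helper)."""

    checkpoint_period: float  # Gf, checkpoint saving period
    reschedule: float         # T1, resource rescheduling time
    comm_init: float          # T2, collective communication initialization time
    checkpoint_load: float    # T3, checkpoint loading time
    compilation: float        # T4, compilation time

    @property
    def rollback(self) -> float:
        """Average rollback time T0 = Gf / 2."""
        return self.checkpoint_period / 2

    @property
    def startup(self) -> float:
        """Training startup time T1 + T2 + T3 + T4."""
        return self.reschedule + self.comm_init + self.checkpoint_load + self.compilation

    @property
    def total(self) -> float:
        """Total time lost to a single fault, T = T0 + T1 + T2 + T3 + T4."""
        return self.rollback + self.startup


# Placeholder inputs, not measured values.
example = RecoveryBreakdown(
    checkpoint_period=3600,
    reschedule=30,
    comm_init=20,
    checkpoint_load=90,
    compilation=60,
)
print(example.rollback, example.startup, example.total)
```

With these placeholder inputs, the rollback term is far larger than the startup term, which suggests that the checkpoint saving period is usually the first parameter to examine when reducing T.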
Reference for Training Recovery Duration
- 3B parameters: As shown in Figure 2, the checkpoint flush time of the model is about 30 seconds. Recovery takes less than 5 seconds in the device detection phase, less than 30 seconds in the device handling phase, and about 70 seconds in the training restart phase, of which checkpoint data loading takes about 3 seconds.
- 15B parameters: As shown in Figure 3, the checkpoint flush time of the model is about 120 seconds. Recovery takes less than 5 seconds in the device detection phase, less than 30 seconds in the device handling phase, and about 210 seconds in the training restart phase, of which checkpoint data loading takes about 90 seconds.
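The reference figures above can be combined into a rough per-fault estimate. The sketch below treats the "less than" values as upper bounds and the "about" values as exact, so the result is only an approximate upper estimate of the time from fault detection to restart, excluding rollback time. The dictionary keys and function names are illustrative.

```python
# Reference figures quoted above, in seconds.
REFERENCE = {
    "3B": {"flush": 30, "detect_max": 5, "handle_max": 30, "restart": 70, "load": 3},
    "15B": {"flush": 120, "detect_max": 5, "handle_max": 30, "restart": 210, "load": 90},
}


def recovery_upper_estimate(phases: dict) -> int:
    """Sum of the detection, handling, and restart phases (rollback time excluded)."""
    return phases["detect_max"] + phases["handle_max"] + phases["restart"]


def load_share(phases: dict) -> float:
    """Fraction of the restart phase spent loading checkpoint data."""
    return phases["load"] / phases["restart"]


for size, phases in REFERENCE.items():
    print(
        f"{size}: up to about {recovery_upper_estimate(phases)} s, "
        f"checkpoint loading is {load_share(phases):.0%} of the restart phase"
    )
```

Under these assumptions the estimate is about 105 seconds for the 3B model and about 245 seconds for the 15B model. Checkpoint loading grows from a small fraction of the restart phase to a substantial one as the parameter scale increases.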


