Performance Description

Resumable training lets a training job resume after a fault occurs, minimizing the training time lost to the fault. The fault recovery time of resumable training consists of two parts: training rollback time and training startup time, as shown in Figure 1.

Figure 1 Fault recovery phase

Training rollback time

If a training fault occurs, the in-memory training state is lost, and training must be restored from the saved checkpoint file. During foundation model training, saving the checkpoint file too frequently reduces training efficiency. Therefore, the checkpoint file is saved once an hour. When a fault occurs, the training progress made between the last saved checkpoint and the fault is lost. Training rollback time is the interval between the last saved checkpoint and the fault occurrence. Assume that the average training rollback time is T0 and the checkpoint saving period is Gf. If a fault is equally likely at any point within a saving period, then T0 = Gf/2.
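The relation T0 = Gf/2 can be checked numerically. The following is a minimal sketch, assuming that faults occur at uniformly random times and that checkpoints are saved at exact multiples of Gf. The function name, trial count, and seed are illustrative and not part of any product API.

```python
import random


def average_rollback_time(checkpoint_period_s: float,
                          trials: int = 100_000,
                          seed: int = 0) -> float:
    """Estimate the mean time between the last checkpoint and a random fault."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Assumption: the fault time is uniform over ten checkpoint periods.
        fault_time = rng.uniform(0.0, 10 * checkpoint_period_s)
        # Progress lost = time elapsed since the most recent checkpoint.
        total += fault_time % checkpoint_period_s
    return total / trials


if __name__ == "__main__":
    gf = 3600.0  # one-hour checkpoint saving period, in seconds
    print(f"Estimated T0: {average_rollback_time(gf):.0f} s (Gf/2 = {gf / 2:.0f} s)")
```

With a one-hour saving period, the estimate comes out close to 1800 seconds, which matches Gf/2.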

Training startup time

If a training fault occurs, resuming training requires restarting the training job: resources are rescheduled, the training container and training process are restored, collective communication is initialized, the checkpoint is loaded, and compilation is performed. Training can resume only after this startup period is complete, so a long startup time wastes resources. Assume that the resource rescheduling time is T1, the collective communication initialization time is T2, the checkpoint loading time is T3, and the compilation time is T4. The training startup time is then T1 + T2 + T3 + T4.

The total training time lost to a single fault is T = T0 + T1 + T2 + T3 + T4. For details, see Reference for Training Recovery Duration.
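The formula can be expressed as a small calculation. The following is a minimal sketch, assuming T0 = Gf/2 as defined above. The class name, field names, and sample values are hypothetical and chosen only for illustration; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTimes:
    """Per-fault time components, all in seconds."""
    checkpoint_period_s: float  # Gf: checkpoint saving period
    reschedule_s: float         # T1: resource rescheduling time
    comm_init_s: float          # T2: collective communication initialization time
    checkpoint_load_s: float    # T3: checkpoint loading time
    compile_s: float            # T4: compilation time

    @property
    def rollback_s(self) -> float:
        # T0 = Gf / 2 (average training rollback time)
        return self.checkpoint_period_s / 2

    @property
    def startup_s(self) -> float:
        # Training startup time = T1 + T2 + T3 + T4
        return (self.reschedule_s + self.comm_init_s
                + self.checkpoint_load_s + self.compile_s)

    @property
    def total_s(self) -> float:
        # T = T0 + T1 + T2 + T3 + T4
        return self.rollback_s + self.startup_s


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    times = RecoveryTimes(checkpoint_period_s=3600, reschedule_s=30,
                          comm_init_s=20, checkpoint_load_s=90, compile_s=60)
    print(f"T0 = {times.rollback_s:.0f} s, startup = {times.startup_s:.0f} s, "
          f"T = {times.total_s:.0f} s")
```

In this hypothetical case the rollback term dominates, which suggests that shortening the checkpoint saving period can reduce T more than tuning the startup phases, at the cost of more frequent checkpoint writes.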

The time required in each phase depends on the parameter scale and the cluster size. Network and storage performance also affect the overall training loss time.

Reference for Training Recovery Duration

The following example uses a single-server, eight-processor GPT-3 training job under the PyTorch framework. The NFS storage has a write speed of 2.7 GB/s and a read speed of 4.8 GB/s, and the parameter size is 3B or 15B. (The fault handling mode is rescheduling. If the graceful fault tolerance mode is used, you do not need to refer to this part.)
  • 3B parameters: As shown in Figure 2, the checkpoint flush time of the model is about 30 seconds. Resumable training takes less than 5 seconds in the device detection phase, less than 30 seconds in the device handling phase, and about 70 seconds in the training restart phase, of which checkpoint data loading takes about 3 seconds.
  • 15B parameters: As shown in Figure 3, the checkpoint flush time of the model is about 120 seconds. Resumable training takes less than 5 seconds in the device detection phase, less than 30 seconds in the device handling phase, and about 210 seconds in the training restart phase, of which checkpoint data loading takes about 90 seconds.
Figure 2 Time metrics of the model with 3B parameters
Figure 3 Time metrics of the model with 15B parameters
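The reference figures can be added up to give a rough per-fault recovery estimate. The following is a minimal sketch that uses only the phase durations quoted above. It treats the "less than" values as upper bounds and the "about" values as exact, which is an assumption of this sketch, and it excludes the rollback time T0. The dictionary and function names are illustrative.

```python
# Phase durations in seconds, taken from the reference values above.
# Assumption: "less than N" is treated as N, "about N" as N.
REFERENCE = {
    "3B": {"detection": 5, "handling": 30, "restart": 70, "checkpoint_load": 3},
    "15B": {"detection": 5, "handling": 30, "restart": 210, "checkpoint_load": 90},
}


def recovery_estimate_s(phases: dict) -> int:
    """Sum detection, handling, and restart; rollback time T0 is not included."""
    # checkpoint_load is already part of the restart phase, so it is not added again.
    return phases["detection"] + phases["handling"] + phases["restart"]


def checkpoint_load_share(phases: dict) -> float:
    """Fraction of the restart phase spent loading checkpoint data."""
    return phases["checkpoint_load"] / phases["restart"]


if __name__ == "__main__":
    for name, phases in REFERENCE.items():
        print(f"{name}: about {recovery_estimate_s(phases)} s to recover, "
              f"{checkpoint_load_share(phases):.0%} of restart spent loading the checkpoint")
```

Under these assumptions, the estimate is about 105 seconds for the 3B model and about 245 seconds for the 15B model, and checkpoint loading accounts for a much larger share of the restart phase at 15B than at 3B.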