Saving Dying Gasp Checkpoints

Asynchronous checkpoint saving shortens the checkpoint interval and reduces the training progress lost to a fault, but it still incurs overhead, which makes it difficult to bring the fault-related loss below one second. MindCluster introduces dying gasp checkpoint saving, which preserves the initial parameter state of the current step when a fault occurs and thereby reduces the state loss to less than one step.

MindCluster MindIO Try To Persist (MindIO TTP for short) provides the dying gasp checkpoint capability, enabling users to preserve dying gasp checkpoints when a fault occurs.

For details about how to save dying gasp checkpoints, see Fault Recovery Acceleration.

For details about how to configure dying gasp checkpoint saving, see Configuring Dying Gasp Checkpoint Saving.

Function Adaptation Points

To support dying gasp checkpoints, the framework initializes the MindIO service at startup. Once the service is running, the optimizer reports its update status to MindIO. DP replica groups and optimizer replicas are then created to keep redundant backups of the model parameters. When an exception occurs, a decorator captures the fault mode, operator resources are cleared, and the dying gasp checkpoint is saved from the replicas.
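
The fault-handling part of this flow can be sketched in plain Python. This is a minimal illustration only, not the MindIO TTP API: the FaultHandler class, its register and capture methods, and the callback names "clear" and "save" are all hypothetical names introduced here. It shows a decorator that wraps the train function, and on an exception runs a resource-clearing callback followed by a checkpoint-saving callback before re-raising.

```python
import functools


class FaultHandler:
    """Hypothetical stand-in for a fault-handling service (not the MindIO TTP API)."""

    def __init__(self):
        self._callbacks = {}

    def register(self, name, fn):
        # Store a callback under a name such as "clear" or "save".
        self._callbacks[name] = fn

    def capture(self, train_fn):
        """Decorator: on a fault, clear resources, save a checkpoint, then re-raise."""
        @functools.wraps(train_fn)
        def wrapper(*args, **kwargs):
            try:
                return train_fn(*args, **kwargs)
            except RuntimeError as exc:
                self._callbacks["clear"]()    # clear operator resources first
                self._callbacks["save"](exc)  # then save the dying gasp checkpoint
                raise
        return wrapper


events = []
handler = FaultHandler()
handler.register("clear", lambda: events.append("clear"))
handler.register("save", lambda exc: events.append(f"save:{exc}"))


@handler.capture
def train(fail_at_step, total_steps=5):
    # Toy training loop that raises at a chosen step to simulate a fault.
    for step in range(total_steps):
        if step == fail_at_step:
            raise RuntimeError(f"fault at step {step}")
    return total_steps
```

Calling train(2) in this sketch raises the simulated fault after both callbacks have run in order, while train(-1) completes normally without invoking them. A real integration would register the callbacks provided for the framework in use instead of these lambdas.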

For non-MindSpeed-LLM users, adapt the following functions as listed in Table 1.

Table 1 Functions adapted for dying gasp checkpoint saving

| Function | Description | Reference Link |
| --- | --- | --- |
| Service startup during initialization | The MindIO service is started when the training framework is initialized. | Adapting to non-MindSpeed-LLM Framework |
| Optimizer update status reporting | The start and end of each optimizer update are reported to MindIO. | Adapting to non-MindSpeed-LLM Framework |
| DP replica group creation | The creation logic for dp_cp/dp_ep replica groups and gloo groups is added. The replica groups are created after the native Megatron distributed parallel groups. | Adapting to non-MindSpeed-LLM Framework |
| Optimizer replica | The native Megatron optimizer is inherited and extended with MindIO optimizer replica management logic. | Adapting to non-MindSpeed-LLM Framework |
| Exception capture decorator | A decorator is applied to the train function to capture fault modes. | Adapting to non-MindSpeed-LLM Framework |
| Operator resource clearing | A callback function clears operator resources and restores the operator delivery capability. | Adapting to non-MindSpeed-LLM Framework |
| Dying gasp checkpoint | Dying gasp checkpoints are saved through a callback function and the optimizer replica dump method. | Adapting to non-MindSpeed-LLM Framework |
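
Why the optimizer update status is reported, and why a replica is needed for the dump, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the ReplicatedOptimizerState class, its update and dump methods, and the two-copy layout are hypothetical and do not reflect the actual MindIO or Megatron implementation. The idea shown is that a rank marked as "updating" when a fault hits may hold a half-applied state, so the dump is taken from a replica that is not in the middle of an update.

```python
import copy


class ReplicatedOptimizerState:
    """Hypothetical sketch: two ranks hold copies of the same optimizer state."""

    def __init__(self, params):
        self.copies = [copy.deepcopy(params), copy.deepcopy(params)]
        self.updating = [False, False]  # reported update status per rank

    def update(self, rank, delta, fail_midway=False):
        self.updating[rank] = True       # report "update started"
        state = self.copies[rank]
        for i, key in enumerate(list(state)):
            if fail_midway and i == 1:
                # Simulated fault: the first entry is updated, the rest are not.
                raise RuntimeError(f"fault on rank {rank} during update")
            state[key] += delta
        self.updating[rank] = False      # report "update finished"

    def dump(self):
        """Return (rank, state) from the first replica not caught mid-update."""
        for rank, busy in enumerate(self.updating):
            if not busy:
                return rank, copy.deepcopy(self.copies[rank])
        raise RuntimeError("no consistent replica available")


opt = ReplicatedOptimizerState({"w": 1.0, "b": 0.0})
try:
    opt.update(0, 0.5, fail_midway=True)  # rank 0 faults mid-update
except RuntimeError:
    pass
rank, state = opt.dump()                  # taken from rank 1, still consistent
```

In this sketch rank 0 is left with a partially updated state and its status stays at "updating", so the dump falls back to rank 1, which still holds the state from the start of the step.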