Configuring Periodic Checkpoint Saving

This section describes how to save checkpoints periodically. For more details, see Saving Checkpoints Periodically.

Loading the Saved Checkpoints

Checkpoints are loaded from storage using the loading interface provided by an AI framework. To do this, specify the path of the file to be loaded. The following example demonstrates this process using the MindSpeed-LLM framework.

Add the --load field to the job YAML file, in both the Master and Worker containers, to enable checkpoint loading. --load controls whether training is recovered: once it is set to the checkpoint file path, training process recovery takes effect.

...
spec:
  replicaSpecs:
    Master:
      template:
        spec:
          containers:
          - name: ascend # do not modify
            args:
              - |
                bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                  ...
                  --load /data/ckpt/XXX \  # Checkpoint file path
                  ...
    Worker:
      template:
        spec:
          containers:
          - name: ascend # do not modify
            ...
            args:
              - |
                ...
                bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                  ...
                  --load /data/ckpt/XXX \  # Checkpoint file path
                  ...
...
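
Because --load only takes effect when the path it points to can actually be read, it can help to check the path before starting the job. The following is a minimal sketch of such a check, not part of MindSpeed-LLM. The function name check_ckpt_dir is hypothetical, and the tracker file name latest_checkpointed_iteration.txt is an assumption based on Megatron-style checkpoint layouts; adjust or remove that part if your checkpoints are organized differently.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the path passed to --load.
# Not part of MindSpeed-LLM; a sketch under the assumptions stated above.

# check_ckpt_dir DIR
# Returns 0 if DIR is an existing, non-empty directory; returns 1 otherwise.
check_ckpt_dir() {
  local dir="$1"

  if [ ! -d "$dir" ]; then
    echo "checkpoint path is not a directory: $dir" >&2
    return 1
  fi

  # An empty directory contains nothing to load.
  if [ -z "$(ls -A "$dir")" ]; then
    echo "checkpoint directory is empty: $dir" >&2
    return 1
  fi

  # Assumption: Megatron-style layouts record the latest saved iteration
  # in this tracker file. Its absence is reported as a warning only.
  local tracker="$dir/latest_checkpointed_iteration.txt"
  if [ -f "$tracker" ]; then
    echo "latest iteration: $(cat "$tracker")"
  else
    echo "warning: tracker file not found in $dir" >&2
  fi

  return 0
}
```

To use it, source the script in an environment where the checkpoint storage is mounted and call, for example, check_ckpt_dir /data/ckpt/XXX with the same path given to --load. A non-zero return value indicates that the path is missing or empty, so recovery would have nothing to load.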