Configuring Periodic Checkpoint Saving
This section describes how to save checkpoints periodically. For more details, see Saving Checkpoints Periodically.
Loading the Saved Checkpoints
Checkpoints are loaded from storage using the loading interface provided by an AI framework. To do this, specify the path of the file to be loaded. The following example demonstrates this process using the MindSpeed-LLM framework.
Add the --load argument to the training command of both the Master and Worker containers in the job YAML file to enable checkpoint loading. --load specifies the checkpoint path and controls whether the training process is recovered: when it is set, training process recovery takes effect.
...
spec:
  replicaSpecs:
    Master:
      template:
        spec:
          containers:
          - name: ascend # do not modify
            args:
            - |
              bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                ...
                --load /data/ckpt/XXX \ # Checkpoint file path
                ...
    Worker:
      template:
        spec:
          containers:
          - name: ascend # do not modify
            ...
            args:
            - |
              ...
              bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                ...
                --load /data/ckpt/XXX \ # Checkpoint file path
                ...
...
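If the checkpoint path may not exist yet (for example, on the first run of a job), the startup script can check the directory before passing --load. The following is a minimal sketch, not part of MindSpeed-LLM: build_load_arg is a hypothetical helper, and it assumes the checkpoint directory follows the Megatron-LM layout, in which a latest_checkpointed_iteration.txt tracker file records the most recently saved iteration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not provided by MindSpeed-LLM): print "--load <dir>"
# only when <dir> looks like a usable checkpoint directory; otherwise print
# nothing so that the training command starts from scratch.
build_load_arg() {
    local ckpt_dir="$1"
    # Assumption: Megatron-LM style tracker file written at each save.
    local tracker="${ckpt_dir}/latest_checkpointed_iteration.txt"
    if [ -d "${ckpt_dir}" ] && [ -s "${tracker}" ]; then
        printf -- '--load %s' "${ckpt_dir}"
    fi
}
```

A startup script could then append $(build_load_arg "<checkpoint path>") to the training command in place of a hard-coded --load. The expansion is left unquoted so that the option and its value become two separate arguments, which means the checkpoint path must not contain spaces.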