Parameter Passing Recovery on the Parameter Plane
Currently, this capability is available only with the process-level rescheduling and process-level online recovery features. Once a job has been adapted for these features, the capability is enabled by default.
(Optional) Disabling Parameter Passing Recovery on the Parameter Plane
To disable this function for process-level rescheduling or process-level online recovery and load parameters from the checkpoint in storage instead, modify the job YAML file. The following example uses process-level rescheduling with parameter passing recovery on the parameter plane disabled.
...
metadata:
  labels:
    ...
    fault-scheduling: "grace"
    ...
  ...
  annotations:
    ...
    recover-strategy: "recover"   # Recovery policy. recover indicates that process-level rescheduling is enabled.
    ...
  ...
spec:
  replicaSpecs:
    Master:
      template:
        spec:
          containers:
          - name: ascend   # do not modify
            ...
            args:
            - |
              ...
              bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                ...
                --distributed-optimizer-no-replica \
                ...
    Worker:
      template:
        spec:
          containers:
          - name: ascend   # do not modify
            ...
            args:
            - |
              ...
              bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                ...
                --distributed-optimizer-no-replica \
                ...
...
The --distributed-optimizer-no-replica argument specifies whether periodic checkpoints are used for data repair. It is disabled by default. When it is enabled, the optimizer keeps no replicas, which reduces memory usage, and process-level rescheduling and process-level online recovery use the periodic checkpoint for repair. Enable this argument only when process-level rescheduling or process-level online recovery is enabled.
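Because the argument appears in the start command of both the Master and the Worker replica in the example, it is easy to add it to one and miss the other. The following Python sketch checks an already parsed job for that. It is an illustration only: the helper name replicas_missing_flag and the trimmed example_job dictionary are hypothetical, and the sketch assumes the job YAML has been loaded into nested dictionaries with the same layout as the example above.

```python
FLAG = "--distributed-optimizer-no-replica"


def replicas_missing_flag(job, flag=FLAG):
    """Return the replica types whose container args do not mention the flag.

    `job` is assumed to be the job YAML parsed into nested dicts and lists.
    """
    missing = []
    replica_specs = job.get("spec", {}).get("replicaSpecs", {})
    for replica_type, replica in replica_specs.items():
        containers = (
            replica.get("template", {}).get("spec", {}).get("containers", [])
        )
        # A replica passes if any args entry of any container contains the flag.
        found = any(
            flag in arg
            for container in containers
            for arg in (container.get("args") or [])
        )
        if not found:
            missing.append(replica_type)
    return missing


# Hypothetical, heavily trimmed job used only to exercise the helper.
# The command strings are placeholders, not a real start command.
example_job = {
    "spec": {
        "replicaSpecs": {
            "Master": {
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "ascend",
                                "args": [
                                    "bash train.sh "
                                    "--distributed-optimizer-no-replica\n"
                                ],
                            }
                        ]
                    }
                }
            },
            "Worker": {
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "ascend", "args": ["bash train.sh\n"]}
                        ]
                    }
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(replicas_missing_flag(example_job))
```

For the trimmed job above, the helper reports the Worker replica, because only the Master command contains the argument. An empty list means every replica type passes the argument.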