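This capability is currently supported only with the process-level rescheduling and process-level online recovery features. It is enabled by default once the job has been adapted as described in "Configuring Process-Level Rescheduling" and "Configuring Process-Level Online Recovery".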
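With process-level rescheduling or process-level online recovery, to disable this capability and instead restore parameters by loading them from a CKPT in storage, modify the job YAML. The following example uses process-level rescheduling with parameter-plane parameter-transfer recovery disabled.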
...
metadata:
  labels:
    ...
    process-recover-enable: "on"
    fault-scheduling: "force"
    ...
  ...
  annotations:
    ...
    recover-strategy: "recover"   # Recovery strategies available to the job (retry: process-level online recovery; recover: process-level rescheduling; dump: save last words; exit: exit training). The four strategies can be combined freely, separated by commas.
    ...
...
spec:
  replicaSpecs:
    Master:
      template:
        spec:
          containers:
          - name: ascend # do not modify
            env:
              - name: PROCESS_RECOVER   # This environment variable must be injected to enable process-level rescheduling
                value: "on"
            args:
              - |
                ...
                export ELASTIC_PROCESS_RECOVER_ENABLE=1;
                ...
                bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                  ...
                  --enable-high-availability \
                  --enable-worker-reboot \
                  --distributed-optimizer-no-replica \
                  ...
    Worker:
      template:
        spec:
          containers:
          - name: ascend # do not modify
            env:
              - name: PROCESS_RECOVER   # This environment variable must be injected to enable process-level rescheduling
                value: "on"
            args:
              - |
                ...
                export ELASTIC_PROCESS_RECOVER_ENABLE=1;
                ...
                bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                  ...
                  --enable-high-availability \
                  --enable-worker-reboot \
                  --distributed-optimizer-no-replica \
                  ...
...
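distributed-optimizer-no-replica: switch that lets data repair use periodic CKPTs. Disabled by default. When set, the replica optimizer keeps no replica, which reduces memory usage, and in process-level rescheduling and process-level online recovery scenarios the periodic CKPT is used for repair. This switch requires process-level rescheduling or process-level online recovery to be enabled.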
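The two rules stated above (recover-strategy is a comma-separated combination of four known values, and --distributed-optimizer-no-replica requires process-level rescheduling or process-level online recovery) can be checked before a job is submitted. The following is a minimal sketch, not part of the product: the function names are hypothetical, and it assumes that the presence of "retry" or "recover" in recover-strategy is an adequate proxy for the corresponding feature being enabled. A real job also needs the labels and environment variables shown in the example.

```python
from __future__ import annotations

# Hypothetical pre-submission check; these helpers are illustrative only.
KNOWN_STRATEGIES = {"retry", "recover", "dump", "exit"}
NO_REPLICA_FLAG = "--distributed-optimizer-no-replica"


def parse_strategies(annotation: str) -> list[str]:
    """Split a comma-separated recover-strategy value and reject unknown names."""
    strategies = [s.strip() for s in annotation.split(",") if s.strip()]
    unknown = [s for s in strategies if s not in KNOWN_STRATEGIES]
    if unknown:
        raise ValueError(f"unknown recover-strategy value(s): {unknown}")
    return strategies


def check_no_replica(annotation: str, train_args: str) -> bool:
    """Return True if the no-replica flag is used consistently.

    Assumption: "retry" (process-level online recovery) or "recover"
    (process-level rescheduling) in the strategy list means the feature
    that the flag depends on is enabled.
    """
    strategies = parse_strategies(annotation)
    uses_flag = NO_REPLICA_FLAG in train_args.split()
    recovery_enabled = any(s in ("retry", "recover") for s in strategies)
    # The flag is fine when absent, or when a supporting strategy is present.
    return (not uses_flag) or recovery_enabled
```

For example, check_no_replica("recover", "pretrain_gpt.py --distributed-optimizer-no-replica") passes, while the same arguments with recover-strategy set to "dump,exit" fail the check.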