Configuring Process-Level Rescheduling

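This section walks through the key steps for configuring process-level rescheduling. For the feature description, usage constraints, supported product models, and working principles, see Process-Level Rescheduling.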

Building the Image

Build the container image with a Dockerfile, adding the following commands.

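# MindCluster adaptation for loss-free resumable training.
# MINDX_ELASTIC_PKG is the path to the Elastic Agent whl package and MINDIO_TTP_PKG is the path to the MindIO whl package; set both to match your environment.
RUN pip3 install $MINDX_ELASTIC_PKG
RUN pip3 install $MINDIO_TTP_PKG

# Optional in general; required when using graceful fault tolerance, Pod-level rescheduling, or process-level rescheduling
RUN sed -i '/import logging/i import mindx_elastic.api' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py

# Optional; required when using process-level rescheduling with the MindSpore framework
RUN pip install $TASKD_WHL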
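
The sed command above inserts `import mindx_elastic.api` on the line before `import logging` in torch's run.py. The following is a minimal sketch of that insertion run against a throwaway file, so the effect can be inspected without touching a real PyTorch installation. The function name `patch_run_py` and the idempotency check are assumptions added for illustration and are not part of the original Dockerfile. It relies on GNU sed's one-line form of the `i` command.

```shell
# Sketch: apply the same sed insertion to a temporary file instead of the real run.py.
# patch_run_py is a hypothetical helper name used only in this example.
patch_run_py() {
    target="$1"
    # Skip the patch if the import is already there, so repeated builds do not duplicate it.
    if grep -q '^import mindx_elastic.api$' "$target"; then
        return 0
    fi
    # GNU sed: insert the import on the line before "import logging".
    sed -i '/import logging/i import mindx_elastic.api' "$target"
}

demo_file="$(mktemp)"
printf 'import os\nimport logging\nimport sys\n' > "$demo_file"
patch_run_py "$demo_file"
patch_run_py "$demo_file"   # the second call changes nothing
cat "$demo_file"
```

After the run, the new import sits directly above `import logging` and appears only once.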

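Preparing the Job YAML

Add the following fields to the job YAML to enable process-level rescheduling. process-recover-enable is the master switch for training-process recovery; recovery takes effect only when it is turned on. recover-strategy specifies the strategy used for training-process recovery, and the value recover enables process-level recovery.

Process-level rescheduling currently supports the following two methods. Choose one of them according to your scenario.

  • Method 1: after a fault, migrate the faulty Pod to a healthy node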
    ...  
    metadata: 
       labels:  
         ...  
         process-recover-enable: "on"  
         fault-scheduling: "force"
     ... 
    ...  
       annotations:  
         ...  
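         recover-strategy: "recover"   # Recovery strategies available to the job (retry: process-level online recovery; recover: process-level rescheduling; recover-in-place: process-level in-place recovery; dump: save "last words"; exit: exit training). The 5 strategies can be combined freely, separated by commas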
     ... 
    ...
    spec:
      replicaSpecs:
        Master:
          template:
            spec:
              containers:
              - name: ascend       # do not modify
                env:
                  - name: PROCESS_RECOVER         # Inject this environment variable to enable process-level rescheduling (the actual strategy is set by recover-strategy)
                    value: "on"
                args:
                  - | 
                    ...
                    export ELASTIC_PROCESS_RECOVER_ENABLE=1;
                    ... 
                    bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                      ...
                      --enable-high-availability \
                      --enable-worker-reboot \
                      ...
        Worker:
          template:
            spec:
              containers:
              - name: ascend # do not modify
                env:
                  - name: PROCESS_RECOVER         # Inject this environment variable to enable process-level rescheduling (the actual strategy is set by recover-strategy)
                    value: "on"
                args:
                  - |
                    ...
                    export ELASTIC_PROCESS_RECOVER_ENABLE=1;
                    ...
                    bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                      ...
                      --enable-high-availability \
                      --enable-worker-reboot \
                      ...
    ...
  • Method 2: after a fault, keep the faulty Pod in place and restart only the faulty process
    ...  
    metadata: 
       labels:  
         ...  
         process-recover-enable: "on"  
         fault-scheduling: "force"
     ... 
    ...  
       annotations:  
         ...  
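         recover-strategy: "recover-in-place"   # Recovery strategies available to the job (retry: process-level online recovery; recover: process-level rescheduling; recover-in-place: process-level in-place recovery; dump: save "last words"; exit: exit training). The 5 strategies can be combined freely, separated by commas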
     ... 
    ...
    spec:
      replicaSpecs:
        Master:
          template:
            spec:
              containers:
              - name: ascend # do not modify
                env:
                  - name: PROCESS_RECOVER         # Inject this environment variable to enable process-level rescheduling (the actual strategy is set by recover-strategy)
                    value: "on"
                args:
                  - | 
                    ...
                    export ELASTIC_PROCESS_RECOVER_ENABLE=1;
                    export ENABLE_RESTART_FAULT_PROCESS=on;
                    ... 
                    bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                      ...
                      --enable-high-availability \
                      --enable-worker-reboot \
                      ...
        Worker:
          template:
            spec:
              containers:
              - name: ascend # do not modify
                env:
                  - name: PROCESS_RECOVER         # Inject this environment variable to enable process-level rescheduling (the actual strategy is set by recover-strategy)
                    value: "on"
                args:
                  - |
                    ...
                    export ELASTIC_PROCESS_RECOVER_ENABLE=1;
                    export ENABLE_RESTART_FAULT_PROCESS=on;
                    ...
                    bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                      ...
                      --enable-high-availability \
                      --enable-worker-reboot \
                      ...
    ...
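
The recover-strategy annotation accepts a comma-separated combination of the five strategies listed in the comments above. The following is a minimal sketch of a pre-submission check for such a value. The function name `validate_recover_strategy` is hypothetical, and the sketch assumes entries are written without spaces around the commas; it is not part of MindCluster.

```shell
# Sketch: check a recover-strategy value before putting it in the job YAML.
# validate_recover_strategy is a hypothetical helper, not a MindCluster command.
# Returns 0 when every comma-separated entry is one of the five documented strategies.
validate_recover_strategy() {
    strategy_list="$1"
    # An empty value is treated as invalid.
    [ -n "$strategy_list" ] || return 1
    old_ifs="$IFS"
    IFS=','
    for item in $strategy_list; do
        case "$item" in
            retry|recover|recover-in-place|dump|exit) ;;
            *) IFS="$old_ifs"; return 1 ;;
        esac
    done
    IFS="$old_ifs"
    return 0
}

for candidate in "recover" "retry,recover-in-place,dump" "recover,restart"; do
    if validate_recover_strategy "$candidate"; then
        echo "valid:   $candidate"
    else
        echo "invalid: $candidate"
    fi
done
```

The first two candidates pass; the third is rejected because restart is not among the listed strategies.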

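Adapting the Training Script

(Optional) In the shell script that launches training (for example, train_start.sh), you can export the environment variables and add the max_restarts and monitor_interval parameters, as in the following example.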

...
export ELASTIC_PROCESS_RECOVER_ENABLE=1;   # Required for process-level rescheduling
export ENABLE_RESTART_FAULT_PROCESS=on;    # Required for process-level in-place recovery; not needed when the recovery strategy is recover

...
   logger "server id is: ""${server_id}" 
   if [ "${framework}" == "PyTorch" ]; then 
     get_env_for_pytorch_multi_node_job 
     DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT  --monitor_interval 10"
 ...
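
The relationship between the recovery strategy and the two environment variables can be sketched as follows. This is an illustration under stated assumptions, not the shipped train_start.sh: `RECOVER_STRATEGY`, `MAX_RESTARTS`, `set_recover_env`, and `append_restart_args` are hypothetical names, the default of 3 restarts and the `--nproc_per_node 8 --nnodes 2` values are placeholders, and `--max_restarts` is assumed to be accepted by the launcher in the same way as `--monitor_interval`.

```shell
# Sketch only: RECOVER_STRATEGY and MAX_RESTARTS are hypothetical names for this example.
# In the documented flow the strategy comes from the recover-strategy annotation.
RECOVER_STRATEGY="${RECOVER_STRATEGY:-recover}"
MAX_RESTARTS="${MAX_RESTARTS:-3}"   # placeholder default, not a recommended value

set_recover_env() {
    # Needed for process-level rescheduling in every case.
    export ELASTIC_PROCESS_RECOVER_ENABLE=1
    # Needed only when the strategy list contains recover-in-place.
    case ",$1," in
        *,recover-in-place,*) export ENABLE_RESTART_FAULT_PROCESS=on ;;
        *) unset ENABLE_RESTART_FAULT_PROCESS ;;
    esac
}

# Appends the restart-related launcher options to an existing argument string.
append_restart_args() {
    printf '%s --max_restarts %s --monitor_interval %s' "$1" "$2" "$3"
}

set_recover_env "$RECOVER_STRATEGY"
DISTRIBUTED_ARGS="$(append_restart_args "--nproc_per_node 8 --nnodes 2" "$MAX_RESTARTS" 10)"
echo "$DISTRIBUTED_ARGS"
```

Matching against `",$1,"` keeps the check correct when recover-in-place appears anywhere in a combined list, while a plain recover value does not trigger the in-place variable.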