
Adaptation Examples

This section uses a single Atlas 800T A2 training server as an example to guide you step by step through adapting a training job for resumable training.

PyTorch Adaptation Example (Based on MindSpeed-LLM)

For preparing the training code and dataset, refer to the MindSpeed-LLM user guide.

  1. Prepare the MindSpeed-LLM training code with the following script. An optional check of the resulting directory layout follows the script.
    mkdir -p /data/atlas_dls/public/code
    cd /data/atlas_dls/public/code
    git clone https://gitee.com/ascend/MindSpeed-LLM.git 
    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    git checkout core_r0.6.0
    cp -r megatron ../MindSpeed-LLM/
    cd ..
    cd MindSpeed-LLM
    git checkout 1.0.0
     
    ## Prepare the MindSpeed source code. It was already prepared when the image was built, so it can be copied directly or cloned again
    git clone https://gitee.com/ascend/MindSpeed.git
    cd MindSpeed
    git checkout 969686ff
    cd ..
    
    ## Create the directories needed in later steps
    mkdir alllogs
    mkdir dataset
    mkdir scripts
    mkdir yamls
    mkdir output
     
    ## Rename MindSpeed-LLM to LLAMA2_for_PyTorch_2.1_code
    cd ..
    mv MindSpeed-LLM LLAMA2_for_PyTorch_2.1_code
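    As an optional sanity check (listing only the directories created by the commands above), the prepared code directory can be inspected:
    ls /data/atlas_dls/public/code/LLAMA2_for_PyTorch_2.1_code
    # Besides the repository's own files, expect to see: megatron  MindSpeed  alllogs  dataset  scripts  yamls  output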
  2. Prepare the llama2-7b model tokenizer (vocabulary) files by running the following script.
    cd LLAMA2_for_PyTorch_2.1_code
    mkdir ./dataset/llama-2-7b-hf/
    cd ./dataset/llama-2-7b-hf/
    # The files can be downloaded directly from the web page or via the command line
    wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/tokenizer.json 
    wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/tokenizer.model 
    wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/tokenizer_config.json 
    wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/config.json
    wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/generation_config.json
    wget https://huggingface.co/daryl149/llama-2-7b-hf/resolve/main/special_tokens_map.json
  3. Prepare the llama2-7b dataset for LLAMA2 yourself. This example uses enwiki20230101; run the following script.
    ## Prepare the dataset
    cd LLAMA2_for_PyTorch_2.1_code/dataset/
    # The files can be downloaded directly from the web page or via the command line
    wget https://huggingface.co/datasets/lsb/enwiki20230101/resolve/main/data/train-00000-of-00042-d964455e17e96d5a.parquet
  4. Preprocess the dataset by running the following script. A check of the generated output files follows at the end of this step.
    ## Pretraining dataset preprocessing. This step starts the mindspeed-dl:v1 image, mounts the LLAMA2_for_PyTorch_2.1_code directory prepared in step 1, and runs inside the container
    docker run -it -v /data/atlas_dls/public/code/LLAMA2_for_PyTorch_2.1_code:/home/LLAMA2_for_PyTorch_2.1_code -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -e ASCEND_VISIBLE_DEVICES=0-7 mindspeed-dl:v1 /bin/bash
    ## Run the following commands inside the container to preprocess the dataset
    cd /home/LLAMA2_for_PyTorch_2.1_code
    python ./preprocess_data.py \
        --input ./dataset/train-00000-of-00042-d964455e17e96d5a.parquet \
        --tokenizer-name-or-path ./dataset/llama-2-7b-hf \
        --tokenizer-type PretrainedFromHF \
        --handler-name GeneralPretrainHandler \
        --output-prefix ./dataset/enwiki \
        --json-keys text \
        --workers 8 \
        --log-interval 1000

    If an error related to the silu function occurs, comment out @jit_fuser in LLAMA2_for_PyTorch_2.1_code/megatron/core/fusions/fused_bias_swiglu.py and run the command again.
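    Assuming the default Megatron-style output naming of preprocess_data.py, the options --output-prefix ./dataset/enwiki and --json-keys text should produce an enwiki_text_document.bin/.idx pair, matching the enwiki_text_document name that DATA_PATH references later in train_start.sh. A minimal check inside the container:
    ls -lh ./dataset/enwiki_text_document.*
    # Expected: enwiki_text_document.bin and enwiki_text_document.idx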

  5. Go to the MindXDL-deploy repository, switch to the branch that matches your version according to the MindXDL-deploy version description, obtain the train_start.sh file from the samples/train/resumable-training/fault-tolerance/without-ranktable/pytorch/llama2 directory, and build the following directory structure on the management node.
    root@ubuntu:/data/atlas_dls/public/code/LLAMA2_for_PyTorch_2.1_code/scripts#
    scripts/
    └── train_start.sh
  6. Configure the training startup script train_start.sh and modify it according to your environment. A way to look up the host NIC name is shown after the notes below.
    # Enable process-level rescheduling, process-level online recovery, and last-breath checkpoint recovery on the Elastic Agent side
    export ELASTIC_PROCESS_RECOVER_ENABLE=1
    
    # NIC on the physical host that can communicate; configure it according to the high-speed NIC of the master node. If hostNetwork is set to false in the task YAML, set this to eth0. This example is based on the Atlas 800T A2 training server; adjust for other devices as needed
    export GLOO_SOCKET_IFNAME=enp189s0f0
    # If hostNetwork is set to false in the task YAML, set this to eth0. This example is based on the Atlas 800T A2 training server; adjust for other devices as needed
    export HCCL_SOCKET_IFNAME=enp189s0f0           
    
    # Set the checkpoint load directory. Note that checkpoints, log files, etc. should be mounted to the host
    LOAD_CHECKPOINT_PATH=/job/code/output/ckpt
    # Set the checkpoint save directory
    SAVE_CHECKPOINT_PATH=/job/code/output/ckpt 
     
    # Configure the dataset path
    DATA_PATH=/job/code/data/enwiki_text_document 
     
    # Configure the tokenizer (vocabulary) path
    TOKENIZER_MODEL=/job/code/model_from_hf/llama-2-7b-hf/tokenizer.model
     
    # (Optional) Customize the directory where Elastic Agent run logs are written
    mkdir -p /job/code/alllogs/$MINDX_TASK_ID/elasticlogs/elastic-log$XDL_IP-$RANK                # MINDX_TASK_ID is the training task ID; XDL_IP and RANK distinguish the Elastic Agent logs of different nodes
    export ELASTIC_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/elasticlogs/elastic-log$XDL_IP-$RANK               # Directory where the Elastic Agent component writes its run logs
    # (Optional) Use secure connections for gRPC communication between components
    export ELASTIC_GRPC_SECURE_CONNECT=on                  # Secure connection switch; "on" enables it
    export ELASTIC_GRPC_SECURE_CERTIFICATES_PATH=/usr/security/cert   # Security certificate path; replace /usr/security/cert with a valid certificate path
    • If the hostNetwork parameter in the training task YAML is set to false, change the value of export GLOO_SOCKET_IFNAME in train_start.sh to eth0 and keep the rest of the code unchanged. For example:
    export GLOO_SOCKET_IFNAME=eth0     # eth0 is the NIC that can communicate inside the container
    export HCCL_SOCKET_IFNAME=eth0
    • If there is no security requirement, the gRPC secure connection parameter ELASTIC_GRPC_SECURE_CONNECT does not need to be configured. If the secure connection switch is turned on, also configure the certificate path; if it is turned off, the certificate path is not required.
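    The value enp189s0f0 above is only an example for this server; GLOO_SOCKET_IFNAME and HCCL_SOCKET_IFNAME must name a NIC that actually exists and is reachable on the host (or eth0 inside the container when hostNetwork is false). As a minimal sketch, candidate interfaces and their IPv4 addresses can be listed with:
    # List interface names and IPv4 addresses to pick a usable NIC
    ip -o -4 addr show | awk '{print $2, $4}'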
  7. Configure a training task that covers multiple rescheduling levels. Obtain the training task YAML, which already has Pod-level rescheduling, process-level rescheduling, process-level online recovery, and related features configured. Then configure the server IP address of the mounted volume and other settings according to your environment; a way to locate these fields is shown at the end of this step.
    cd LLAMA2_for_PyTorch_2.1_code/yamls
    wget https://gitee.com/ascend/mindxdl-deploy/raw/master/samples/train/resumable-training/fault-tolerance/without-ranktable/pytorch/llama2/yamls/pytorch_multinodes_acjob_910b.yaml

    Process-level recovery features such as process-level rescheduling and process-level online recovery cannot be used together with graceful fault tolerance. For the graceful fault tolerance configuration procedure, see the graceful fault tolerance mode section.
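    The exact structure of the downloaded YAML depends on its version, so before editing it is worth locating the fields to change (for example the server IP address and path of the mounted volume, and the image name). A hedged sketch:
    grep -n -E 'server|path|image' pytorch_multinodes_acjob_910b.yaml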

  8. (Optional) To use process-level rescheduling or process-level online recovery, add the content shown below to LLAMA2_for_PyTorch_2.1_code/mindspeed_llm/training/training.py. After adding it, press "Esc", type :wq!, and press "Enter" to save and exit. A quick check of the change follows the code.

    This step is not required when the torch_npu version is 7.0.RC1 or later.

     ... 
     
    class CustomFunction(torch.autograd.Function): 
      @staticmethod 
      def forward(ctx, input): 
          torch.cuda.set_stream(torch.cuda.default_stream()) 
          return input 
     
      @staticmethod 
      def backward(ctx, grad): 
          torch.cuda.set_stream(torch.cuda.default_stream()) 
          return grad 
     
    def streamHandler(): 
        input_tensor = torch.empty(1, dtype=torch.float32, device="npu", requires_grad=True) 
        grad_tensor = torch.empty(1, dtype=torch.float32, device="npu", requires_grad=True) 
        output_tensor = CustomFunction.apply(input_tensor) 
        output_tensor.backward(grad_tensor) 
     
    def pretrain(train_valid_test_dataset_provider, 
    ... 
        if args.do_train and args.train_iters > 0: 
                if args.enable_high_availability: 
                    from mindio_ttp.adaptor import tft_register_processor, tft_train 
                    from mindio_ttp.framework_ttp import tft_register_set_stream_handler 
                    tft_register_set_stream_handler(streamHandler) 
                    tft_register_processor(train_valid_test_dataset_provider, model_provider, model_type) 
                    iteration, num_floating_point_operations_so_far = tft_train(train_args, test_data_iterator_list) 
                else: 
                    iteration, num_floating_point_operations_so_far = train(*train_args) 
    ...
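    To confirm the edit was saved, the newly added symbol can be searched for (an optional check, run from the LLAMA2_for_PyTorch_2.1_code directory):
    grep -n "tft_register_set_stream_handler" mindspeed_llm/training/training.py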
  9. (Optional) To use process-level rescheduling or process-level online recovery, add the content shown below to the LLAMA2_for_PyTorch_2.1_code/mindspeed_llm/training/initialize.py file.
    1. Open the LLAMA2_for_PyTorch_2.1_code/mindspeed_llm/training/initialize.py file.
      vim mindspeed_llm/training/initialize.py
    2. Press "i" to enter edit mode and add the code shown below, then press "Esc", type :wq!, and press "Enter" to save and exit. A quick check follows the code.
      # Go to the definition of the new_group_wrapper function and modify it as shown:
      def new_group_wrapper(fn):
          @wraps(fn)
          def wrapper(*args, **kwargs):
              backend = kwargs.get('backend', None)
              from mindio_ttp.adaptor import tft_is_arf_reboot_node
              if tft_is_arf_reboot_node() and isinstance(backend, str) and 'gloo' in backend:
                  return None
      
              if (backend is None) or torch.distributed.distributed_c10d._is_barrier_after_init():
                  kwargs['use_local_synchronization'] = True
      
              res = fn(*args, **kwargs)
              return res
      
          return wrapper
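      Similarly, an optional check that the edit is in place (run from the LLAMA2_for_PyTorch_2.1_code directory):
      grep -n "tft_is_arf_reboot_node" mindspeed_llm/training/initialize.py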

MindSpore Adaptation Example (Based on MindFormers)

For preparing the training code and dataset, refer to the MindFormers documentation.

  1. Prepare the mindformers code repository by running the following commands.
    mkdir -p /data/atlas_dls/public/code
    git clone https://gitee.com/mindspore/mindformers.git
    cd mindformers
    git checkout 9d40ae10b7cdf5f8ac0a7103c435b8fd6e59b999
    mkdir dataset
    mkdir yamls
    cd ..
     
    # Rename mindformers to LLAMA2_for_MS_code
    mv mindformers LLAMA2_for_MS_code
  2. Prepare the dataset. This example uses the 7b dataset; a check of the extracted files follows the commands.
    cd LLAMA2_for_MS_code/dataset
    wget https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/dataset/wikitext-2/wikitext-2-v1.zip
    wget https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/MindFormers/llama2/tokenizer.model
    unzip wikitext-2-v1.zip
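    Unpacking wikitext-2-v1.zip is expected to create a wikitext-2/ directory containing wiki.train.tokens, the file that llama_preprocess.py consumes in the next step; a quick check:
    ls -lh wikitext-2/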
  3. Preprocess the dataset. A check of the generated MindRecord files appears at the end of this step.
    ## Pretraining dataset preprocessing. This step starts the mindformers-dl:v1 image, mounts the LLAMA2_for_MS_code directory prepared in step 1, and runs inside the container
    docker run -it -v /data/atlas_dls/public/code/LLAMA2_for_MS_code:/home/LLAMA2_for_MS_code -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -e ASCEND_VISIBLE_DEVICES=0-3 mindformers-dl:v1 /bin/bash
    ## Run the following commands inside the container to preprocess the dataset
    cd /home/LLAMA2_for_MS_code/mindformers/tools/dataset_preprocess/llama/
    python llama_preprocess.py \
    --dataset_type wiki \
    --input_glob /home/LLAMA2_for_MS_code/dataset/wikitext-2/wiki.train.tokens \
    --model_file /home/LLAMA2_for_MS_code/dataset/tokenizer.model \
    --seq_length 4096 \
    --output_file /home/LLAMA2_for_MS_code/dataset/wiki4096.mindrecord

    If the following error occurs while performing the steps above: from mindformers.tools.dataset_preprocess.llama.conversation import get_default_conv_template ModuleNotFoundError: No module named 'mindformers.tools.dataset_preprocess', handle it as follows.

    1. Run the following command to start the container.
      docker run -it -v /data/atlas_dls/public/code/LLAMA2_for_MS_code:/home/LLAMA2_for_MS_code -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -e ASCEND_VISIBLE_DEVICES=0-3 mindformers-dl:v1 /bin/bash
    2. In /home/LLAMA2_for_MS_code/mindformers/tools/dataset_preprocess/llama/llama_preprocess.py, change the string from mindformers.tools.dataset_preprocess.llama.conversation import get_default_conv_template to from conversation import get_default_conv_template.
    3. Run pip uninstall mindformers to uninstall mindformers.
    4. Run bash build.sh in the LLAMA2_for_MS_code directory to reinstall MindFormers.
    5. Run the data preprocessing commands again.
      cd /home/LLAMA2_for_MS_code/mindformers/tools/dataset_preprocess/llama/ 
      python llama_preprocess.py \ 
      --dataset_type wiki \ 
      --input_glob /home/LLAMA2_for_MS_code/dataset/wikitext-2/wiki.train.tokens \ 
      --model_file /home/LLAMA2_for_MS_code/dataset/tokenizer.model \ 
      --seq_length 4096 \ 
      --output_file /home/LLAMA2_for_MS_code/dataset/wiki4096.mindrecord
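      Whether or not the workaround above was needed, preprocessing should leave the MindRecord output in the dataset directory (MindRecord normally writes a .db index file alongside the data file); a quick check:
      ls -lh /home/LLAMA2_for_MS_code/dataset/wiki4096.mindrecord*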
  4. Edit the startup script LLAMA2_for_MS_code/scripts/msrun_launcher.sh to configure log paths, communication NICs, and other settings.
    #!/bin/bash
    # Copyright 2024 Huawei Technologies Co., Ltd
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    # ============================================================================
    
    # msrun Default Parameters
    
    # Configure the log levels and paths below as needed
    export HCCL_ASYNC_ERROR_HANDLING=0
    # export ASCEND_GLOBAL_LOG_LEVEL=1
    export LOGLEVEL=DEBUG
    export GLOG_v=2
    export GLOG_log_dir=/job/code/alllogs/${MINDX_TASK_ID}/traininglog/msrun
    # export HCCL_ENTRY_LOG_ENABLE=1
    export HCCL_CONNECT_TIMEOUT=600
    
    # NIC on the physical host that can communicate; configure it according to the high-speed NIC of the master node. If hostNetwork is set to false in the task YAML, set this to eth0. This example is based on the Atlas 800T A2 training server; adjust for other devices as needed
    export GLOO_SOCKET_IFNAME=enp189s0f0 
    # If hostNetwork is set to false in the task YAML, set this to eth0. This example is based on the Atlas 800T A2 training server; adjust for other devices as needed
    export HCCL_SOCKET_IFNAME=enp189s0f0 
    # Starting port for collective communication; set it to avoid port conflicts
    export HCCL_IF_BASE_PORT=64000   
    
    export PROCESS_RECOVER="on"      # Elastic Agent switch for process-level rescheduling and process-level online recovery
    export ELASTIC_PROCESS_RECOVER_ENABLE=1        # Enable this variable so that taskd can communicate with clusterd
    export MINDIO_FOR_MINDSPORE=1                  # Enable MindIO in the MindSpore scenario
    export MS_ENABLE_TFT='{TTP:1,UCE:1,ARF:1}'     # Enable last-breath checkpoint (TTP), process-level online recovery (UCE), and process-level rescheduling (ARF)
    export MS_TFT_IP=$MS_SCHED_HOST                # Address of the MindIO controller used by MindSpore
    export MS_TFT_PORT=8000                        # Port of the MindIO controller used by MindSpore
    
    # Create per-task log directories, grouped by task ID
    mkdir -p /job/code/alllogs/${MINDX_TASK_ID}
    mkdir -p /job/code/alllogs/${MINDX_TASK_ID}/traininglog/log-print/
    export LOG_MF_PATH=/job/code/alllogs/${MINDX_TASK_ID}/traininglog/mf/log$MF_LOG_SUFFIX
    # Add the suffix to the msrun_log
    LOG_DIR=/job/code/alllogs/${MINDX_TASK_ID}/traininglog/log-output/$MF_LOG_SUFFIX
    export ASCEND_PROCESS_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/plogs/$MS_NODE_RANK     # Set the plog output path
    export TTP_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/ttplogs/$MS_NODE_RANK        
    export TRAIN_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/trainlogs/$MS_NODE_RANK
    
    source /usr/local/Ascend/ascend-toolkit/set_env.sh
    
    WORKER_NUM=$MS_WORKER_NUM
    LOCAL_WORKER=$MS_LOCAL_WORKER
    MASTER_ADDR=$MS_SCHED_HOST
    MASTER_PORT=$MS_SCHED_PORT
    NODE_RANK=$MS_NODE_RANK
    JOIN="True"
    CLUSTER_TIME_OUT=7200
    # export HCCL_BUFFSIZE=2 # HCCL memory usage
    
    # Set PYTHONPATH
    MF_SCRIPTS_ROOT=$(realpath "$(dirname "$0")")
    export PYTHONPATH=$MF_SCRIPTS_ROOT/../:$PYTHONPATH
    
    # Set the log suffix
    if [ -z "${MF_LOG_SUFFIX+x}" ] || [ "$MF_LOG_SUFFIX" == "" ]
    then
      MF_LOG_SUFFIX=$MF_LOG_SUFFIX
    else
      MF_LOG_SUFFIX=_$MF_LOG_SUFFIX
    fi
    
    # get the workspace path
    WORKSPACE_PATH=$(pwd)
    
    # Add the suffix to the MF_LOG
    
    # Set the PLOG path
    if [ -z "${PLOG_REDIRECT_TO_OUTPUT+x}" ] || [ $PLOG_REDIRECT_TO_OUTPUT == False ]
    then
      echo "No change the path of plog, the path of plog is /root/ascend"
    else
      export ASCEND_PROCESS_LOG_PATH=$WORKSPACE_PATH/output/plog$MF_LOG_SUFFIX
      echo "PLOG_REDIRECT_TO_OUTPUT=$PLOG_REDIRECT_TO_OUTPUT, set the path of plog to $ASCEND_PROCESS_LOG_PATH"
    fi
    
    if [ $# != 1 ] && [ $# != 2 ] && [ $# != 6 ] && [ $# != 9 ]
    then
      echo "Usage Help: bash msrun_launcher.sh [EXECUTE_ORDER] For Default 8 Devices In Single Machine"
      echo "Usage Help: bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] For Quick Start On Multiple Devices In Single Machine"
      echo "Usage Help: bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [MASTER_PORT] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT] For Multiple Devices In Single Machine"
      echo "Usage Help: bash msrun_launcher.sh [EXECUTE_ORDER] [WORKER_NUM] [LOCAL_WORKER] [MASTER_ADDR] [MASTER_PORT] [NODE_RANK] [LOG_DIR] [JOIN] [CLUSTER_TIME_OUT] For Multiple Devices In Multiple Machines"
      exit 1
    fi
    
    # Start Without Parameters For 8 Devices On Single Machine
    if [ $# == 1 ]
    then
      echo "No parameter is entered. Notice that the program will run on default 8 cards. "
      SINGLE_NODE=false
    else
      WORKER_NUM=$MS_LOCAL_WORKER
    fi
    
    # Check WORKER_NUM
    if [[ ! $WORKER_NUM =~ ^[0-9]+$ ]]; then
        echo "error: worker_num=$WORKER_NUM is not a number"
        exit 1
    fi
    
    # Quick Start For Multiple Devices On Single Machine
    if [ $# == 2 ]
    then
      LOCAL_WORKER=$WORKER_NUM
      SINGLE_NODE=true
    fi
    
    # Multiple Devices On Single Machine
    if [ $# == 6 ]
    then
      LOCAL_WORKER=$WORKER_NUM
      MASTER_PORT=$3
      LOG_DIR=$4
      JOIN=$5
      CLUSTER_TIME_OUT=$6
    
      SINGLE_NODE=true
    fi
    
    # Multiple Devices On Multiple Machine
    if [ $# == 9 ]
    then
      LOCAL_WORKER=$3
      MASTER_ADDR=$4
      MASTER_PORT=$5
      NODE_RANK=$6
      LOG_DIR=$7
      JOIN=$8
      CLUSTER_TIME_OUT=$9
    
      if [ $WORKER_NUM == $LOCAL_WORKER ]
      then
        echo "worker_num is equal to local_worker, Notice that task will run on single node."
        SINGLE_NODE=true
      else
        echo "worker_num=$WORKER_NUM, local_worker=$LOCAL_WORKER, \
         Please run this script on other nodes with different node_rank."
        SINGLE_NODE=false
      fi
    fi
    
    # Init msrun Command
    if [ $SINGLE_NODE == true ]
    then
      MSRUN_CMD="msrun --worker_num=$WORKER_NUM \
       --local_worker_num=$LOCAL_WORKER \
       --master_port=$MASTER_PORT \
       --log_dir=$LOG_DIR \
       --join=$JOIN \
       --cluster_time_out=$CLUSTER_TIME_OUT"
    else
      MSRUN_CMD="msrun --worker_num=$WORKER_NUM \
       --local_worker_num=$LOCAL_WORKER \
       --master_addr=$MASTER_ADDR \
       --master_port=$MASTER_PORT \
       --node_rank=$NODE_RANK \
       --log_dir=$LOG_DIR \
       --join=$JOIN \
       --cluster_time_out=$CLUSTER_TIME_OUT"
    fi
    
    EXECUTE_ORDER="$MSRUN_CMD $1 2>&1 |& tee  -a /job/code/alllogs/${MINDX_TASK_ID}/traininglog/log-print/node-$MS_NODE_RANK"
    
    
    ulimit -u unlimited
    
    echo "Running Command: $EXECUTE_ORDER"
    echo "Please check log files in ${WORKSPACE_PATH}/${LOG_DIR}"
    
    
    function check_return_code() {
        ret_code=$?
        if [[ ${ret_code} -ne 0 ]]; then
          logger "running job failed. exit code: ${ret_code}" | tee -a ${output_url}/log
          exit ${ret_code}
        fi
    }
    CKPT_PATH="./output/checkpoint"
    if [ -d "${CKPT_PATH}" ]
    then
        msrun --worker_num=$WORKER_NUM \
             --local_worker_num=$LOCAL_WORKER \
             --master_addr=$MASTER_ADDR \
             --master_port=$MASTER_PORT \
             --node_rank=$NODE_RANK \
             --log_dir=$LOG_DIR \
             --join=$JOIN \
             --cluster_time_out=$CLUSTER_TIME_OUT $1 --load_checkpoint="${CKPT_PATH}" --resume_training=true  2>&1  |& tee -a /job/code/alllogs/${MINDX_TASK_ID}/traininglog/log-print/node-$MS_NODE_RANK
    else  
         msrun --worker_num=$WORKER_NUM \
             --local_worker_num=$LOCAL_WORKER \
             --master_addr=$MASTER_ADDR \
             --master_port=$MASTER_PORT \
             --node_rank=$NODE_RANK \
             --log_dir=$LOG_DIR \
             --join=$JOIN \
             --cluster_time_out=$CLUSTER_TIME_OUT $1 --load_checkpoint="" --resume_training=false  2>&1  |& tee -a /job/code/alllogs/${MINDX_TASK_ID}/traininglog/log-print/node-$MS_NODE_RANK
    fi
    
    ST=${PIPESTATUS[0]}
    if [[ ${ST} -ne 0 ]]; then
        echo "process exit with exitcode:${ST}"
        logger "running job failed. exit code: $ret" | tee -a ${output_url}/log
        exit ${ST}
    fi
  5. Modify the model parameter configuration YAML. Open the LLAMA2_for_MS_code/configs/llama2/pretrain_llama2_7b.yaml file and configure the dataset path, distributed parallelism parameters, and other fields as required by your training job. (Note: to use the Pod-level rescheduling, process-level rescheduling, or process-level online recovery features, a saved checkpoint file must exist in the checkpoint directory at startup, and the training configuration YAML must set resume_training=true and load_checkpoint={checkpoint directory}.)
    @@ -15,7 +15,7 @@ trainer:
    
     # runner config
     runner_config:
    -  epochs: 2
    +  epochs: 200
       batch_size: 1
       sink_mode: True
       sink_size: 1
    @@ -88,13 +88,14 @@ parallel:
       parallel_optimizer_config:
         gradient_accumulation_shard: False
         parallel_optimizer_threshold: 64
    +    optimizer_weight_shard_size: 2
     # default parallel of device num = 8 for Atlas 800T A2
     parallel_config:
    -  data_parallel: 8
    +  data_parallel: 64
       model_parallel: 1
       pipeline_stage: 1
       use_seq_parallel: False
    -  micro_batch_num: 1
    +  micro_batch_num: 16
       vocab_emb_dp: True
       gradient_aggregation_group: 4
     # when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
    @@ -128,6 +129,8 @@ context:
       save_graphs: False
       save_graphs_path: "./graph"
       device_id: 0
    +  #ascend_config:
    +  #  parallel_speed_up_json_path: "/home/lby/workspace/mindformers/parallel_speed_up.json"
       graph_kernel_flags: "--disable_pass=cluster.floatstatus_fusion,preprocess.depend_elimination"
     # model config
     model:
    @@ -136,7 +139,7 @@ model:
         batch_size: 1 # add for increase predict
         seq_length: 4096
         hidden_size: 4096
    -    num_layers: 32
    +    num_layers: 4
         num_heads: 32
         vocab_size: 32000
         multiple_of: 256
  6. Copy LLAMA2_for_MS_code/configs/llama2/pretrain_llama2_7b.yaml to LLAMA2_for_MS_code/configs/llama2/no_resume_pretrain_llama2_7b.yaml and modify the following fields in that YAML (a minimal command sketch follows the field list).
    When the compilation cache is used, no strategy files are generated; delete the compilation cache before running.
     src_strategy_path_or_dir: './output/strategy'  
     resume_training: False
     load_checkpoint: ''   
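     A minimal sketch of this step (paths relative to LLAMA2_for_MS_code; the field values are exactly those listed above):
     cp configs/llama2/pretrain_llama2_7b.yaml configs/llama2/no_resume_pretrain_llama2_7b.yaml
     # then edit no_resume_pretrain_llama2_7b.yaml so that:
     #   src_strategy_path_or_dir: './output/strategy'
     #   resume_training: False
     #   load_checkpoint: ''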
  7. Prepare the training task YAML and modify the server IP address of the mounted volume and other settings according to your environment. The parameters in this YAML have the same meaning as in the PyTorch scenario.
    cd LLAMA2_for_MS_code/yamls
    wget https://gitee.com/ascend/mindxdl-deploy/raw/master/samples/train/resumable-training/fault-tolerance/ranktable/mindspore/llama2/yamls/ms_multinodes_acjob_910b.yaml
    Modify the startup command to:
                  command:                           # training command, which can be modified
                    - /bin/bash
                    - -c
                    - |
                     cd /job/code/;bash /job/code/scripts/msrun_launcher.sh "run_mindformer.py --config configs/llama2/pretrain_llama2_7b.yaml --train_dataset_dir {actual dataset path}/wiki4096/wiki4096.mindrecord --use_parallel True --run_mode train"