This section uses the LLaMA2 model adaptation example (PyTorch scenario) to guide users step by step through adapting resumable training.
| Step | Example |
|---|---|
| Prepare the model code and dataset | |
| Obtain the example scripts | See Step 7 of the LLaMA2 model adaptation example. |
| Configure the tokenizer path, dataset path, communication NIC, etc. required to launch training | See Step 8 of the LLaMA2 model adaptation example. |
| Adapt the rescheduling functions in the model scripts: process-level rescheduling | See Step 9-1 to configure process-level rescheduling. |
| Adapt the rescheduling functions in the model scripts: process-level online recovery | See Step 9-2 to configure process-level online recovery. |
| (Optional) Configure last-words CheckPoint recovery | See Step 9-3 to configure last-words CheckPoint recovery. |
The following uses Atlas 800T A2 training servers as an example to show how to adapt process-level rescheduling and process-level online recovery for the LLaMA2 model.
The example uses two-node training. To change the number of training nodes, set spec.runPolicy.minAvailable to the sum of the Master and Worker replica counts, i.e. the total number of training nodes.
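For orientation, here is a minimal sketch of how these fields relate in the job YAML, assuming one Master and one Worker replica as in this example (only the relevant keys are shown; the complete YAML appears in the steps below):

```yaml
spec:
  runPolicy:
    minAvailable: 2        # sum of Master and Worker replicas = total number of training nodes
  replicaSpecs:
    Master:
      replicas: 1
    Worker:
      replicas: 1
```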
```
git clone https://gitee.com/ascend/MindSpeed-LLM
cd MindSpeed-LLM
git checkout 1.0.0
```
```
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_r0.6.0
```
```python
...
class CustomFunction(torch.autograd.Function):
    """Reset the current stream to the default stream in both forward and backward."""

    @staticmethod
    def forward(ctx, input):
        torch.cuda.set_stream(torch.cuda.default_stream())
        return input

    @staticmethod
    def backward(ctx, grad):
        torch.cuda.set_stream(torch.cuda.default_stream())
        return grad


def streamHandler():
    # Build a tiny autograd graph on the NPU so that CustomFunction runs in both
    # the forward and the backward pass, restoring the default stream in each.
    input_tensor = torch.empty(1, dtype=torch.float32, device="npu", requires_grad=True)
    grad_tensor = torch.empty(1, dtype=torch.float32, device="npu", requires_grad=True)
    output_tensor = CustomFunction.apply(input_tensor)
    output_tensor.backward(grad_tensor)


def pretrain(train_valid_test_dataset_provider,
             ...
    if args.do_train and args.train_iters > 0:
        if args.enable_high_availability:
            # Register the stream handler and the dataset/model providers with
            # MindIO TTP, then run training through tft_train instead of train().
            from mindio_ttp.adaptor import tft_register_processor, tft_train
            from mindio_ttp.framework_ttp import tft_register_set_stream_handler
            tft_register_set_stream_handler(streamHandler)
            tft_register_processor(train_valid_test_dataset_provider, model_provider, model_type)
            iteration, num_floating_point_operations_so_far = tft_train(train_args, test_data_iterator_list)
        else:
            iteration, num_floating_point_operations_so_far = train(*train_args)
    ...
```
```
cp -r megatron ../MindSpeed-LLM/
```
```
root@ubuntu:/data/atlas_dls/public/dataset/llama70B_data/# pwd
/data/atlas_dls/public/dataset/llama70B_data/
root@ubuntu:/data/atlas_dls/public/dataset/llama70B_data/# du -sh
892M
```
```
root@ubuntu:/data/atlas_dls/public/code/LLAMA2_for_PyTorch_2.1_code/scripts#
scripts/
└── train_start.sh
```
```
root@ubuntu:/data/atlas_dls/public/code/LLAMA2_for_PyTorch_2.1_code/ # vim scripts/train_start.sh
```
```bash
export GLOO_SOCKET_IFNAME=enp189s0f0      # NIC on the physical host that can communicate; set it to the master node's actual high-speed NIC. If hostNetwork is false in the job YAML, set it to eth0
export HCCL_SOCKET_IFNAME=enp189s0f0      # If hostNetwork is false in the job YAML, set it to eth0
export ELASTIC_PROCESS_RECOVER_ENABLE=1   # Enable process-level rescheduling, process-level online recovery, and last-words CheckPoint recovery on the Elastic Agent side

LOAD_CHECKPOINT_PATH=/job/code/output/ckpt   # Checkpoint directory; note that checkpoints, log files, etc. should be mounted to the host via the YAML
SAVE_CHECKPOINT_PATH=/job/code/output/ckpt   # Checkpoint save directory

# Dataset path, e.g. DATA_PATH="/job/data/testcode/dataset/llama_text_document"
DATA_PATH="/job/data/testcode/dataset/llama_text_document"            # Dataset path

# Tokenizer path, e.g. TOKENIZER_MODEL="/job/data/testcode/dataset/llama/tokenizer.model"
TOKENIZER_MODEL="/job/data/testcode/dataset/llama/tokenizer.model"    # Tokenizer path

# (Optional) Custom directory where Elastic Agent run logs are written
mkdir -p /job/code/alllogs/$MINDX_TASK_ID/elasticlogs/elastic-log$XDL_IP-$RANK                  # MINDX_TASK_ID is the training job ID; XDL_IP and RANK distinguish the Elastic Agent logs of different nodes
export ELASTIC_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/elasticlogs/elastic-log$XDL_IP-$RANK   # Directory where the Elastic Agent component writes its run logs

# (Optional) Use secure gRPC connections between components
export ELASTIC_GRPC_SECURE_CONNECT=on                             # Secure-connection switch; "on" enables it
export ELASTIC_GRPC_SECURE_CERTIFICATES_PATH=/usr/security/cert   # Security certificate path; replace /usr/security/cert with a valid certificate path
```
```bash
...
export GLOO_SOCKET_IFNAME=eth0   # eth0 is the NIC inside the container that can communicate
export HCCL_SOCKET_IFNAME=eth0
...
```
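The hostNetwork switch referenced in the comments above lives in the pod template of the job YAML. The sketch below only illustrates where it sits; the placement follows the standard Kubernetes pod spec and the values are illustrative:

```yaml
  replicaSpecs:
    Master:
      template:
        spec:
          hostNetwork: true   # true: pods use the host network, so set GLOO_SOCKET_IFNAME/HCCL_SOCKET_IFNAME to the physical NIC (e.g. enp189s0f0)
                              # false: pods use the container network, so set both variables to eth0
```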
Adjust the volume mount paths for the dataset and code in the YAML (for example the code and data mounts) according to your environment, as sketched below.
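As a reference for those mounts, here is a minimal hostPath-based sketch; the container name, volume names, and host paths are illustrative and must be replaced with the values used in your cluster:

```yaml
        spec:
          containers:
            - name: ascend-training        # illustrative container name
              volumeMounts:
                - name: code
                  mountPath: /job/code     # training code, e.g. LLAMA2_for_PyTorch_2.1_code
                - name: data
                  mountPath: /job/data     # dataset and tokenizer files
          volumes:
            - name: code
              hostPath:
                path: /data/atlas_dls/public/code/LLAMA2_for_PyTorch_2.1_code   # illustrative host path
            - name: data
              hostPath:
                path: /data/atlas_dls/public/dataset                            # illustrative host path
```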
```yaml
...
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
...
  labels:
    framework: pytorch
    ring-controller.atlas: ascend-{xxx}b
    fault-scheduling: "force"
    fault-retry-times: "10"        # enable unconditional retry
    pod-rescheduling: "on"         # enable Pod-level rescheduling; process-level rescheduling depends on this switch
    process-recover-enable: "on"   # process-level recovery requires this online-recovery switch
    subHealthyStrategy: "ignore"   # sub-health handling strategy
    tor-affinity: "null"           # whether the job uses switch-affinity scheduling; "null" or omitting the label disables it. large-model-schema: large-model job; normal-schema: ordinary job
  annotations:
    recover-strategy: "recover"    # recovery strategy for the job: process-level rescheduling
...
  replicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          ...
            - name: TTP_PORT
              value: "8000"            # used for MindIO communication; keep the value consistent across Master and Worker
            - name: PROCESS_RECOVER
              value: "on"              # inject this environment variable to enable process-level rescheduling
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          command:                     # training command, which can be modified
            - /bin/bash
            - -c
          args: [ "cd /job/code;source /usr/local/Ascend/ascend-toolkit/set_env.sh;export LOGLEVEL=DEBUG;chmod +x scripts/train_start.sh; bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --sequence-parallel --num-layers 32 --hidden-size 4096 --ffn-hidden-size 11008 --num-attention-heads 32 --tokenizer-type Llama2Tokenizer --seq-length 4096 --max-position-embeddings 4096 --micro-batch-size 1 --global-batch-size 256 --make-vocab-size-divisible-by 1 --lr 1.25e-6 --train-iters 5000 --lr-decay-style cosine --untie-embeddings-and-output-weights --disable-bias-linear --attention-dropout 0.0 --init-method-std 0.01 --hidden-dropout 0.0 --position-embedding-type rope --normalization RMSNorm --use-fused-rmsnorm --swiglu --use-flash-attn --no-masked-softmax-fusion --attention-softmax-in-fp32 --min-lr 1.25e-7 --weight-decay 1e-1 --lr-warmup-fraction 0.01 --clip-grad 1.0 --adam-beta1 0.9 --initial-loss-scale 65536 --adam-beta2 0.95 --no-gradient-accumulation-fusion --no-load-optim --no-load-rng --use-distributed-optimizer --use-fused-swiglu --use-fused-rotary-pos-emb --overlap-grad-reduce --bf16 --enable-high-availability --enable-worker-reboot --data-path $DATA_PATH --split 949,50,1 --log-interval 1 --save-interval 20 --eval-interval 1000 --eval-iters 10 --distributed-backend nccl"]
          # enable-high-availability: switch for the fast fault recovery feature; off by default. Enabling it also turns on the last-words CheckPoint function.
          # enable-hbmfault-repair: switch for process-level online recovery; off by default. When enabled, on-chip memory is checked for faults and repaired online. Requires enable-high-availability.
          # enable-worker-reboot: switch for process-level rescheduling; off by default. When enabled, a general fault triggers process-level rescheduling so that training continues. Requires enable-high-availability.
          ports:                       # default value
            - containerPort: 2222
              name: ascendjob-port     # if not set
            - containerPort: 8000      # used for MindIO communication; keep the value consistent across Master and Worker
              name: ttp-port
          resources:
            limits:
              huawei.com/Ascend910: 8
            requests:
              huawei.com/Ascend910: 8
...
...
  replicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          ...
```
```yaml
          env:
            - name: TTP_PORT
              value: "8000"            # used for MindIO communication; keep the value consistent across Master and Worker
          command:                     # training command, which can be modified
            - /bin/bash
            - -c
          args: [ "cd /job/code;source /usr/local/Ascend/ascend-toolkit/set_env.sh;export PROCESS_RECOVER=on; export LOGLEVEL=DEBUG;chmod +x scripts/train_start.sh; bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --sequence-parallel --num-layers 32 --hidden-size 4096 --ffn-hidden-size 11008 --num-attention-heads 32 --tokenizer-type Llama2Tokenizer --seq-length 4096 --max-position-embeddings 4096 --micro-batch-size 1 --global-batch-size 256 --make-vocab-size-divisible-by 1 --lr 1.25e-6 --train-iters 5000 --lr-decay-style cosine --untie-embeddings-and-output-weights --disable-bias-linear --attention-dropout 0.0 --init-method-std 0.01 --hidden-dropout 0.0 --position-embedding-type rope --normalization RMSNorm --use-fused-rmsnorm --swiglu --use-flash-attn --no-masked-softmax-fusion --attention-softmax-in-fp32 --min-lr 1.25e-7 --weight-decay 1e-1 --lr-warmup-fraction 0.01 --clip-grad 1.0 --adam-beta1 0.9 --initial-loss-scale 65536 --adam-beta2 0.95 --no-gradient-accumulation-fusion --no-load-optim --no-load-rng --use-distributed-optimizer --use-fused-swiglu --use-fused-rotary-pos-emb --overlap-grad-reduce --bf16 --enable-high-availability --enable-worker-reboot --data-path $DATA_PATH --split 949,50,1 --log-interval 1 --save-interval 20 --eval-interval 1000 --eval-iters 10 --distributed-backend nccl"]
          ports:                       # default value
            - containerPort: 2222
              name: ascendjob-port     # if not set
            - containerPort: 8000      # used for MindIO communication; keep the value consistent across Master and Worker
              name: ttp-port
          ...
          resources:
            limits:
              huawei.com/Ascend910: 8
            requests:
              huawei.com/Ascend910: 8
```
```yaml
...
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
...
  labels:
    framework: pytorch
    ring-controller.atlas: ascend-{xxx}b
    fault-scheduling: "force"
    process-recover-enable: "on"   # process-level online recovery requires this switch
    tor-affinity: "null"           # whether the job uses switch-affinity scheduling; "null" or omitting the label disables it. large-model-schema: large-model job; normal-schema: ordinary job
  annotations:
    recover-strategy: "retry"
...
  replicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          ...
            - name: TTP_PORT
              value: "8000"            # used for MindIO communication; keep the value consistent across Master and Worker
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          command:                     # training command, which can be modified
            - /bin/bash
            - -c
          args: [ "cd /job/code;source /usr/local/Ascend/ascend-toolkit/set_env.sh; export LOGLEVEL=DEBUG;chmod +x scripts/train_start.sh; bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --sequence-parallel --num-layers 32 --hidden-size 4096 --ffn-hidden-size 11008 --num-attention-heads 32 --tokenizer-type Llama2Tokenizer --seq-length 4096 --max-position-embeddings 4096 --micro-batch-size 1 --global-batch-size 256 --make-vocab-size-divisible-by 1 --lr 1.25e-6 --train-iters 5000 --lr-decay-style cosine --untie-embeddings-and-output-weights --disable-bias-linear --attention-dropout 0.0 --init-method-std 0.01 --hidden-dropout 0.0 --position-embedding-type rope --normalization RMSNorm --use-fused-rmsnorm --swiglu --use-flash-attn --no-masked-softmax-fusion --attention-softmax-in-fp32 --min-lr 1.25e-7 --weight-decay 1e-1 --lr-warmup-fraction 0.01 --clip-grad 1.0 --adam-beta1 0.9 --initial-loss-scale 65536 --adam-beta2 0.95 --no-gradient-accumulation-fusion --no-load-optim --no-load-rng --use-distributed-optimizer --use-fused-swiglu --use-fused-rotary-pos-emb --overlap-grad-reduce --bf16 --enable-high-availability --enable-hbmfault-repair --data-path $DATA_PATH --split 949,50,1 --log-interval 1 --save-interval 20 --eval-interval 1000 --eval-iters 10 --distributed-backend nccl"]
          # enable-high-availability: switch for the fast fault recovery feature; off by default. Enabling it also turns on the last-words CheckPoint function.
          # enable-hbmfault-repair: switch for process-level online recovery; off by default. When enabled, on-chip memory is checked for faults and repaired online. Requires enable-high-availability.
          # enable-worker-reboot: switch for process-level rescheduling; off by default. When enabled, a general fault triggers a process-level restart and repair so that training continues. Requires enable-high-availability.
          ports:                       # default value
            - containerPort: 2222
              name: ascendjob-port     # if not set
            - containerPort: 8000      # used for MindIO communication; keep the value consistent across Master and Worker
              name: ttp-port
          resources:
            limits:
              huawei.com/Ascend910: 8
            requests:
              huawei.com/Ascend910: 8
...
...
  replicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          ...
```
```yaml
          env:
            - name: TTP_PORT
              value: "8000"            # used for MindIO communication; keep the value consistent across Master and Worker
          command:                     # training command, which can be modified
            - /bin/bash
            - -c
          args: [ "cd /job/code;source /usr/local/Ascend/ascend-toolkit/set_env.sh; export LOGLEVEL=DEBUG;chmod +x scripts/train_start.sh; bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --sequence-parallel --num-layers 32 --hidden-size 4096 --ffn-hidden-size 11008 --num-attention-heads 32 --tokenizer-type Llama2Tokenizer --seq-length 4096 --max-position-embeddings 4096 --micro-batch-size 1 --global-batch-size 256 --make-vocab-size-divisible-by 1 --lr 1.25e-6 --train-iters 5000 --lr-decay-style cosine --untie-embeddings-and-output-weights --disable-bias-linear --attention-dropout 0.0 --init-method-std 0.01 --hidden-dropout 0.0 --position-embedding-type rope --normalization RMSNorm --use-fused-rmsnorm --swiglu --use-flash-attn --no-masked-softmax-fusion --attention-softmax-in-fp32 --min-lr 1.25e-7 --weight-decay 1e-1 --lr-warmup-fraction 0.01 --clip-grad 1.0 --adam-beta1 0.9 --initial-loss-scale 65536 --adam-beta2 0.95 --no-gradient-accumulation-fusion --no-load-optim --no-load-rng --use-distributed-optimizer --use-fused-swiglu --use-fused-rotary-pos-emb --overlap-grad-reduce --bf16 --enable-high-availability --enable-hbmfault-repair --data-path $DATA_PATH --split 949,50,1 --log-interval 1 --save-interval 20 --eval-interval 1000 --eval-iters 10 --distributed-backend nccl"]
          ports:                       # default value
            - containerPort: 2222
              name: ascendjob-port     # if not set
            - containerPort: 8000      # used for MindIO communication; keep the value consistent across Master and Worker
              name: ttp-port
          ...
          resources:
            limits:
              huawei.com/Ascend910: 8
            requests:
              huawei.com/Ascend910: 8
```
In the job annotations, the recovery strategy (recover-strategy) must be set to dump, and the --enable-high-availability flag must be enabled in the training arguments. Modify the following fields of the job YAML according to your environment.
```yaml
...
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
...
  labels:
    framework: pytorch
    ring-controller.atlas: ascend-{xxx}b
    fault-scheduling: "force"
    tor-affinity: "null"           # whether the job uses switch-affinity scheduling; "null" or omitting the label disables it. large-model-schema: large-model job; normal-schema: ordinary job
  annotations:
    recover-strategy: "dump"
...
  replicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          ...
            - name: TTP_PORT
              value: "8000"            # used for MindIO communication; keep the value consistent across Master and Worker
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
          command:                     # training command, which can be modified
            - /bin/bash
            - -c
          args: [ "cd /job/code;source /usr/local/Ascend/ascend-toolkit/set_env.sh; export LOGLEVEL=DEBUG;chmod +x scripts/train_start.sh; bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --sequence-parallel --num-layers 32 --hidden-size 4096 --ffn-hidden-size 11008 --num-attention-heads 32 --tokenizer-type Llama2Tokenizer --seq-length 4096 --max-position-embeddings 4096 --micro-batch-size 1 --global-batch-size 256 --make-vocab-size-divisible-by 1 --lr 1.25e-6 --train-iters 5000 --lr-decay-style cosine --untie-embeddings-and-output-weights --disable-bias-linear --attention-dropout 0.0 --init-method-std 0.01 --hidden-dropout 0.0 --position-embedding-type rope --normalization RMSNorm --use-fused-rmsnorm --swiglu --use-flash-attn --no-masked-softmax-fusion --attention-softmax-in-fp32 --min-lr 1.25e-7 --weight-decay 1e-1 --lr-warmup-fraction 0.01 --clip-grad 1.0 --adam-beta1 0.9 --initial-loss-scale 65536 --adam-beta2 0.95 --no-gradient-accumulation-fusion --no-load-optim --no-load-rng --use-distributed-optimizer --use-fused-swiglu --use-fused-rotary-pos-emb --overlap-grad-reduce --bf16 --enable-high-availability --data-path $DATA_PATH --split 949,50,1 --log-interval 1 --save-interval 20 --eval-interval 1000 --eval-iters 10 --distributed-backend nccl"]
          # enable-high-availability: switch for the fast fault recovery feature; off by default. Enabling it also turns on the last-words CheckPoint function.
          ports:                       # default value
            - containerPort: 2222
              name: ascendjob-port     # if not set
            - containerPort: 8000      # used for MindIO communication; keep the value consistent across Master and Worker
              name: ttp-port
          resources:
            limits:
              huawei.com/Ascend910: 8
            requests:
              huawei.com/Ascend910: 8
...
...
  replicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          ...
```
```yaml
          env:
            - name: TTP_PORT
              value: "8000"            # used for MindIO communication; keep the value consistent across Master and Worker
          command:                     # training command, which can be modified
            - /bin/bash
            - -c
          args: [ "cd /job/code;source /usr/local/Ascend/ascend-toolkit/set_env.sh; export LOGLEVEL=DEBUG;chmod +x scripts/train_start.sh; bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py --tensor-model-parallel-size 4 --pipeline-model-parallel-size 1 --sequence-parallel --num-layers 32 --hidden-size 4096 --ffn-hidden-size 11008 --num-attention-heads 32 --tokenizer-type Llama2Tokenizer --seq-length 4096 --max-position-embeddings 4096 --micro-batch-size 1 --global-batch-size 256 --make-vocab-size-divisible-by 1 --lr 1.25e-6 --train-iters 5000 --lr-decay-style cosine --untie-embeddings-and-output-weights --disable-bias-linear --attention-dropout 0.0 --init-method-std 0.01 --hidden-dropout 0.0 --position-embedding-type rope --normalization RMSNorm --use-fused-rmsnorm --swiglu --use-flash-attn --no-masked-softmax-fusion --attention-softmax-in-fp32 --min-lr 1.25e-7 --weight-decay 1e-1 --lr-warmup-fraction 0.01 --clip-grad 1.0 --adam-beta1 0.9 --initial-loss-scale 65536 --adam-beta2 0.95 --no-gradient-accumulation-fusion --no-load-optim --no-load-rng --use-distributed-optimizer --use-fused-swiglu --use-fused-rotary-pos-emb --overlap-grad-reduce --bf16 --enable-high-availability --data-path $DATA_PATH --split 949,50,1 --log-interval 1 --save-interval 20 --eval-interval 1000 --eval-iters 10 --distributed-backend nccl"]
          ports:                       # default value
            - containerPort: 2222
              name: ascendjob-port     # if not set
            - containerPort: 8000      # used for MindIO communication; keep the value consistent across Master and Worker
              name: ttp-port
          ...
          resources:
            limits:
              huawei.com/Ascend910: 8
            requests:
              huawei.com/Ascend910: 8
```