```shell
#!/bin/bash

export CUDA_DEVICE_MAX_CONNECTIONS=1

# Single-node launch topology
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))

# Checkpoint, dataset, and tokenizer paths
CKPT_DIR=./ckpt_llama
DATA_PATH="/home/dataset/llama2/alpaca_text_document"
TOKENIZER_MODEL="/home/dataset/model/llama-2-7b-hf/tokenizer.model"

# Parallel degrees: tensor, pipeline, context, expert
TP=2
PP=2
CP=1
EP=1

DISTRIBUTED_ARGS="
    --nproc_per_node $NPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --num-layers-per-virtual-pipeline-stage 1 \
    --reuse-fp32-param \
    --sequence-parallel \
    --use-fused-rotary-pos-emb \
    --use-fused-swiglu \
    --use-fused-rmsnorm \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --num-layers 10 \
    --hidden-size 8192 \
    --ffn-hidden-size 28672 \
    --num-attention-heads 64 \
    --tokenizer-type Llama2Tokenizer \
    --tokenizer-model ${TOKENIZER_MODEL} \
    --seq-length 4096 \
    --max-position-embeddings 4096 \
    --micro-batch-size 1 \
    --global-batch-size 16 \
    --make-vocab-size-divisible-by 1 \
    --lr 1.0e-6 \
    --train-iters 1000 \
    --lr-decay-style cosine \
    --untie-embeddings-and-output-weights \
    --attention-dropout 0.0 \
    --init-method-std 0.01 \
    --hidden-dropout 0.0 \
    --position-embedding-type rope \
    --normalization RMSNorm \
    --swiglu \
    --use-flash-attn \
    --no-masked-softmax-fusion \
    --attention-softmax-in-fp32 \
    --min-lr 1.0e-7 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --initial-loss-scale 4096.0 \
    --adam-beta2 0.95 \
    --adam-eps 1e-5 \
    --disable-bias-linear \
    --group-query-attention \
    --num-query-groups 8 \
    --lr-warmup-fraction 0.01 \
    --bf16
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --split 100,0,0
"

OUTPUT_ARGS="
    --log-throughput \
    --log-interval 1 \
    --save-interval 10000 \
    --eval-interval 10000 \
    --eval-iters 10
"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl

set +x
```
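The data-parallel degree is not passed explicitly; it follows from the world size and the model-parallel degrees. A minimal sketch of that arithmetic, with the values hard-coded from the script above (the variable names are only illustrative):

```python
# Sanity-check sketch only, not part of the framework: Megatron-style layout
# implies data_parallel = world_size / (TP * PP * CP).
NPUS_PER_NODE, NNODES = 8, 1
TP, PP, CP = 2, 2, 1

world_size = NPUS_PER_NODE * NNODES      # 8
model_parallel = TP * PP * CP            # 4
assert world_size % model_parallel == 0, "world size must be divisible by TP*PP*CP"
dp = world_size // model_parallel        # 2 -> the "dp2" in the script name
print(f"world_size={world_size}, data_parallel={dp}")
```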
```shell
CKPT_DIR=./ckpt_llama
DATA_PATH="/home/dataset/llama2/alpaca_text_document"
TOKENIZER_MODEL="/home/dataset/model/llama-2-7b-hf/tokenizer.model"
```
Replace the paths above with the actual paths in your environment.
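A misconfigured path often only surfaces after all ranks have started. If helpful, a small pre-flight check like the hypothetical sketch below can catch it earlier; it assumes DATA_PATH is a Megatron-style preprocessed dataset prefix with matching .bin/.idx files.

```python
# Hypothetical pre-flight check for the paths configured above (not part of the script).
from pathlib import Path

data_prefix = Path("/home/dataset/llama2/alpaca_text_document")              # DATA_PATH prefix
tokenizer_model = Path("/home/dataset/model/llama-2-7b-hf/tokenizer.model")  # TOKENIZER_MODEL

required = [data_prefix.with_suffix(".bin"), data_prefix.with_suffix(".idx"), tokenizer_model]
missing = [p for p in required if not p.is_file()]
if missing:
    raise FileNotFoundError(f"missing required files: {missing}")
print("dataset and tokenizer paths look valid")
```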
```shell
bash pretrain_llama2_70B_4k_tp2_pp2_vpp1_dp2.sh
```
```shell
--tensor-model-parallel-size                # tensor model parallelism
--pipeline-model-parallel-size              # pipeline model parallelism
--num-layers-per-virtual-pipeline-stage 1   # one layer per virtual pipeline stage
--context-parallel-algo <algo>              # <algo> is ulysses_cp_algo or megatron_cp_algo;
                                            # with GQA enabled, megatron_cp_algo is recommended,
                                            # for non-GQA runs at 32k sequence length, ulysses_cp_algo is recommended
--sequence-parallel                         # sequence parallelism
```
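The selection rule above can be restated as a small helper for readability; the function below is illustrative only (it is not an API of the framework), and the fallback branch for longer non-GQA sequences is an assumption not covered by the guidance.

```python
# Illustrative restatement of the --context-parallel-algo guidance above; not a framework API.
def pick_context_parallel_algo(use_gqa: bool, seq_length: int) -> str:
    if use_gqa:
        return "megatron_cp_algo"     # recommended whenever grouped-query attention is enabled
    if seq_length <= 32 * 1024:
        return "ulysses_cp_algo"      # recommended for non-GQA runs at 32k sequence length and below
    return "megatron_cp_algo"         # assumption: case not covered by the guidance above

print(pick_context_parallel_algo(use_gqa=True, seq_length=4096))   # megatron_cp_algo
```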
```shell
--use-flash-attn          # flash attention fused operator
--normalization RMSNorm   # RMSNorm normalization
--use-fused-rmsnorm       # fused RMSNorm operator
--swiglu                  # SwiGLU activation function
--use-fused-swiglu        # fused SwiGLU operator
```
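For reference, the fused RMSNorm and SwiGLU kernels compute the standard LLaMA-style formulation; the unfused PyTorch sketch below shows the underlying math (the dimensions are tiny stand-ins and the epsilon value is an assumption).

```python
# Unfused reference for the fused RMSNorm / SwiGLU kernels (sketch only).
import torch
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x / sqrt(mean(x^2) + eps), then a learned per-channel scale
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: SiLU(x @ w_gate) * (x @ w_up), projected back by w_down
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down

hidden, ffn = 64, 224   # tiny stand-ins for --hidden-size 8192 / --ffn-hidden-size 28672
x = torch.randn(2, 4, hidden)
out = swiglu_ffn(rms_norm(x, torch.ones(hidden)),
                 torch.randn(hidden, ffn), torch.randn(hidden, ffn), torch.randn(ffn, hidden))
print(out.shape)  # torch.Size([2, 4, 64])
```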