```shell
#!/bin/bash

export CUDA_DEVICE_MAX_CONNECTIONS=1

NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))

CHECKPOINT_PATH=./ckpt_gpt
VOCAB_FILE=/home/dataset/enwiki/gpt2-vocab.json
MERGE_FILE=/home/dataset/enwiki/gpt2-merges.txt
DATA_PATH=/home/dataset/enwiki/my-t5_text_sentence

TP=2
PP=1
CP=1
EP=2

DISTRIBUTED_ARGS="
    --nproc_per_node $NPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

MOE_ARGS="
    --expert-model-parallel-size ${EP} \
    --moe-model-type megatron_moe \
    --num-experts 4 \
    --moe-permutation-async-comm \
    --moe-grouped-gemm \
    --moe-token-dispatcher-type alltoall \
    --moe-router-topk 2 \
"

RECOMPUTE_ARGS="
    --recompute-activation-function \
    --swap-attention \
    --recompute-num-layers 1 \
"

GPT_ARGS="
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --disable-bias-linear \
    --reuse-fp32-param \
    --use-mcore-models \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --use-fused-rotary-pos-emb \
    --sequence-parallel \
    --num-layers 2 \
    --hidden-size 12288 \
    --num-attention-heads 96 \
    --seq-length 8192 \
    --max-position-embeddings 8192 \
    --micro-batch-size 1 \
    --global-batch-size 4 \
    --train-iters 1000 \
    --lr 5.0e-7 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --clip-grad 1.0 \
    --weight-decay 0.1 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --init-method-std 0.006 \
    --no-gradient-accumulation-fusion \
    --use-flash-attn \
    --position-embedding-type rope \
    --no-bias-gelu-fusion \
    --no-bias-dropout-fusion \
    --attention-dropout 0.0 \
    --hidden-dropout 0.0 \
    --bf16
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --vocab-size 50257 \
    --num-workers 4 \
    --split 949,50,1
"

OUTPUT_ARGS="
    --log-interval 1 \
    --save-interval 10000 \
    --eval-interval 10000 \
    --eval-iters 10 \
    --log-throughput \
    --timing-log-option max \
    --no-barrier-with-level-1-timing \
    --timing-log-level 0 \
"

torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    $GPT_ARGS \
    $MOE_ARGS \
    $RECOMPUTE_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --distributed-timeout-minutes 10 \
    --seed 1234

set +x
```
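With these settings the 8 NPUs decompose into TP=2 × PP=1 × CP=1 with a data-parallel size of 4, out of which EP=2 carves the expert-parallel groups. The following pre-flight check is only a sketch under the usual Megatron-style divisibility assumptions (not part of the original script), but it can catch a misconfigured layout before `torchrun` starts:

```shell
# Sketch only: sanity-check the parallel layout before launching.
# Assumes the usual Megatron-style constraints; adjust if your fork differs.
TP=2; PP=1; CP=1; EP=2; NUM_EXPERTS=4
WORLD_SIZE=8

if (( WORLD_SIZE % (TP * PP * CP) != 0 )); then
    echo "WORLD_SIZE must be divisible by TP*PP*CP" >&2; exit 1
fi
DP=$(( WORLD_SIZE / (TP * PP * CP) ))   # data-parallel size, 4 here

if (( DP % EP != 0 )); then
    echo "data-parallel size must be divisible by EP" >&2; exit 1
fi
if (( NUM_EXPERTS % EP != 0 )); then
    echo "--num-experts must be divisible by --expert-model-parallel-size" >&2; exit 1
fi
echo "DP=${DP}, local experts per rank=$(( NUM_EXPERTS / EP ))"
```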
```shell
CHECKPOINT_PATH=./ckpt_gpt
VOCAB_FILE=/home/dataset/enwiki/gpt2-vocab.json
MERGE_FILE=/home/dataset/enwiki/gpt2-merges.txt
DATA_PATH=/home/dataset/enwiki/my-t5_text_sentence
```
Replace the paths above with the actual paths in your environment.
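Before launching, it can help to confirm that the tokenizer files and the preprocessed dataset are where the script expects them. The check below is a sketch that assumes the usual Megatron `.bin`/`.idx` layout, where `DATA_PATH` is the dataset prefix:

```shell
# Minimal pre-flight check (sketch): verify vocab/merge files and the
# indexed dataset prefix exist before starting the run.
VOCAB_FILE=/home/dataset/enwiki/gpt2-vocab.json
MERGE_FILE=/home/dataset/enwiki/gpt2-merges.txt
DATA_PATH=/home/dataset/enwiki/my-t5_text_sentence

for f in "$VOCAB_FILE" "$MERGE_FILE" "${DATA_PATH}.bin" "${DATA_PATH}.idx"; do
    [ -f "$f" ] || echo "missing: $f" >&2
done
```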
```shell
bash pretrain_gpt_megatron_moe_8k.sh
```
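To keep the console output for later inspection, one common pattern (not required by the script itself) is to redirect it through `tee`:

```shell
# Capture stdout and stderr to a log file while still printing to the console.
bash pretrain_gpt_megatron_moe_8k.sh 2>&1 | tee pretrain_gpt_moe.log
```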
Key arguments configured in the script:

```shell
--tensor-model-parallel-size     # tensor model parallelism
--pipeline-model-parallel-size   # pipeline model parallelism
--sequence-parallel              # sequence parallelism

--use-flash-attn                 # Flash Attention fused operator
--position-embedding-type rope   # RoPE position embedding
--use-fused-rotary-pos-emb       # fused RoPE operator

--moe-model-type megatron_moe    # use the Megatron MoE model
--num-experts 4                  # number of experts
--expert-model-parallel-size     # expert parallelism

--moe-grouped-gemm               # grouped GEMM fused operator
--moe-permutation-async-comm     # overlap the permute operation with communication
```
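The expert layout can be varied through `MOE_ARGS` alone, as long as `--num-experts` stays divisible by `--expert-model-parallel-size` and the expert-parallel size fits the data-parallel dimension (4 in this configuration). The values below are a hypothetical variant, not taken from the original script:

```shell
# Hypothetical alternative: 8 experts over 4 expert-parallel ranks
# (2 local experts per rank), routing and fused operators unchanged.
EP=4
MOE_ARGS="
    --expert-model-parallel-size ${EP} \
    --moe-model-type megatron_moe \
    --num-experts 8 \
    --moe-permutation-async-comm \
    --moe-grouped-gemm \
    --moe-token-dispatcher-type alltoall \
    --moe-router-topk 2 \
"
```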