Adapting to MindSpeed-LLM

Prerequisites

  • Learn about Constraints of MindIO TFT.
  • Prepare the MindSpeed-LLM framework by referring to MindSpeed-LLM. Note that the matching Megatron-LM version is core_v0.12.1.
  • The release package is intended for use with the MindSpeed-LLM 2.3.0 branch. To prepare the environment, code, and dataset securely, follow the guidelines in the MindSpeed-LLM repository.
  • MindIO TFT can be adapted to the MindSpeed-LLM framework. Currently, the MindIO TTP, MindIO UCE, and MindIO ARF features are supported.
  • For the PyTorch framework, if MindCluster is installed or enabled, skip step 1 (editing the torchrun file). In that case, MindCluster controls process termination.

Procedure

  1. (Optional) Edit the torchrun file.
    1. Search for the torchrun file in the environment.
      which torchrun
    2. Open the torchrun file at the path returned by the preceding command.
      vim {torchrun file path}/torchrun
    3. Press i to enter the insert mode and modify the file as follows:
      # Add the mindio_ttp.framework_ttp import.
      import re
      import sys
      import mindio_ttp.framework_ttp
      from torch.distributed.run import main as torch_main
    4. Press Esc, type :wq!, and press Enter to save the changes and exit.
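
As a quick check after saving, the following sketch confirms that the import line is present in a torchrun entry script. The function name has_mindio_import is hypothetical and not part of MindIO TFT; it only searches the file for the import statement shown in step 1.3.

```shell
#!/bin/bash
# Hypothetical helper (not part of MindIO TFT): succeeds if the given file
# contains a line that reads "import mindio_ttp.framework_ttp".
has_mindio_import() {
    local file="$1"
    grep -Eq '^[[:space:]]*import[[:space:]]+mindio_ttp\.framework_ttp[[:space:]]*$' "$file"
}

# Example usage: check the torchrun found on PATH, if there is one.
torchrun_path="$(command -v torchrun || true)"
if [ -n "$torchrun_path" ] && has_mindio_import "$torchrun_path"; then
    echo "mindio_ttp import found in $torchrun_path"
else
    echo "mindio_ttp import not found (or torchrun is not on PATH)"
fi
```

The check is purely textual: it does not verify that the mindio_ttp package can actually be imported in the current Python environment.
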
  2. Edit the pre-training script (for reference only).

    The following uses the examples/mcore/llama2/pretrain_llama2_7b_ptd.sh script as an example.

    1. Open the examples/mcore/llama2/pretrain_llama2_7b_ptd.sh script.
      vim examples/mcore/llama2/pretrain_llama2_7b_ptd.sh
    2. Press i to enter the insert mode. To enable the high availability functions, add the information in bold to the script.
      #!/bin/bash
      
      export CUDA_DEVICE_MAX_CONNECTIONS=1
      export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
      
      export GLOO_SOCKET_IFNAME=enp189s0f0
      export TTP_ADDR="master node ip"
      source /usr/local/Ascend/cann/set_env.sh
      
      NPUS_PER_NODE=8
      MASTER_ADDR=localhost
      MASTER_PORT=6000
      NNODES=1
      NODE_RANK=0
      WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
      
      CKPT_SAVE_DIR="your model save ckpt path"
      DATA_PATH="your data path"
      TOKENIZER_MODEL="your tokenizer path"
      CKPT_LOAD_DIR="your model ckpt path"
      TP=1
      PP=2
      
      DISTRIBUTED_ARGS="
          --nproc_per_node $NPUS_PER_NODE \
          --nnodes $NNODES \
          --node_rank $NODE_RANK \
          --master_addr $MASTER_ADDR \
          --master_port $MASTER_PORT
      "
      
      GPT_ARGS="
          --use-mcore-models \
          --tensor-model-parallel-size ${TP} \
          --pipeline-model-parallel-size ${PP} \
          --sequence-parallel \
          --num-layers 32 \
          --hidden-size 4096 \
          --ffn-hidden-size 11008 \
          --num-attention-heads 32 \
          --tokenizer-type Llama2Tokenizer \
          --tokenizer-model ${TOKENIZER_MODEL} \
          --seq-length 4096 \
          --max-position-embeddings 4096 \
          --micro-batch-size 1 \
          --global-batch-size 256 \
          --make-vocab-size-divisible-by 1 \
          --lr 1.25e-6 \
          --train-iters 5000 \
          --lr-decay-style cosine \
          --untie-embeddings-and-output-weights \
          --disable-bias-linear \
          --attention-dropout 0.0 \
          --init-method-std 0.01 \
          --hidden-dropout 0.0 \
          --position-embedding-type rope \
          --normalization RMSNorm \
          --use-fused-rmsnorm \
          --swiglu \
          --use-flash-attn \
      
          --no-masked-softmax-fusion \
          --attention-softmax-in-fp32 \
          --min-lr 1.25e-7 \
          --weight-decay 1e-1 \
          --lr-warmup-fraction 0.01 \
          --clip-grad 1.0 \
          --adam-beta1 0.9 \
          --initial-loss-scale 65536 \
          --adam-beta2 0.95 \
          --no-gradient-accumulation-fusion \
          --no-load-optim \
          --no-load-rng \
          --use-distributed-optimizer \
          --use-fused-swiglu \
          --use-fused-rotary-pos-emb \
          --overlap-grad-reduce \
          --bf16 \
          --enable-high-availability \
          --enable-hbmfault-repair \
          --enable-worker-reboot \
          --distributed-optimizer-no-replica \
      "
      
      DATA_ARGS="
          --data-path $DATA_PATH \
          --split 949,50,1
      "
      
      OUTPUT_ARGS="
          --log-interval 1 \
          --save-interval 10000 \
          --eval-interval 1000 \
          --eval-iters 10 \
      "
      
      torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
          $GPT_ARGS \
          $DATA_ARGS \
          $OUTPUT_ARGS \
          --distributed-backend nccl \
          --load $CKPT_LOAD_DIR \
          --save $CKPT_SAVE_DIR \
          | tee logs/train_llama2_7b.log

      The parameters related to the high availability function are described as follows:

      • Set GLOO_SOCKET_IFNAME based on the high-speed NIC of the primary node.
      • TTP_ADDR indicates the IPv4 address of the primary node in the cluster. For details, see Environment Variables.
      • Change the path of the set_env.sh file to the actual CANN installation path.
      • enable-high-availability indicates whether to enable MindIO TFT. By default, this function is disabled. After this function is enabled, the dying gasp function is enabled by default.

        After MindIO TFT is enabled, the memory usage of each optimizer type changes. For details, see Table 1.

        For a distributed optimizer, adding optimizer replicas increases static memory usage. However, the DP size grows with the cluster scale, so in a large cluster the average memory increase per NPU is small, which helps avoid out-of-memory (OOM) errors. Enabling this parameter is therefore recommended in large clusters. Decide whether to enable it based on the actual NPU memory available.

      • enable-hbmfault-repair indicates whether to enable MindIO UCE, which is disabled by default. When it is enabled, on-chip memory faults are detected and repaired online, and training continues through step-level recomputation. This feature takes effect only when enable-high-availability is enabled. It also depends on the memory management mechanism of PyTorch: it can be used only when the environment variable PYTORCH_NO_NPU_MEMORY_CACHING is not configured, that is, when the memory reuse mechanism is enabled. If export PYTORCH_NO_NPU_MEMORY_CACHING=1 is configured, this feature cannot be used.
      • enable-worker-reboot indicates whether to enable MindIO ARF, which is disabled by default. When it is enabled, a process-level restart is performed so that training continues after a common fault occurs. This function takes effect only when enable-high-availability is enabled.
      • distributed-optimizer-no-replica: After high availability is enabled, optimizer replicas are added for the distributed optimizer by default, which increases on-chip memory usage. If this parameter is enabled, distributed optimizer replicas are not added and memory usage does not increase. In the MindIO UCE and MindIO ARF scenarios, periodic checkpoints are used for online recovery.
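
The dependencies between these parameters can be checked before launching training. The following is a minimal sketch, assuming the arguments are available as a single string such as GPT_ARGS in the script above. The function name check_ha_args is hypothetical and is not provided by MindSpeed-LLM or MindIO TFT.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of MindSpeed-LLM or MindIO TFT).
# Prints one line per problem found in the given argument string and
# returns non-zero if at least one problem was found.
check_ha_args() {
    local args="$1"
    local problems=0

    # MindIO UCE and MindIO ARF take effect only when
    # --enable-high-availability is also present.
    case "$args" in
        *--enable-high-availability*) ;;
        *)
            case "$args" in
                *--enable-hbmfault-repair*|*--enable-worker-reboot*)
                    echo "--enable-hbmfault-repair/--enable-worker-reboot need --enable-high-availability"
                    problems=$((problems + 1))
                    ;;
            esac
            ;;
    esac

    # MindIO UCE relies on the PyTorch memory reuse mechanism, so
    # PYTORCH_NO_NPU_MEMORY_CACHING must not be configured at all.
    case "$args" in
        *--enable-hbmfault-repair*)
            if [ -n "${PYTORCH_NO_NPU_MEMORY_CACHING+x}" ]; then
                echo "unset PYTORCH_NO_NPU_MEMORY_CACHING to use --enable-hbmfault-repair"
                problems=$((problems + 1))
            fi
            ;;
    esac

    [ "$problems" -eq 0 ]
}
```

One possible use is to call check_ha_args "$GPT_ARGS" || exit 1 just before the torchrun line. The check matches flag names as plain substrings, so it is only a convenience and does not replace the argument validation done by the framework.
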
      Table 1 Theoretical value changes of optimizer parameters between the native optimizer and the optimizer with MindIO TFT enabled

      Optimizer              Native     MindIO TFT Enabled  Description
      fp16/bf16              20         20                  -
      fp32                   16         16                  -
      fp16/bf16 Distributed  4 + 16/d   4 + 16 * N/d        • d indicates the DP Group Size.
                                                            • N indicates the number of replicas. The value of N is less than that of d.
    3. Press Esc, type :wq!, and press Enter to save the changes and exit.
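
To make the distributed optimizer row of Table 1 concrete, the sketch below evaluates 4 + 16/d and 4 + 16 * N/d for a given DP group size d and replica count N. The function names are hypothetical, and the results are in the same (unstated) unit as the values in Table 1.

```shell
#!/bin/bash
# Hypothetical helpers that evaluate the fp16/bf16 distributed optimizer
# formulas from Table 1. d = DP group size, n = number of replicas (n < d).
native_dist_opt() {
    awk -v d="$1" 'BEGIN { printf "%.2f\n", 4 + 16 / d }'
}

tft_dist_opt() {
    awk -v d="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 4 + 16 * n / d }'
}

# Example: compare a small and a large DP group, both with 2 replicas.
for d in 8 64; do
    echo "d=$d native=$(native_dist_opt "$d") tft=$(tft_dist_opt "$d" 2)"
done
```

With N = 2, the gap between the two columns is 2 when d = 8 (6 versus 8) and 0.25 when d = 64 (4.25 versus 4.5), which illustrates why the extra memory per NPU becomes small in large clusters.
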