Adaptation Example

This section describes how to adapt resumable training step by step.

  • Graceful fault tolerance and process-level online recovery work correctly only if the clocks of the master and worker nodes in the Kubernetes cluster are synchronized.
  • The displayed component code for resumable training is open source code. See Security Description to learn its security requirements.
  • The sample code below may differ from the actual implementation. Please use the actual code.
  • Configure the model parameters according to the settings defined in the model repository. Improper modifications may lead to unexpected issues.
  • If "Failed to bind the IP port. Reason: The IP address and port have been bound already" is displayed during training, set the following environment variables to port ranges that are not in use. For details, see "HCCL_HOST_SOCKET_PORT_RANGE" in CANN Environment Variable Reference.
    export HCCL_HOST_SOCKET_PORT_RANGE="60000-60050"
    export HCCL_NPU_SOCKET_PORT_RANGE="61000-61050"
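
Before settling on the two port ranges, it can help to confirm that the ports are actually free on the host. The following is a minimal sketch and not part of MindCluster or CANN: the helper names parse_port_range, port_is_free, and busy_ports are hypothetical, and the check only probes TCP binding on one local address (127.0.0.1 by default).

```python
import socket


def parse_port_range(value: str) -> tuple[int, int]:
    """Split a "start-end" string such as "60000-60050" into two integers."""
    start_text, end_text = value.split("-", 1)
    start, end = int(start_text), int(end_text)
    if not (0 < start <= end <= 65535):
        raise ValueError(f"invalid port range: {value!r}")
    return start, end


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def busy_ports(value: str, host: str = "127.0.0.1") -> list[int]:
    """List the ports in the range that cannot be bound right now."""
    start, end = parse_port_range(value)
    return [p for p in range(start, end + 1) if not port_is_free(p, host)]


if __name__ == "__main__":
    # The two ranges from the export statements above.
    for port_range in ("60000-60050", "61000-61050"):
        print(port_range, "busy ports:", busy_ports(port_range))
```

An empty list for both ranges suggests that the ranges can be used as they are. Otherwise, pick ranges that avoid the reported ports.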

Adaptation Example for PyTorch (MindSpeed-LLM)

For details about how to prepare the training code and dataset, see MindSpeed-LLM User Guide. The following uses two Atlas 800T A2 training servers as an example to describe the operations.

  1. Pull the training code.
    mkdir -p /data/atlas_dls/public/code
    cd /data/atlas_dls/public/code
    git clone https://gitcode.com/Ascend/MindSpeed-LLM.git
    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd MindSpeed-LLM
    git checkout 2.3.0
    cd ..
    cd Megatron-LM 
    git checkout core_v0.12.1
    cp -r megatron ../MindSpeed-LLM/ # Copy the megatron directory in the Megatron-LM project to the MindSpeed-LLM project.
    ## Rename MindSpeed-LLM to QWEN3_for_PyTorch_2.7_code.
    cd ..
    mv MindSpeed-LLM QWEN3_for_PyTorch_2.7_code
  2. Obtain model weights.

    Download the desired model weight file from the Qwen3 repository to a directory on the server, for example, /data/atlas_dls/public/dataset/qwen3-8b-hf.

  3. Obtain the dataset.

    Download the desired dataset (for example, Alpaca) to a directory on the server, for example, /data/atlas_dls/public/dataset/qwen3-alpaca.

  4. Process the dataset.
    1. Start the container.
      docker run -it -v /data/atlas_dls/public/:/data/atlas_dls/public/ -e ASCEND_VISIBLE_DEVICES=0-7 mindspeed-dl:v1 bash
    2. Perform the following operations in the container.
      export TORCH_DEVICE_BACKEND_AUTOLOAD=0
      source /usr/local/Ascend/cann/set_env.sh
      cd /data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code
      # (Optional) Install MindSpeed in any directory. If it has been installed during image creation, skip this step.
      git clone https://gitcode.com/ascend/MindSpeed.git 
      cd MindSpeed 
      git checkout 2.3.0_core_r0.12.1
      pip install -r requirements.txt 
      pip install -e . 
      export PYTHONPATH=/data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code/MindSpeed:$PYTHONPATH
      cd ..
    3. Process the dataset.

      Qwen3 requires Transformers 4.51.0 or later. Therefore, install Python 3.9 or later and Transformers 4.51.0 or later before running the command.

      # --input: dataset file path.
      # --tokenizer-name-or-path: path of the open source model weight files.
      # --output-prefix: prefix of the generated alpaca_text_document.bin and alpaca_text_document.idx files.
      python preprocess_data.py \
          --input /data/atlas_dls/public/dataset/qwen3-alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet \
          --tokenizer-name-or-path /data/atlas_dls/public/dataset/qwen3-8b-hf \
          --tokenizer-type PretrainedFromHF \
          --handler-name GeneralPretrainHandler \
          --output-prefix /data/atlas_dls/public/dataset/qwen3-alpaca/alpaca \
          --json-keys text \
          --workers 4 \
          --log-interval 1000
      If the error message "/usr/local/lib/python3.10/dist-packages/sklearn/utils/../../scikit_learn.libs/libgomp-947d5fa1.so.1.0.0: cannot allocate memory in static TLS block" is displayed, run the following command to preload the libgomp library.
      export LD_PRELOAD="/usr/local/lib/python3.10/dist-packages/scikit_learn.libs/libgomp-947d5fa1.so.1.0.0"
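
The version requirements stated in this step can be checked before preprocess_data.py is run. The following is a minimal sketch, assuming the installed Transformers version is read through importlib.metadata. The helper names version_tuple and at_least are hypothetical, and pre-release suffixes such as .dev0 are ignored.

```python
import re
import sys
from importlib import metadata


def version_tuple(text: str) -> tuple[int, ...]:
    """Return the leading numeric components of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def at_least(installed: str, required: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    print("Python >= 3.9:", sys.version_info >= (3, 9))
    try:
        found = metadata.version("transformers")
        print("Transformers", found, ">= 4.51.0:", at_least(found, "4.51.0"))
    except metadata.PackageNotFoundError:
        print("Transformers is not installed")
```

Comparing integer tuples rather than strings matters here, because a plain string comparison would rank "4.9.0" above "4.51.0".
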
  5. Go to the mindcluster-deploy repository and switch to the branch indicated in mindcluster-deploy Version Description. Obtain the train_start.sh file from the samples/train/resumable-training/fault-tolerance/without-ranktable/pytorch/Qwen3 directory, and create the following directory structure on the management node.
    root@ubuntu:/data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code/scripts#
    scripts/
    └── train_start.sh
  6. Obtain the training job YAML. Functions such as pod-level rescheduling, process-level rescheduling, process-level online recovery, and elastic training have been preconfigured in the YAML. Set the IP address of the server to which the volume is mounted and rescheduling levels as required.

    Training process recovery functions such as process-level rescheduling, process-level online recovery, and elastic training cannot coexist with graceful fault tolerance. For details about how to configure graceful fault tolerance, see Graceful Fault Tolerance Mode.

  7. Configure the train_start.sh script and training job YAML file as required.
    1. Modify the basic parameters in the startup script.
      mkdir -p /job/code/alllogs/$MINDX_TASK_ID/ttplogs
      mkdir -p /job/code/alllogs/$MINDX_TASK_ID/trainlogs
      mkdir -p /job/code/alllogs/$MINDX_TASK_ID/demo/
      # Log storage path. Change it as required.
      export ASCEND_PROCESS_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/plogs/$XDL_IP # Plog storage path. $MINDX_TASK_ID is the UID environment variable injected by Ascend Operator, and $XDL_IP is the environment variable status.hostIP written in the job YAML file.
      export TTP_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/ttplogs/ttplog$XDL_IP-$RANK #    TTP log storage path. $RANK is the environment variable injected by Ascend Operator for the PyTorch framework.
      export TRAIN_LOG_PATH=/job/code/alllogs/$MINDX_TASK_ID/trainlogs/$XDL_IP-$RANK #    Training log storage path.
      export GLOO_SOCKET_IFNAME=enp189s0f0             # Network port on the physical machine that can be used for communication. Set it based on the actual high-speed NIC of the master node. If hostNetwork is set to false in the job YAML file, set this parameter to eth0.
      export HCCL_SOCKET_IFNAME=enp189s0f0                # If hostNetwork is set to false in the job YAML file, set this parameter to eth0.
       
      CKPT_SAVE_DIR="/job/code/output/ckpt" # Path for saving weights after training is complete.
      DATA_PATH="/job/data/alpaca_text_document" # Dataset path. Input the path of the data saved during data preprocessing.
      TOKENIZER_PATH="/job/data/qwen3-8b-hf" # Tokenizer path. Enter the path of the tokenizer downloaded with the open source weights.
      CKPT_LOAD_DIR="/job/code/output/ckpt" # Path for loading weights.
    2. To use TaskD to implement process-level rescheduling, process-level online recovery, process-level in-place recovery, or elastic training, you need to start TaskD Manager.
      1. Create a manager.py file and save it to the directory where the training script is called. The content of the manager.py file is as follows:
        from taskd.api import init_taskd_manager, start_taskd_manager
        import os
        
        job_id=os.getenv("MINDX_TASK_ID")
        node_nums=XX         # Total number of nodes (set by yourself)
        proc_per_node=XX     # Number of training processes on each node (set by yourself)
        
        init_taskd_manager({"job_id":job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
        start_taskd_manager()

        For details about the parameters in the manager.py file, see def init_taskd_manager(config:dict) -> bool:.

      2. Add the following code to the training script to start TaskD Manager.

        In the code, the TASKD_SO_PATH and export LD_PRELOAD statements add the path of libtaskd.so to the LD_PRELOAD environment variable after TaskD is installed. If these two statements do not take effect, run the pip show taskd command to obtain the value of Location, append /taskd/python/cython_api/libs/libtaskd.so to it, and export the resulting path manually.

        sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
        TASKD_SO_PATH="$(pip show taskd | awk '/^Location: / {print $2"/taskd/python/cython_api/libs/libtaskd.so"}')"
        export LD_PRELOAD=$TASKD_SO_PATH:$LD_PRELOAD
        export TASKD_PROCESS_ENABLE="on"
        if [[ "${RANK}" == 0 ]]; then
            export MASTER_ADDR=${POD_IP}
            python manager.py &            # Determined by the current path.
        fi
        torchrun $DISTRIBUTED_ARGS ...
      3. Modify the job YAML file by adding port 9601 for TaskD communication to all pods. (If the port exists, skip this step.)
        ...
                spec: 
        ...
                   containers: 
        ... 
                     ports:                           
                        - containerPort: 9601          
                          name: taskd-port
        ...
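
The manual fallback for LD_PRELOAD described in this section (read Location from the output of pip show taskd, then append the relative path of the library) can be sketched as follows. This is an illustration only: the function names and the sample pip output are hypothetical, and the relative path is the one given in the text.

```python
LIBTASKD_RELATIVE_PATH = "taskd/python/cython_api/libs/libtaskd.so"


def libtaskd_path(pip_show_output: str) -> str:
    """Build the libtaskd.so path from the text printed by `pip show taskd`."""
    for line in pip_show_output.splitlines():
        if line.startswith("Location: "):
            location = line[len("Location: "):].strip()
            return location.rstrip("/") + "/" + LIBTASKD_RELATIVE_PATH
    raise ValueError("no 'Location:' line found; is taskd installed?")


def prepend_ld_preload(path: str, current: str = "") -> str:
    """Return the new LD_PRELOAD value with `path` placed first."""
    return f"{path}:{current}" if current else path


if __name__ == "__main__":
    # Sample output for illustration; the real Location depends on the image.
    sample = "Name: taskd\nLocation: /usr/local/lib/python3.10/dist-packages\n"
    print("export LD_PRELOAD=" + prepend_ld_preload(libtaskd_path(sample)))
```

The printed line has the same form as the export statement in the training script and can be pasted into the startup script after checking that the file exists.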

Adaptation Example of MindSpore (MindFormers)

For details about how to prepare the training code and dataset, see MindFormers documentation. The following uses two Atlas 900 A3 SuperPoDs as an example to describe the operations.

  1. Prepare code.
    mkdir -p /data/atlas_dls/public/code
    cd /data/atlas_dls/public/code
    git clone https://gitee.com/mindspore/mindformers.git
    cd mindformers
    git checkout f06a946af29c8c7e002a6c49458f513d47b642e5
    # Rename mindformers to QWEN3_for_MS_code.
    cd ..
    mv mindformers QWEN3_for_MS_code
  2. Prepare a dataset.

    Download a dataset from DagsHub and save it to a directory on the server, for example, /data/atlas_dls/public/code/QWEN3_for_MS_code/dataset.

  3. Convert the dataset.
    1. Download the dataset conversion script.

      Download the dataset conversion script by referring to Dataset Conversion and save it to a directory on the server, for example, /data/atlas_dls/public/code/QWEN3_for_MS_code/dataset/gen_wiki_json.py.

    2. Download the tokenizer file.

      Download the tokenizer file from Qwen3-32B and save it to a directory on the server, for example, /data/atlas_dls/public/code/QWEN3_for_MS_code/dataset/Qwen3-32B-tokenizer.

    3. Convert the dataset.
      1. Start the container and mount the required files.
        docker run -it -v /data/atlas_dls/public/code/:/data/atlas_dls/public/code/ mindformers-dl:v1 bash
      2. Run the conversion script to convert wiki.train.tokens to the JSONL format.
        # Prepare the Python environment before executing the script.
        cd /data/atlas_dls/public/code/QWEN3_for_MS_code/dataset
        python gen_wiki_json.py --input wiki.train.tokens  --output wiki.jsonl 
      3. Convert the data from the JSONL format to the bin format.
        # If the error message "ModuleNotFoundError: No module named 'xxx'" is displayed during the execution, install the required dependency.
        cd /data/atlas_dls/public/code/QWEN3_for_MS_code
        python toolkit/data_preprocess/megatron/preprocess_indexed_dataset.py \
          --input /data/atlas_dls/public/code/QWEN3_for_MS_code/dataset/wiki.jsonl \
          --output-prefix /data/atlas_dls/public/code/QWEN3_for_MS_code/dataset/wiki103-megatron \
          --tokenizer-type HuggingFaceTokenizer \
          --tokenizer-dir /data/atlas_dls/public/code/QWEN3_for_MS_code/dataset/Qwen3-32B-tokenizer # Use the corresponding tokenizer path of models with other specifications.

        After the execution is complete, the wiki103-megatron_text_document.bin and wiki103-megatron_text_document.idx files are generated in the /data/atlas_dls/public/code/QWEN3_for_MS_code/dataset directory. When entering the dataset path, use /data/atlas_dls/public/code/QWEN3_for_MS_code/dataset/wiki103-megatron_text_document without the file name extension.
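
The relationship between the generated files and the dataset path can be checked with a short script. The following is a minimal sketch with hypothetical helper names. It assumes only what is stated above: the dataset path is the common prefix of the .bin and .idx files, without the file name extension.

```python
from pathlib import Path

DATASET_EXTENSIONS = (".bin", ".idx")


def missing_dataset_files(prefix: str) -> list[str]:
    """Return the .bin/.idx files that do not exist for a dataset prefix."""
    return [prefix + ext for ext in DATASET_EXTENSIONS
            if not Path(prefix + ext).is_file()]


def strip_dataset_extension(path: str) -> str:
    """Turn '<prefix>.bin' or '<prefix>.idx' into '<prefix>'."""
    for ext in DATASET_EXTENSIONS:
        if path.endswith(ext):
            return path[: -len(ext)]
    return path


if __name__ == "__main__":
    prefix = ("/data/atlas_dls/public/code/QWEN3_for_MS_code/dataset/"
              "wiki103-megatron_text_document")
    missing = missing_dataset_files(prefix)
    print("dataset path:", prefix)
    print("missing files:", missing if missing else "none")
```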

  4. Obtain the training job YAML and training startup script and modify them as required.
    1. If the value of hostNetwork in the training job YAML file is false, set GLOO_SOCKET_IFNAME in the startup script to eth0. Example:
      export GLOO_SOCKET_IFNAME=eth0  # eth0 is the communication network port in the container.
      export HCCL_SOCKET_IFNAME=eth0

      Then, modify other parameters in the startup script as required.

    2. Modify the IP address of the server to which the volume is mounted and other configurations in the job YAML file as required.
    3. To use TaskD to implement process-level rescheduling, process-level online recovery, process-level in-place recovery, link failover communication suspension and switchback, or online stress testing, you need to start TaskD Manager.
      1. Create a manager.py file and save it to the directory where the training script is called. The content of the manager.py file is as follows:
        from taskd.api import init_taskd_manager, start_taskd_manager
        import os
        
        job_id=os.getenv("MINDX_TASK_ID")
        node_nums=XX         # Total number of nodes (set by yourself)
        proc_per_node=XX     # Number of training processes on each node (set by yourself)
        
        init_taskd_manager({"job_id":job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
        start_taskd_manager()

        For details about the parameters in the manager.py file, see def init_taskd_manager(config:dict) -> bool:.

      2. Add the following code to the training script to start TaskD Manager. In the code, the first two statements add the path of libtaskd.so to the LD_PRELOAD environment variable after TaskD is installed. If these two statements do not take effect, run the pip show taskd command to obtain the value of Location, append /taskd/python/cython_api/libs/libtaskd.so to it, and export the resulting path manually.
        TASKD_SO_PATH="$(pip show taskd | awk '/^Location: / {print $2"/taskd/python/cython_api/libs/libtaskd.so"}')"
        export LD_PRELOAD=$TASKD_SO_PATH:$LD_PRELOAD
        export TASKD_PROCESS_ENABLE="on"
        if [[ "${MS_SCHED_HOST}" == "${POD_IP}" ]]; then
            python manager.py &   # Determined by the current path.
        fi
        msrun ...
      3. Modify the job YAML file by adding port 9601 for TaskD communication to all pods. (If the port exists, skip this step.)
        ...
                spec: 
        ...
                   containers: 
        ... 
                     ports:                           
                        - containerPort: 9601          
                          name: taskd-port
        ...
  5. Modify the parameter configuration file.
    1. Open the configs/qwen3/pretrain_qwen3_32b_4k.yaml file.
      vi configs/qwen3/pretrain_qwen3_32b_4k.yaml
    2. Press i to enter the insert mode and modify the parameter configuration file.
      1. Modify the following configurations, including the dataset path, distributed parallel parameters, and model parameters. The following model parameters are for reference only. Modify them as required.
        train_dataset: &train_dataset
          data_loader:
            type: BlendedMegatronDatasetDataLoader
            datasets_type: "GPTDataset"
            sizes:
              - 8000  # Number of samples in the training set
              - 0     # Number of samples in the test set (currently unsupported)
              - 0     # Number of samples in the evaluation set (currently unsupported)
            config:
              seed: 1234  # Random seed for data sampling
              split: "1, 0, 0"  # Proportions for training, test, and evaluation sets (test/eval currently unsupported)
              seq_length: 4096  # Sequence length of the dataset
              eod_mask_loss: False  # Whether to calculate loss at the end-of-document (EOD)
              reset_position_ids: False  # Whether to reset position_ids at EOD
              create_attention_mask: True  # Whether to include attention_mask in the dataset
              reset_attention_mask: False  # Whether to reset attention_mask at EOD, creating a stepped attention_mask
              create_compressed_eod_mask: False  # Whether to include a compressed attention_mask
              eod_pad_length: 128  # Length of the compressed attention_mask
              eod: 1  # Token ID for EOD in the dataset
              pad: -1  # Token ID for padding in the dataset
              data_path:  # Sampling proportion and path for the Megatron dataset
                - '1'
                - "/job/data/wiki103-megatron_text_document" # Dataset path
        ...
        # Parallel configuration
        parallel_config:
          data_parallel: &dp 4  # Data parallel size. If the high availability feature is used, it must be an even number.
          model_parallel: 8  # Model parallel size
          pipeline_stage: 1  # Number of pipeline stages
          micro_batch_num: 1  # Number of micro-batches for pipeline parallelism
          use_seq_parallel: False  # Whether to enable sequence parallelism
          gradient_aggregation_group: 1  # Size of the gradient communication operator fusion group
        # When model_parallel > 1, setting micro_batch_interleave_num to 2 may accelerate the training process.
        micro_batch_interleave_num: 1
        ...
        model:
          model_config:
            # Configurations from Hugging Face
            vocab_size: 75968            # A smaller value is used here for testing only. Adjust the value as required.
            hidden_size: 2560            # A smaller value is used here for testing only. Adjust the value as required.
            intermediate_size: 12800   # A smaller value is used here for testing only. Adjust the value as required.
            num_hidden_layers: 32      # A smaller value is used here for testing only. Adjust the value as required.
            num_attention_heads: 32    # A smaller value is used here for testing only. Adjust the value as required.
            num_key_value_heads: 8
            head_dim: 128
            hidden_act: 'swiglu'
            max_position_embeddings: 4096
            seq_length: 4096
            initializer_range: 0.02
            rms_norm_eps: 1.e-6
            use_cache: True
            tie_word_embeddings: False
            rope_theta: 1000000.
            attention_bias: False
            use_flash_attention: True
            add_bias_linear: False
            eos_token_id: 151645
            pad_token_id: 151643
            bos_token_id: 151643
            attention_dropout: 0.0
            # Configurations from MindFormers
            hidden_dropout: 0.0
            input_sliced_sig: True
            untie_embeddings_and_output_weights: True
            position_embedding_type: "rope"
            qk_layernorm: True
            use_contiguous_weight_layout_attention: False
            qkv_concat: True
            offset: [0]
            params_dtype: "float32"
            compute_dtype: "bfloat16"
            layernorm_compute_dtype: "float32"
            softmax_compute_dtype: "float32"
            rotary_dtype: "float32"
            residual_dtype: "float32"
            model_type: "qwen3"
            architectures: ["Qwen3ForCausalLM"]
      2. (Optional) If the dying gasp checkpoint is used, modify the following configuration fields so that, after the checkpoint is saved, it can be loaded when the job is recovered through pod-level rescheduling.

        Ensure that the directory under load_checkpoint contains normal checkpoints or is empty when training is started for the first time; otherwise training may fail to be started.

        resume_training: True 
        src_strategy_path_or_dir: './output/strategy'
        load_checkpoint: './output/checkpoint'
    3. Press Esc, type :wq!, and press Enter to save the changes and exit.
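
The parallel settings in the configuration file can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming that no other parallel dimension is enabled, so that data_parallel × model_parallel × pipeline_stage should equal the number of devices used by the job. The function name and the device count of 32 are assumptions for illustration. The even-number rule for data_parallel comes from the comment in the configuration above.

```python
def check_parallel_config(data_parallel: int, model_parallel: int,
                          pipeline_stage: int, device_num: int,
                          high_availability: bool = False) -> list[str]:
    """Return a list of problems found in the parallel settings."""
    problems = []
    product = data_parallel * model_parallel * pipeline_stage
    if product != device_num:
        problems.append(
            f"data_parallel * model_parallel * pipeline_stage = {product}, "
            f"but the job uses {device_num} devices")
    if high_availability and data_parallel % 2 != 0:
        problems.append(
            f"data_parallel = {data_parallel} must be even "
            "when the high availability feature is used")
    return problems


if __name__ == "__main__":
    # Values from the example configuration; device_num=32 is an assumption.
    issues = check_parallel_config(4, 8, 1, device_num=32,
                                   high_availability=True)
    print("OK" if not issues else "\n".join(issues))
```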

Adaptation Example of RL Post-Training (verl)

For verl post-training jobs, MindCluster supports only job-level rescheduling. The verl training jobs are managed by the Ray cluster. To adapt to the Ascend job deployment of MindCluster, one pod is deployed on each worker node, and all Ray cluster processes on that node run in the pod. The head node of the Ray cluster is the node on which Ascend Operator injects the environment variable RANK=0. The pod on that node starts the Ray cluster and submits the verl post-training job, and the pods on the other worker nodes join the Ray cluster. Finally, all nodes check whether the submitted training job is abnormal.

  • If an exception occurs, a non-0 exit code is returned. Then, Volcano detects the service exception and triggers job-level rescheduling.
  • If no exception occurs and the job is complete, the exit code 0 is returned.
  • Ensure that all the following steps are performed on each worker node.
  • In the following example, the Qwen3 30B MoE model and the DAPO-Math-17k dataset are used.
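
The exit code convention in the first two items above can be illustrated with a small wrapper. The following is a minimal sketch and not the start.sh logic used by MindCluster: run_and_report is a hypothetical helper that runs a command and returns its exit code unchanged, so that a failure stays visible as a non-0 code and a successful run as 0.

```python
import subprocess
import sys


def run_and_report(command: list[str]) -> int:
    """Run the job command and return its exit code unchanged."""
    completed = subprocess.run(command, check=False)
    return completed.returncode


# Hypothetical stand-ins for a failing and a successful job command.
failing = [sys.executable, "-c", "raise SystemExit(3)"]
passing = [sys.executable, "-c", "pass"]

print("failing job exit code:", run_and_report(failing))
print("passing job exit code:", run_and_report(passing))
```

In a pod entry script, the returned value would typically be passed to sys.exit so that the container exit code matches the job result.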

The following uses two Atlas 900 A3 SuperPoDs as an example to describe the operations.

  1. Convert model weights, that is, convert the HuggingFace model to the Megatron model. For details, see verl model conversion script.
    # Start the container. Change the model path as required.
    docker run -it \
    -v /qwen30b/Qwen3-30B-A3B-Instruct-2507:/qwen30b/Qwen3-30B-A3B-Instruct-2507 \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -e ASCEND_VISIBLE_DEVICES=0-15 \
    verl:v1 /bin/bash
     
    # Perform weight conversion.
    cd ~/verl
    python scripts/converter_hf_to_mcore.py \
    --hf_model_path /qwen30b/Qwen3-30B-A3B-Instruct-2507 \
    --output_path /qwen30b/Qwen3-30B-A3B-Instruct-Mcore

    If an error similar to "libgomp-947d5fa1.so.1.0.0: cannot allocate memory in static TLS block" is displayed, run the following command to preload the libgomp library:

    export LD_PRELOAD="/usr/local/lib/python3.10/dist-packages/sklearn/utils/../../scikit_learn.libs/libgomp-947d5fa1.so.1.0.0"

  2. Build the training script of Qwen3 30B MoE on verl. Use vLLM as the inference backend and Megatron as the training backend.

    Obtain the script example run_dapo_qwen3_30b_a3b_megatron.sh and save it to the examples_npu path in the verl directory. Create the dapo_trainer-megatron.yaml and runtime_env.yaml files in the examples_npu/config directory.

    • dapo_trainer-megatron.yaml
      # examples_npu/config/dapo_trainer-megatron.yaml
      hydra:
        searchpath:
          - file://verl/trainer/config
      defaults:
        - ppo_megatron_trainer
        - _self_
      data:
        gen_batch_size: ${data.train_batch_size}
      reward_model:
        reward_manager: dapo
        overlong_buffer: 
          enable: False # We try to avoid forgetting to set enable
          len: 0
          penalty_factor: 0.0
          log: False
      algorithm:
        filter_groups:
          _target_: verl.trainer.config.FilterGroupsConfig
          enable: False # We try to avoid forgetting to set enable
          metric: null # acc / score / seq_reward / seq_final_reward / ...
          max_num_gen_batches: 0 # Non-positive values mean no upper limit
      trainer:
        project_name: verl-dapo
    • runtime_env.yaml
      # examples_npu/config/runtime_env.yaml
      working_dir: ./
      excludes: ["/.git/"]
      env_vars:
        HCCL_EXEC_TIMEOUT: "7200"
        HCCL_CONNECT_TIMEOUT: "7200"
        VLLM_USE_V1: "1"
        VLLM_VERSION: "0.9.1"
        HCCL_IF_BASE_PORT: "23999"
        HCCL_ASYNC_ERROR_HANDLING: "0"
        P2P_HCCL_BUFFSIZE: "20"
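
Every value under env_vars in runtime_env.yaml is quoted. Ray documents env_vars as a mapping from strings to strings, and an unquoted value such as 7200 would be read by a YAML parser as an integer. The following is a minimal sketch of a check for this. The helper name is hypothetical, and the dictionary is assumed to be the already parsed env_vars section.

```python
def non_string_env_vars(env_vars: dict) -> dict:
    """Return the entries whose value is not a string."""
    return {key: value for key, value in env_vars.items()
            if not isinstance(value, str)}


if __name__ == "__main__":
    # Same keys as in runtime_env.yaml; one value is left unquoted on purpose.
    parsed = {
        "HCCL_EXEC_TIMEOUT": 7200,        # unquoted in YAML -> int
        "HCCL_CONNECT_TIMEOUT": "7200",
        "VLLM_USE_V1": "1",
    }
    for key, value in non_string_env_vars(parsed).items():
        print(f"{key}: {value!r} should be quoted in runtime_env.yaml")
```
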
  3. Build a Ray startup script that adapts to MindCluster. Prepare a Ray startup script on each worker node and save the script on two Atlas 900 A3 SuperPoDs. The NIC information must be configured as required, and other information can remain unchanged.

    Obtain the script example start.sh and save the script to the verl directory.

  4. Obtain the verl-resche.yaml file in Preparation of Job YAML Files, modify the parameters in the file as required, and run the following command to start the job.
    kubectl apply -f verl-resche.yaml

    After the job is started, the following iteration information is displayed.