Configuring Elastic Training

This section describes how to configure elastic training. For details about its features, restrictions, supported products, and working principles, see Elastic Training.

Prerequisite

Procedure

  1. Modify the training script so that it starts TaskD Manager after the distributed environment is initialized and the global rank is available.
    1. Create a manager.py file and save it to the directory from which the training script is launched. The content of the manager.py file is as follows:
      from taskd.api import init_taskd_manager, start_taskd_manager
      import os
      
      job_id=os.getenv("MINDX_TASK_ID")
      node_nums=XX         # Total number of nodes (set by yourself)
      proc_per_node=XX     # Number of training processes on each node (set by yourself)
      
      init_taskd_manager({"job_id":job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
      start_taskd_manager()

      For details about the parameters passed to init_taskd_manager in the manager.py file, see def init_taskd_manager(config:dict) -> bool.

    2. Add the following code to the training script to start TaskD Manager.
      sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
      export TASKD_PROCESS_ENABLE="on"
      if [[ "${RANK}" == 0 ]]; then
          export MASTER_ADDR=${POD_IP}
          python manager.py &            # Adjust the path to manager.py to match the current working directory.
      fi
          
      torchrun ...
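
The node_nums and proc_per_node placeholders in manager.py could also be filled in at launch time rather than edited by hand. The following is a minimal sketch of that idea, not the documented method: the environment variable names TASKD_NODE_NUMS and TASKD_PROC_PER_NODE and the helper build_manager_config are hypothetical, and only MINDX_TASK_ID, init_taskd_manager, and start_taskd_manager are taken from the example above.

```python
import os


def build_manager_config(env=None):
    """Build the dict passed to init_taskd_manager from environment variables.

    TASKD_NODE_NUMS and TASKD_PROC_PER_NODE are hypothetical names chosen for
    this sketch; MINDX_TASK_ID is the variable read in manager.py.
    """
    env = os.environ if env is None else env
    job_id = env.get("MINDX_TASK_ID")
    if not job_id:
        raise ValueError("MINDX_TASK_ID is not set")

    config = {"job_id": job_id}
    for key, var in (("node_nums", "TASKD_NODE_NUMS"),
                     ("proc_per_node", "TASKD_PROC_PER_NODE")):
        raw = env.get(var)
        if raw is None:
            raise ValueError(f"{var} is not set")
        value = int(raw)  # raises ValueError on non-numeric input
        if value < 1:
            raise ValueError(f"{var} must be a positive integer, got {value}")
        config[key] = value
    return config


def main():
    # Imported lazily so the helper above can be exercised without TaskD installed.
    from taskd.api import init_taskd_manager, start_taskd_manager

    init_taskd_manager(build_manager_config())
    start_taskd_manager()
```

Calling main() at the end of manager.py would leave the launch command (python manager.py &) unchanged; the two hypothetical variables would then need to be exported before that command runs.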
  2. Modify the job YAML file.
    Add the following fields to the job YAML file to enable elastic training, and add port 9601 to all pods for TaskD communication.
    ...  
       labels:  
         ...  
         fault-scheduling: "force"
     ... 
    ...  
       annotations:  
         ...
         wait-reschedule-timeout: "270" # Time, in seconds, that process-level recovery waits for the faulty node to be rescheduled. The value ranges from 30 to 270, and the default value is 270. If both process-level recovery and elastic training are enabled, process-level recovery is performed when the faulty node is rescheduled within this time; otherwise, elastic training is triggered.
         recover-strategy: "elastic-training"    # Recovery policy. elastic-training indicates that elastic training is enabled.
     ... 
    ...
    spec:
      replicaSpecs:
        Master:
          template:
            spec:
              containers:
              - name: ascend # do not modify
                env:
                  - name: MINDIO_WAIT_MINDX_TIME        # When elastic training is enabled and process-level recovery is not, you are advised to set this parameter to a value greater than 60.
                    value: "60"
                args:
                  - | 
                    ...
                    bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                      ...
                ports:
                  - containerPort: 9601
                    name: taskd-port
    ...
        Worker:
          template:
            spec:
              containers:
              - name: ascend # do not modify
                env:
                  - name: MINDIO_WAIT_MINDX_TIME        # When elastic training is enabled and process-level recovery is not, you are advised to set this parameter to a value greater than 60.
                    value: "60"
                args:
                  - |
                    ...
                    bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                      ...
                ports:
                  - containerPort: 9601
                    name: taskd-port
    ...
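
The comment on wait-reschedule-timeout describes a simple decision rule for jobs that enable both process-level recovery and elastic training. The sketch below restates that rule in Python purely as an illustration; it is not the scheduler's actual implementation, and the function names are hypothetical. It assumes that the 30 to 270 range is inclusive and that an omitted annotation falls back to the default of 270 seconds.

```python
DEFAULT_TIMEOUT_S = 270
MIN_TIMEOUT_S, MAX_TIMEOUT_S = 30, 270


def parse_wait_reschedule_timeout(annotations):
    """Return wait-reschedule-timeout in seconds, validated against 30..270."""
    raw = annotations.get("wait-reschedule-timeout")
    if raw is None:
        return DEFAULT_TIMEOUT_S
    seconds = int(raw)  # annotation values are strings in the job YAML
    if not MIN_TIMEOUT_S <= seconds <= MAX_TIMEOUT_S:
        raise ValueError(
            f"wait-reschedule-timeout must be within "
            f"{MIN_TIMEOUT_S}..{MAX_TIMEOUT_S}, got {seconds}"
        )
    return seconds


def choose_recovery(annotations, rescheduled_after_s):
    """Pick the recovery path when both mechanisms are enabled.

    rescheduled_after_s is the time in seconds until the faulty node was
    rescheduled, or None if it was never rescheduled.
    """
    timeout = parse_wait_reschedule_timeout(annotations)
    if rescheduled_after_s is not None and rescheduled_after_s <= timeout:
        return "process-level recovery"
    return "elastic training"
```

For example, with wait-reschedule-timeout set to "60", a node rescheduled after 45 seconds leads to process-level recovery, whereas one that is still unscheduled after 60 seconds leads to elastic training.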
  3. Modify the training framework code.
    In the mindcluster-deploy repository, switch to the branch indicated in mindcluster-deploy Version Description, obtain the train_start.sh file from the samples/train/resumable-training/fault-tolerance/without-ranktable/pytorch/Qwen3 directory, and create the following directory structure on the management node.
    root@ubuntu:/data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code/scripts#
    scripts/
    └── train_start.sh
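
The directory layout above can be created with a short helper once train_start.sh has been obtained. This is a minimal sketch under stated assumptions: the function name stage_train_start is hypothetical, and the location of the downloaded file is whatever path you saved it to.

```python
import shutil
from pathlib import Path


def stage_train_start(downloaded_script, code_root):
    """Copy train_start.sh into <code_root>/scripts/ and return the new path."""
    src = Path(downloaded_script)
    if not src.is_file():
        raise FileNotFoundError(f"train_start.sh not found at {src}")
    scripts_dir = Path(code_root) / "scripts"
    scripts_dir.mkdir(parents=True, exist_ok=True)  # create scripts/ if missing
    dest = scripts_dir / "train_start.sh"
    shutil.copy2(src, dest)
    return dest
```

On the management node this might be called as stage_train_start("/tmp/train_start.sh", "/data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code"), where /tmp/train_start.sh is only a placeholder for the downloaded file.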