Configuring Process-Level Online Recovery

This section describes how to configure process-level online recovery. For details about its features, restrictions, supported products, and working principles, see Process-Level Online Recovery.

Building an Image

Add the following commands to the Dockerfile that is used to build the container image.

# MindCluster resumable training adaptation script. TASKD_WHL is the path of the TaskD whl installation package, and MINDIO_TTP_PKG is the path of the MindIO whl installation package. Set them as required.
# (Optional) In the PyTorch framework, if graceful fault tolerance, pod-level rescheduling, or process-level rescheduling is required, configure the following commands.
RUN pip3 install $TASKD_WHL 
RUN pip3 install $MINDIO_TTP_PKG
RUN sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py

# (Optional) In the MindSpore framework, if process-level online recovery is used, configure the following commands.
RUN pip3 install $TASKD_WHL
RUN pip3 install $MINDIO_TTP_PKG
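
The sed command in the PyTorch block inserts the line import taskd.python.adaptor.patch before each line of torch/distributed/run.py that matches import os. The following Python sketch illustrates that text transformation only. It is not part of TaskD, the function name insert_before_matches and the sample text are made up for illustration, and it uses a plain substring match, which is equivalent to the sed regular expression only because the pattern contains no special characters.

```python
def insert_before_matches(text: str, pattern: str, new_line: str) -> str:
    """Insert new_line before every line that contains pattern.

    Mirrors the effect of: sed '/pattern/i new_line'
    """
    out = []
    for line in text.splitlines():
        if pattern in line:
            out.append(new_line)
        out.append(line)
    # Keep the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


# Hypothetical stand-in for the beginning of run.py.
sample = "import logging\nimport os\nimport sys\n"
patched = insert_before_matches(
    sample, "import os", "import taskd.python.adaptor.patch"
)
print(patched)
```

Because the patch line is inserted before import os, the TaskD adaptor is imported before the rest of the launcher code runs.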

Preparing a Job YAML File

Add the following fields to the job YAML file to enable process-level recovery, and add port 9601 to all pods for TaskD communication.

...  
   labels:  
     ...  
     fault-scheduling: "grace"
 ... 
...  
   annotations:  
     ...  
     recover-strategy: "retry"  # Recovery policy (retry: process-level online recovery; recover: process-level rescheduling; recover-in-place: process-level in-place recovery; elastic-training: elastic training; dump: saving dying gasp; exit: exiting training). The six policies can be combined as required, and the policies are separated by commas (,).
 ... 
...
spec:
  replicaSpecs:
    Master:
      template:
        spec:
          containers:
          - name: ascend # do not modify
            ...
            args:
              - | 
                ... 
                bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                  ...
            ports:
                - containerPort: 9601          
                  name: taskd-port
...
    Worker:
      template:
        spec:
          containers:
          - name: ascend # do not modify
            ...
            args:
              - |
                ...
                bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                  ...
            ports:
                - containerPort: 9601          
                  name: taskd-port
...
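
The recover-strategy annotation accepts one or more of the six policy names, separated by commas. As a minimal sketch, the value could be checked before the job is submitted as shown below. The helper parse_recover_strategy is hypothetical and not part of MindCluster, and tolerating spaces around the commas is an assumption of this sketch rather than documented behavior.

```python
from typing import List

# The six policy names listed in the annotation comment above.
VALID_POLICIES = (
    "retry",
    "recover",
    "recover-in-place",
    "elastic-training",
    "dump",
    "exit",
)


def parse_recover_strategy(value: str) -> List[str]:
    """Split a comma-separated recover-strategy value and validate each policy."""
    policies = [item.strip() for item in value.split(",") if item.strip()]
    if not policies:
        raise ValueError("recover-strategy must contain at least one policy")
    unknown = [p for p in policies if p not in VALID_POLICIES]
    if unknown:
        raise ValueError("unknown recover-strategy policies: " + ", ".join(unknown))
    return policies


print(parse_recover_strategy("retry"))
print(parse_recover_strategy("retry,recover,dump"))
```

A misspelled policy name raises ValueError here, so the mistake is caught before the job YAML is applied.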
In the MindSpore scenario, you also need to modify the model parameter configuration YAML file. Open the QWEN3_for_MS_code/configs/qwen3/pretrain_qwen3_32b_4k.yaml file and update the context configuration as shown below.
# mindspore context init config
context:
  mode: 0  # 0: Graph mode; 1: PyNative mode
  device_target: "Ascend"
  graph_kernel_flags: "--disable_pass=cluster.floatstatus_fusion,preprocess.depend_elimination"
  max_call_depth: 10000
  max_device_memory: "59GB"
  mempool_block_size: "59GB"
  save_graphs: True
  save_graphs_path: "./graph"
  device_id: 0
  jit_config:
    jit_level: "O1"
  memory_optimize_level: "O0"
  ascend_config:
    hccl_watchdog: False

Adapting the Training Script

  1. Modify the training script to start TaskD Manager after the distributed environment is initialized and the global rank can be obtained.
    1. Create a manager.py file as follows and save it to the directory where the training script is called.
      from taskd.api import init_taskd_manager, start_taskd_manager
      import os
       
      job_id=os.getenv("MINDX_TASK_ID")
      node_nums=XX         # Total number of nodes (set by yourself)
      proc_per_node=XX     # Number of training processes on each node (set by yourself)
       
      init_taskd_manager({"job_id":job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
      start_taskd_manager()

      For details about the parameters in the manager.py file, see def init_taskd_manager(config:dict) -> bool:.

    2. Add the following code to the training script (for example, train_start.sh) to start TaskD Manager. The TASKD_SO_PATH and export LD_PRELOAD statements add the path of libtaskd.so, which is available after TaskD is installed, to the LD_PRELOAD environment variable. If these two statements do not take effect, run the pip show taskd command to obtain the value of Location, append /taskd/python/cython_api/libs/libtaskd.so to that value, and export the resulting path in LD_PRELOAD manually.
      TASKD_SO_PATH="$(pip show taskd | awk '/^Location: / {print $2"/taskd/python/cython_api/libs/libtaskd.so"}')"
      export LD_PRELOAD=$TASKD_SO_PATH:$LD_PRELOAD
      export TASKD_PROCESS_ENABLE="on"
      # PyTorch
      if [[ "${RANK}" == 0 ]]; then
          export MASTER_ADDR=${POD_IP}
          python manager.py &   # Adjust the path to manager.py based on the current directory.
      fi
      # MindSpore
      if [[ "${MS_SCHED_HOST}" == "${POD_IP}" ]]; then
          python manager.py &   # Adjust the path to manager.py based on the current directory.
      fi
  2. (Optional) Add the --max_restarts parameter to the startup script, for example, train_start.sh.
    ... 
       logger "server id is: ""${server_id}" 
       if [ "${framework}" == "PyTorch" ]; then 
         get_env_for_pytorch_multi_node_job 
         DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT --max_restarts 32767" 

    --max_restarts specifies the maximum number of times that faults can be triggered in the container. The value is an integer. If the number of faults exceeds this limit, the PyTorch training process exits directly. If this parameter is not set, the default value 32767 is used.

    • In the MindSpeed scenario, modify the train_start.sh script and add the following environment variables.
      export HCCL_OP_RETRY_ENABLE="L0:0, L1:1, L2:1"   # Enable HCCL operator re-execution (operator-level online recovery). Re-execution occurs when an SDMA or RDMA CQE error is reported during the execution of a communication operator. In this case, HCCL attempts to re-run the operator.
      export HCCL_ASYNC_ERROR_HANDLING=0
    • In the MindFormers scenario, modify the msrun_launcher.sh script and add the following environment variables.
      export MS_ENABLE_TFT='{UCE:1, HCCE:1}'     # Enable process-level online recovery for on-chip memory faults and network faults, respectively.
      export HCCL_OP_RETRY_ENABLE="L0:0, L1:1, L2:1" # This environment variable is used to configure whether to enable HCCL operator re-execution. Re-execution occurs when an SDMA or RDMA CQE error is reported during the execution of a communication operator. In this case, HCCL attempts to re-run the operator.
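
The manual fallback for locating libtaskd.so described in step 1 (read Location from pip show taskd, append the library sub-path, and put the result in LD_PRELOAD) can be sketched in Python as follows. The function names and the sample pip show output are illustrative assumptions. In a real environment the text would come from running pip show taskd.

```python
from typing import Optional

# Sub-path of libtaskd.so below the install location, as given in step 1.
LIB_SUBPATH = "taskd/python/cython_api/libs/libtaskd.so"


def taskd_so_path(pip_show_output: str) -> Optional[str]:
    """Build the libtaskd.so path from the 'Location: ' line of pip show output."""
    prefix = "Location: "
    for line in pip_show_output.splitlines():
        if line.startswith(prefix):
            location = line[len(prefix):].strip().rstrip("/")
            return location + "/" + LIB_SUBPATH
    return None  # TaskD is not installed or the output has no Location line.


def prepend_ld_preload(so_path: str, current: str) -> str:
    """Return the new LD_PRELOAD value with so_path placed first."""
    return so_path + ":" + current if current else so_path


# Hypothetical pip show output; the install location is an example only.
sample_output = "Name: taskd\nLocation: /usr/local/lib/python3.9/site-packages\n"
so_path = taskd_so_path(sample_output)
print(so_path)
print(prepend_ld_preload(so_path, ""))
```

Unlike the shell one-liner, this sketch does not leave a trailing colon when LD_PRELOAD is initially empty. That is a design choice of the example, not a TaskD requirement.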

To test the process-level online recovery function, configure the environment variables by referring to Process-Level Online Recovery.