Configuring Elastic Training
This section describes how to configure elastic training. For details about its features, restrictions, supported products, and working principles, see Elastic Training.
Prerequisites
- Ascend Docker Runtime, Ascend Operator, ClusterD, Ascend Device Plugin, and Volcano have been installed on corresponding nodes. (The versions of the preceding MindCluster components must match those of TaskD.)
- torch_npu (≥ 7.1.RC1), CANN (≥ 8.2.RC1), TaskD, and MindIO (≥ 7.2.RC1) have been installed in the container.
Procedure
- Modify the training script so that it starts TaskD Manager after the distributed environment is initialized and the global rank can be obtained.
- Create a manager.py file and save it to the directory where the training script is called. The content of the manager.py file is as follows:
    from taskd.api import init_taskd_manager, start_taskd_manager
    import os

    job_id = os.getenv("MINDX_TASK_ID")
    node_nums = XX       # Total number of nodes (set by yourself)
    proc_per_node = XX   # Number of training processes on each node (set by yourself)

    init_taskd_manager({"job_id": job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
    start_taskd_manager()
For details about the parameters in the manager.py file, see def init_taskd_manager(config:dict) -> bool:.
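Instead of editing the XX placeholders by hand, the two counts could be read from the environment. The following is a minimal sketch and not part of TaskD: the helper name build_manager_config and the variables NODE_NUMS and PROC_PER_NODE are hypothetical, and rejecting a missing MINDX_TASK_ID is an assumption of this sketch. Only MINDX_TASK_ID and the three dictionary keys come from the manager.py example.

```python
import os


def build_manager_config(env=None):
    """Build the dict passed to init_taskd_manager from environment variables.

    NODE_NUMS and PROC_PER_NODE are hypothetical names chosen for this sketch;
    MINDX_TASK_ID is the variable used in manager.py.
    """
    env = os.environ if env is None else env

    job_id = env.get("MINDX_TASK_ID")
    if not job_id:
        # Assumption of this sketch: treat a missing job ID as an error.
        raise ValueError("MINDX_TASK_ID is not set")

    config = {"job_id": job_id}
    for key, var in (("node_nums", "NODE_NUMS"), ("proc_per_node", "PROC_PER_NODE")):
        raw = env.get(var)
        if raw is None:
            raise ValueError(f"{var} is not set")
        value = int(raw)  # raises ValueError on non-numeric input
        if value < 1:
            raise ValueError(f"{var} must be a positive integer, got {value}")
        config[key] = value
    return config
```

In manager.py, the returned dictionary would take the place of the literal dictionary passed to init_taskd_manager.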
- Add the following code to the training script to start TaskD Manager.
    sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
    export TASKD_PROCESS_ENABLE="on"
    if [[ "${RANK}" == 0 ]]; then
        export MASTER_ADDR=${POD_IP}
        python manager.py &   # Determined by the current path.
    fi
    torchrun ...
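The sed command inserts the patch import before every line that matches "import os", and it does so again each time the script runs. If the start script may run more than once in the same container, an idempotent variant could be used instead. The following is a minimal sketch under that assumption; the function name insert_patch_import is hypothetical, and unlike sed it inserts the line only once, before the first match.

```python
PATCH_LINE = "import taskd.python.adaptor.patch"


def insert_patch_import(source: str) -> str:
    """Return source with PATCH_LINE inserted before the first 'import os' line.

    Text that already contains PATCH_LINE is returned unchanged, so applying
    the function repeatedly has the same effect as applying it once.
    """
    lines = source.splitlines(keepends=True)
    if any(line.strip() == PATCH_LINE for line in lines):
        return source  # already patched
    for index, line in enumerate(lines):
        if "import os" in line:
            lines.insert(index, PATCH_LINE + "\n")
            return "".join(lines)
    return source  # no 'import os' line found; nothing to anchor on
```

The function works on text only. Reading and writing torch/distributed/run.py, located as in the sed command, is left to the caller.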
- Modify the job YAML file. Add the following fields to the job YAML file to enable elastic training, and add port 9601 to all pods for TaskD communication.
    ...
      labels:
        ...
        fault-scheduling: "force"
        ...
    ...
      annotations:
        ...
        # Timeout interval for process-level recovery to wait for rescheduling of the faulty node.
        # The default value is 270 seconds. The value ranges from 30 to 270.
        # If both process-level recovery and elastic training are enabled and the faulty node is
        # successfully rescheduled within the waiting time, process-level recovery is performed.
        # Otherwise, elastic training is triggered.
        wait-reschedule-timeout: "270"
        # Recovery policy. elastic-training indicates that elastic training is enabled.
        recover-strategy: "elastic-training"
        ...
    ...
    spec:
      replicaSpecs:
        Master:
          template:
            spec:
              containers:
              - name: ascend # do not modify
                env:
                  # Applies when process-level recovery is not enabled. You are advised to set this
                  # parameter to a value greater than 60 when elastic training is enabled.
                  - name: MINDIO_WAIT_MINDX_TIME
                    value: "60"
                args:
                  - |
                    ...
                    bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                    ...
                ports:
                  - containerPort: 9601
                    name: taskd-port
                ...
        Worker:
          template:
            spec:
              containers:
              - name: ascend # do not modify
                env:
                  # Applies when process-level recovery is not enabled. You are advised to set this
                  # parameter to a value greater than 60 when elastic training is enabled.
                  - name: MINDIO_WAIT_MINDX_TIME
                    value: "60"
                args:
                  - |
                    ...
                    bash scripts/train_start.sh /job/code /job/output pretrain_gpt.py \
                    ...
                ports:
                  - containerPort: 9601
                    name: taskd-port
                ...
- Modify the training framework code. Go to the mindcluster-deploy repository, switch to the branch that matches your version according to mindcluster-deploy Version Description, obtain the train_start.sh file from the samples/train/resumable-training/fault-tolerance/without-ranktable/pytorch/Qwen3 directory, and create the following directory structure on the management node.
    root@ubuntu:/data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code/scripts#
    scripts/
    └── train_start.sh
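Before submitting the job, the constraints stated for the job YAML fields in this procedure (recover-strategy set to elastic-training, wait-reschedule-timeout between 30 and 270, and port 9601 on every pod) could be checked programmatically. The following is a minimal sketch, not a MindCluster tool: the function name check_elastic_settings is hypothetical, and it takes the annotations and replicaSpecs mappings as already-parsed dictionaries because the full manifest layout is not shown here.

```python
def check_elastic_settings(annotations: dict, replica_specs: dict) -> list:
    """Return a list of problems found in the elastic-training job settings."""
    problems = []

    if annotations.get("recover-strategy") != "elastic-training":
        problems.append('recover-strategy is not "elastic-training"')

    # The timeout is optional (default 270) but must lie in 30..270 when set.
    timeout = annotations.get("wait-reschedule-timeout")
    if timeout is not None:
        try:
            in_range = 30 <= int(timeout) <= 270
        except (TypeError, ValueError):
            in_range = False
        if not in_range:
            problems.append(f"wait-reschedule-timeout {timeout!r} is outside 30..270")

    # Every replica type (Master, Worker) needs port 9601 for TaskD communication.
    for role, replica in replica_specs.items():
        containers = replica["template"]["spec"]["containers"]
        ports = [p.get("containerPort") for c in containers for p in c.get("ports", [])]
        if 9601 not in ports:
            problems.append(f"{role}: containerPort 9601 is missing")

    return problems
```

An empty list means that none of the checked constraints is violated. It does not guarantee that the job is otherwise valid.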