Configuring Pod-Level Rescheduling
This section describes how to configure pod-level rescheduling. For details about its features, restrictions, supported products, and working principles, see Pod-Level Rescheduling.
Building an Image
Add the following commands to the Dockerfile used to build the container image. Example:
# MindCluster resumable training adaptation script. TASKD_WHL is the path of the TaskD whl installation package. Set it as required.
# (Optional) In the PyTorch framework, if graceful fault tolerance, pod-level rescheduling, or process-level rescheduling is required, configure the following commands.
RUN pip install $TASKD_WHL
RUN sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
# (Optional) In the MindSpore framework, if pod-level rescheduling is required, configure the following command.
RUN pip install $TASKD_WHL
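To make the effect of the sed command above concrete, the following sketch reproduces it in Python on an in-memory string: it inserts the import line before every line that matches import os, as the sed i command does. The helper names insert_taskd_patch and is_patched are illustrative and not part of TaskD, and the sample text is a stand-in, not the real contents of torch/distributed/run.py.

```python
PATCH_LINE = "import taskd.python.adaptor.patch"


def insert_taskd_patch(source: str) -> str:
    """Insert PATCH_LINE before every line containing 'import os'.

    Mirrors: sed -i '/import os/i import taskd.python.adaptor.patch' run.py
    """
    patched = []
    for line in source.splitlines():
        if "import os" in line:
            patched.append(PATCH_LINE)
        patched.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(patched) + ("\n" if source.endswith("\n") else "")


def is_patched(source: str) -> bool:
    """Return True if the patch import already appears on its own line."""
    return any(line.strip() == PATCH_LINE for line in source.splitlines())


# Stand-in for the beginning of run.py (not the real file contents).
sample = "import logging\nimport os\nimport sys\n"
result = insert_taskd_patch(sample)
print(result)
```

Because the sed command inserts the line every time it runs, a check such as is_patched can help avoid a duplicate import if the command is ever run against a file that was already patched.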
Preparing a Job YAML File
Add the following fields to the job YAML file to enable pod-level rescheduling, and expose port 9601 on all pods for TaskD communication.
...
metadata:
  labels:
    ...
    pod-rescheduling: "on"
    fault-scheduling: "force"
...
spec:
  ...
      containers:
        ...
        ports:
          - containerPort: 9601
            name: taskd-port
...
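A job definition can be checked for these settings before it is submitted. The sketch below works on a manifest that has already been loaded into a Python dict (for example, with a YAML parser) and checks for the label values and the port shown above. The function names and the nesting of the sample manifest are assumptions for illustration, because the fields elided with ... depend on the job type.

```python
def find_containers(node):
    """Yield every container dict found under any 'containers' key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "containers" and isinstance(value, list):
                for item in value:
                    if isinstance(item, dict):
                        yield item
            else:
                yield from find_containers(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_containers(item)


def check_pod_rescheduling(job: dict) -> list:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    labels = (job.get("metadata") or {}).get("labels") or {}
    if labels.get("pod-rescheduling") != "on":
        problems.append('label pod-rescheduling is not "on"')
    if labels.get("fault-scheduling") != "force":
        problems.append('label fault-scheduling is not "force"')

    containers = list(find_containers(job.get("spec") or {}))
    if not containers:
        problems.append("no containers found under spec")
    for container in containers:
        ports = [p.get("containerPort") for p in (container.get("ports") or [])]
        if 9601 not in ports:
            name = container.get("name", "?")
            problems.append(f"container {name} does not expose port 9601")
    return problems


# Hypothetical manifest; the nesting under spec is made up for illustration.
sample_job = {
    "metadata": {"labels": {"pod-rescheduling": "on", "fault-scheduling": "force"}},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "trainer",
                     "ports": [{"containerPort": 9601, "name": "taskd-port"}]}
                ]
            }
        }
    },
}
print(check_pod_rescheduling(sample_job))
```

The recursive search is a deliberate choice: it avoids hard-coding where the containers list sits, which differs between job kinds.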
Adapting the Training Script
- Add the following settings to the startup script (for example, train_start.sh).
...
export MS_ENABLE_TFT='{RSC:1}'   # Enable pod-level rescheduling in MindSpore scenarios.
...
# (Optional) In PyTorch scenarios, set the number of restarts in the container and the monitoring interval of training processes.
logger "server id is: ""${server_id}"
if [ "${framework}" == "PyTorch" ]; then
    get_env_for_pytorch_multi_node_job
    DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT --max_restarts 32767"
    ...

--max_restarts sets the maximum number of fault-triggered restarts allowed in the container. The value is an integer. Once the number of restarts exceeds this limit, the PyTorch training process exits. If this parameter is not set, the default value 32767 is used.
- Modify the training script so that it starts TaskD Manager after the distributed environment is initialized and the global rank can be obtained.
- Create a manager.py file as follows and save it to the directory where the training script is called.
from taskd.api import init_taskd_manager, start_taskd_manager
import os

job_id = os.getenv("MINDX_TASK_ID")
node_nums = XX        # Total number of nodes (set by yourself)
proc_per_node = XX    # Number of training processes on each node (set by yourself)

init_taskd_manager({"job_id": job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
start_taskd_manager()
For details about the parameters in the manager.py file, see def init_taskd_manager(config:dict) -> bool:.
- Add the following code to the training script (for example, train_start.sh) to start TaskD Manager. In the code:
- The TASKD_SO_PATH and export LD_PRELOAD statements add the path of libtaskd.so, which is available after TaskD is installed, to the LD_PRELOAD environment variable. If these two statements do not take effect, run the pip show taskd command to obtain the value of Location, append /taskd/python/cython_api/libs/libtaskd.so to it, and export the resulting path to LD_PRELOAD manually.
- TASKD_PROCESS_ENABLE: export TASKD_PROCESS_ENABLE="off" is required only if no recovery policy is configured under recover-strategy in the job YAML file and hot switching is disabled. If a recovery policy is configured under recover-strategy or hot switching is enabled, do not configure export TASKD_PROCESS_ENABLE="off".
TASKD_SO_PATH="$(pip show taskd | awk '/^Location: / {print $2"/taskd/python/cython_api/libs/libtaskd.so"}')"
export LD_PRELOAD=$TASKD_SO_PATH:$LD_PRELOAD
export TASKD_PROCESS_ENABLE="off"

# PyTorch
if [[ "${RANK}" == 0 ]]; then
    export MASTER_ADDR=${POD_IP}
    python manager.py &   # Determined by the current path.
fi

# MindSpore
if [[ "${MS_SCHED_HOST}" == "${POD_IP}" ]]; then
    python manager.py &   # Determined by the current path.
fi
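The two notes on LD_PRELOAD and TASKD_PROCESS_ENABLE can be expressed as small helpers. The Python sketch below, with hypothetical function names, builds the libtaskd.so path from the text printed by pip show taskd (the manual fallback described for LD_PRELOAD) and encodes the rule for when export TASKD_PROCESS_ENABLE="off" is needed. How the recover-strategy value and the hot switching state are obtained is left to the caller and is an assumption of this example.

```python
from typing import Optional

LIB_SUFFIX = "/taskd/python/cython_api/libs/libtaskd.so"


def libtaskd_path(pip_show_output: str) -> str:
    """Build the libtaskd.so path from the output of `pip show taskd`."""
    for line in pip_show_output.splitlines():
        if line.startswith("Location: "):
            location = line[len("Location: "):].strip()
            return location + LIB_SUFFIX
    raise ValueError("no 'Location:' line found; is taskd installed?")


def needs_process_enable_off(recover_strategy: Optional[str],
                             hot_switch_enabled: bool) -> bool:
    """Return True when export TASKD_PROCESS_ENABLE="off" should be set.

    It is needed only when no recovery policy is configured under
    recover-strategy and hot switching is disabled.
    """
    has_policy = bool(recover_strategy and recover_strategy.strip())
    return not has_policy and not hot_switch_enabled


# Example pip output; the install location is made up for illustration.
example_output = (
    "Name: taskd\n"
    "Version: 0.0.0\n"
    "Location: /usr/local/lib/python3.9/site-packages\n"
)
print(libtaskd_path(example_output))
print(needs_process_enable_off("", hot_switch_enabled=False))
```

The resulting path is the value to prepend to LD_PRELOAD, in the same way as the export statement in the script above.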