Configuring Hot Switching

This section describes how to configure hot switching. For details about its features, restrictions, supported products, and working principles, see Hot Switching.

Building an Image

Use a Dockerfile to build the container image, and add the commands for the framework you use. Example:

# MindCluster resumable training adaptation script. TASKD_WHL is the path of the TaskD whl installation package, MINDIO_TTP_PKG is the path of the MindIO whl installation package, and MINDSPORE_WHL is the path of the MindSpore whl installation package. Set these paths as required.
# (Optional) To enable hot switching in the PyTorch framework, configure the following commands.
RUN pip3 install $TASKD_WHL
RUN pip3 install $MINDIO_TTP_PKG
RUN sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py

# (Optional) To enable hot switching in the MindSpore framework, configure the following commands.
RUN pip3 install $MINDIO_TTP_PKG
RUN pip3 install $TASKD_WHL
RUN pip3 install $MINDSPORE_WHL
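
The sed command in the PyTorch block inserts a line before every line of torch/distributed/run.py that matches "import os". As a rough illustration of that behavior, the following Python sketch applies the same rule to a short sample string. The function name insert_before_matches and the sample text are hypothetical and are not part of TaskD or PyTorch.

```python
def insert_before_matches(source: str, pattern: str, new_line: str) -> str:
    """Insert new_line before every line containing pattern.

    This mirrors sed '/pattern/i new_line' for a pattern without
    regex metacharacters, such as "import os".
    """
    out = []
    for line in source.splitlines(keepends=True):
        if pattern in line:
            out.append(new_line + "\n")
        out.append(line)
    return "".join(out)


# Hypothetical stand-in for the first lines of run.py.
sample = "import logging\nimport os\nimport sys\n"
patched = insert_before_matches(
    sample, "import os", "import taskd.python.adaptor.patch"
)
print(patched)
```

Like the sed command, this sketch is not idempotent: applying it twice inserts the import twice, so the patch is best applied once, at image build time.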

Preparing a Job YAML File

Add the following fields to the job YAML file to enable hot switching, and expose port 9601 on all pods for TaskD communication.

... 
metadata:  
   labels:  
     ... 
     subHealthyStrategy: "hotSwitch"
...
        spec: 
...
           containers: 
... 
             ports:                           
                - containerPort: 9601          
                  name: taskd-port
...

Adapting the Training Script

Modify the training script so that it starts TaskD Manager once the distributed environment has been initialized and the global rank is available.

Create a manager.py file with the following content and save it to the directory from which the training script is invoked.

import os

from taskd.api import init_taskd_manager, start_taskd_manager

job_id = os.getenv("MINDX_TASK_ID")  # Job ID, read from the MINDX_TASK_ID environment variable
node_nums = XX         # Total number of nodes (replace XX with your value)
proc_per_node = XX     # Number of training processes on each node (replace XX with your value)

init_taskd_manager({"job_id": job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
start_taskd_manager()

For details about the parameters used in manager.py, see the description of def init_taskd_manager(config: dict) -> bool.
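
If you prefer not to hard-code the node and process counts, manager.py could read them from the environment and validate them before calling TaskD. The following is a minimal sketch under stated assumptions: NODE_NUMS and PROC_PER_NODE are hypothetical variable names that you would have to export yourself, and treating a False return value from init_taskd_manager as a failure is an assumption based only on its bool return type.

```python
import os
from typing import Mapping


def build_manager_config(env: Mapping[str, str] = os.environ) -> dict:
    """Build the dict passed to init_taskd_manager from environment values."""
    job_id = env.get("MINDX_TASK_ID")
    if not job_id:
        raise ValueError("MINDX_TASK_ID is not set; cannot identify the job")
    # NODE_NUMS and PROC_PER_NODE are hypothetical names, not defined by TaskD.
    node_nums = int(env["NODE_NUMS"])
    proc_per_node = int(env["PROC_PER_NODE"])
    if node_nums < 1 or proc_per_node < 1:
        raise ValueError("node_nums and proc_per_node must be positive integers")
    return {"job_id": job_id, "node_nums": node_nums, "proc_per_node": proc_per_node}


def run_manager(config: dict) -> None:
    """Start TaskD Manager; taskd is imported lazily so the helper above stays testable."""
    from taskd.api import init_taskd_manager, start_taskd_manager

    # Assumption: a False return value means that initialization failed.
    if not init_taskd_manager(config):
        raise RuntimeError("init_taskd_manager returned False")
    start_taskd_manager()


# Demonstration with an explicit mapping instead of the real environment.
example = build_manager_config(
    {"MINDX_TASK_ID": "job-123", "NODE_NUMS": "2", "PROC_PER_NODE": "8"}
)
print(example)
```

In a real manager.py, the last lines would be replaced by run_manager(build_manager_config()).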

Add the following code to the training script to start TaskD Manager.
```
export TASKD_PROCESS_ENABLE="on" 
 
# PyTorch
if [[ "${RANK}" == 0 ]]; then
    export MASTER_ADDR=${POD_IP} 
    python manager.py &            # Adjust the path to manager.py to match the current working directory.
fi 
      
torchrun ...
 
# MindSpore
if [[ "${MS_SCHED_HOST}" == "${POD_IP}" ]]; then 
    python manager.py &   # Adjust the path to manager.py to match the current working directory.
fi 
      
msrun ...
```
If the error message "the libtaskd.so has not been loaded" is displayed during training, set the LD_PRELOAD environment variable in the training script. This environment variable makes the system load the specified .so files before any other libraries. Example:
```
export LD_PRELOAD=/usr/local/Ascend/cann/lib64/libmspti.so:/usr/local/lib/python3.10/dist-packages/taskd/python/cython_api/libs/libtaskd.so
```
- libmspti.so: This .so file is provided by MindStudio and integrated in the CANN package. The default installation path is /usr/local/Ascend/cann/lib64/libmspti.so.
- libtaskd.so: This .so file is provided by TaskD. After the whl package is installed, its path is <TaskD installation path>/taskd/python/cython_api/libs/libtaskd.so.
  You can run the following command to query the TaskD installation path. The Location field in the command output is the installation path.
  
  pip show taskd
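
To avoid typing the paths by hand, the LD_PRELOAD value can be assembled from the Location field. The following is a minimal sketch: the helper names parse_pip_location and build_ld_preload are hypothetical, and the libmspti.so default is the default installation path given above, which may differ on your system.

```python
import posixpath

# Default libmspti.so path from the CANN package; adjust it if CANN is installed elsewhere.
DEFAULT_MSPTI = "/usr/local/Ascend/cann/lib64/libmspti.so"
# Location of libtaskd.so relative to the TaskD installation path.
TASKD_LIB_SUFFIX = "taskd/python/cython_api/libs/libtaskd.so"


def parse_pip_location(pip_show_output: str) -> str:
    """Return the value of the Location field from 'pip show' output."""
    for line in pip_show_output.splitlines():
        if line.startswith("Location:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no Location field found in pip show output")


def build_ld_preload(taskd_location: str, mspti_path: str = DEFAULT_MSPTI) -> str:
    """Join the two library paths with ':' as LD_PRELOAD expects."""
    taskd_lib = posixpath.join(taskd_location, TASKD_LIB_SUFFIX)
    return ":".join([mspti_path, taskd_lib])


# Hypothetical pip show output, shortened to the relevant fields.
sample_output = "Name: taskd\nLocation: /usr/local/lib/python3.10/dist-packages\n"
ld_preload = build_ld_preload(parse_pip_location(sample_output))
print(f'export LD_PRELOAD={ld_preload}')
```

The printed line can then be copied into the training script, before the torchrun or msrun command.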
