Configuring Dying Gasp Checkpoint Saving

This section describes how to save dying gasp checkpoints. For more details, see Saving Dying Gasp Checkpoints.

Building an Image

Add the following commands to the Dockerfile used to build the container image.

... 
# Adaptation packages for MindCluster lossless resumable training
RUN pip3 install $TASKD_WHL 
RUN pip3 install $MINDIO_TTP_PKG 

# (Optional) Enable graceful fault tolerance, pod-level rescheduling, or process-level rescheduling.
RUN sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
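
To make the effect of the sed command above concrete, the following Python sketch reproduces it on an in-memory string: the TaskD patch import is inserted before every line that contains "import os". The helper name insert_taskd_patch and the sample text are illustrative only and are not part of TaskD or PyTorch.

```python
PATCH_LINE = "import taskd.python.adaptor.patch"


def insert_taskd_patch(source: str) -> str:
    """Mimic `sed '/import os/i <PATCH_LINE>'` on a string.

    Hypothetical helper for illustration; the real build step edits
    torch/distributed/run.py in place with sed.
    """
    patched = []
    for line in source.splitlines():
        # The sed pattern /import os/ is unanchored, so any line that
        # contains the text gets the patch line inserted before it.
        if "import os" in line:
            patched.append(PATCH_LINE)
        patched.append(line)
    return "\n".join(patched)


sample = "import logging\nimport os\nimport sys"
print(insert_taskd_patch(sample))
```

Like the sed command, this sketch is not idempotent: applying it twice inserts the patch line twice, so the patch step should run only once per image build.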

Preparing a Job YAML File

To enable process-level recovery, add the following fields to the training job YAML file. The recover-strategy annotation specifies the policy used to recover training processes; setting it to "dump" enables dying gasp checkpoint saving. Under ports, add ttp-port (8000) and taskd-port (9601, used for TaskD communication). The example also sets the TTP_PORT environment variable to 8000. Example:

...
metadata:
  labels:
    ...
  ...
  annotations:
    ...
    recover-strategy: "dump"       # Recovery policy. dump indicates that dying gasp saving is enabled.
  ...

...
spec:
  replicaSpecs:
    Master:
      template:
        spec:
          containers:
            env:
              - name: TTP_PORT
                value: "8000"
            args: […]
            ports:
              - containerPort: 8000
                name: ttp-port
              - containerPort: 9601
                name: taskd-port
    ...
    Worker:
      template:
        spec:
          containers:
            env:
              - name: TTP_PORT
                value: "8000"
            args: […]
            ports:
              - containerPort: 8000
                name: ttp-port
              - containerPort: 9601
                name: taskd-port
...
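
Before submitting the job, it can help to confirm that every field listed above is present. The following Python sketch is one way to do that, assuming the job YAML has already been parsed into a dict and that containers is a list, as in a standard pod template. The helper check_dump_job and the sample job are hypothetical and are not part of MindCluster; the check expects exactly the values shown in the example above.

```python
REQUIRED_PORTS = {"ttp-port": 8000, "taskd-port": 9601}


def check_dump_job(job: dict) -> list:
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper: assumes the job YAML is already parsed into a dict.
    """
    problems = []
    annotations = job.get("metadata", {}).get("annotations", {})
    if annotations.get("recover-strategy") != "dump":
        problems.append('annotation recover-strategy is not "dump"')

    replica_specs = job.get("spec", {}).get("replicaSpecs", {})
    for role, replica in replica_specs.items():
        pod_spec = replica.get("template", {}).get("spec", {})
        for container in pod_spec.get("containers", []):
            # Map env names to values and port names to port numbers.
            env = {e.get("name"): e.get("value") for e in container.get("env", [])}
            ports = {p.get("name"): p.get("containerPort") for p in container.get("ports", [])}
            if env.get("TTP_PORT") != "8000":
                problems.append(f'{role}: TTP_PORT is not "8000"')
            for name, number in REQUIRED_PORTS.items():
                if ports.get(name) != number:
                    problems.append(f"{role}: port {name} ({number}) is missing")
    return problems


def _sample_container() -> dict:
    # Illustrative container matching the fields in the example above.
    return {
        "env": [{"name": "TTP_PORT", "value": "8000"}],
        "ports": [
            {"containerPort": 8000, "name": "ttp-port"},
            {"containerPort": 9601, "name": "taskd-port"},
        ],
    }


example_job = {
    "metadata": {"annotations": {"recover-strategy": "dump"}},
    "spec": {
        "replicaSpecs": {
            "Master": {"template": {"spec": {"containers": [_sample_container()]}}},
            "Worker": {"template": {"spec": {"containers": [_sample_container()]}}},
        }
    },
}

print(check_dump_job(example_job))
```

The function reports problems per replica role, so a port missing only under Worker is easy to spot.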

Adapting the Training Script

After the distributed environment is initialized and the global rank is available, start TaskD Manager from the training script.

Create a manager.py file with the following content and save it in the directory from which the training script is invoked.

import os

from taskd.api import init_taskd_manager, start_taskd_manager

# Job ID, read from the MINDX_TASK_ID environment variable
job_id = os.getenv("MINDX_TASK_ID")
node_nums = XX         # Total number of nodes (replace XX with your value)
proc_per_node = XX     # Number of training processes on each node (replace XX with your value)

# Initialize TaskD Manager with the job configuration, then start it
init_taskd_manager({"job_id": job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
start_taskd_manager()

For details about the parameters in the manager.py file, see the API description of init_taskd_manager(config: dict) -> bool.
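
Rather than hard-coding the node and process counts, the configuration dict could be derived from environment variables. The sketch below assumes that NNODES and GPUS_PER_NODE (names borrowed from the startup script example in this section) are exported before manager.py runs; whether they are available at that point depends on your scripts. The helper build_manager_config is hypothetical and only builds the dict; it does not call TaskD.

```python
import os


def build_manager_config(env=None) -> dict:
    """Build the dict passed to init_taskd_manager from environment variables.

    Assumption: NNODES and GPUS_PER_NODE are set in the environment;
    a KeyError is raised if either is missing.
    """
    env = os.environ if env is None else env
    return {
        "job_id": env.get("MINDX_TASK_ID"),
        "node_nums": int(env["NNODES"]),
        "proc_per_node": int(env["GPUS_PER_NODE"]),
    }


# Illustrative values only; a real job reads them from os.environ.
example_env = {"MINDX_TASK_ID": "job-123", "NNODES": "2", "GPUS_PER_NODE": "8"}
print(build_manager_config(example_env))
```

In manager.py, the returned dict would take the place of the literal dict passed to init_taskd_manager.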

Add the following code to the training script to start TaskD Manager.

export TASKD_PROCESS_ENABLE="on" 
 
# PyTorch
if [[ "${RANK}" == 0 ]]; then
    export MASTER_ADDR=${POD_IP} 
    python manager.py &            # Adjust the path to manager.py to match the current working directory.
fi 
      
torchrun ...

Add the --max_restarts parameter to the startup script (for example, train_start.sh).
```
... 
   logger "server id is: ""${server_id}" 
   if [ "${framework}" == "PyTorch" ]; then 
     get_env_for_pytorch_multi_node_job 
     DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT  --max_restarts 32767" 
 ...
```
--max_restarts sets the maximum number of faults that can be triggered in the container. The value must be an integer. Once this limit is exceeded, the PyTorch training process exits immediately. If the parameter is not set, the default value 32767 is used.

If the error message "the libtaskd.so has not been loaded" is displayed during training, set the LD_PRELOAD environment variable in the training script. LD_PRELOAD makes the system load the specified .so files in advance. Example:

export LD_PRELOAD=/usr/local/Ascend/cann/lib64/libmspti.so:/usr/local/lib/python3.10/dist-packages/taskd/python/cython_api/libs/libtaskd.so

libmspti.so: This .so file is provided by MindStudio and integrated in the CANN package. The default installation path is /usr/local/Ascend/cann/lib64/libmspti.so.
libtaskd.so: This .so file is provided by TaskD. After the whl package is installed, the path is <TaskD installation path>/taskd/python/cython_api/libs/libtaskd.so.
To find the TaskD installation path, run the following command. The Location field in the command output is the target path.
```
pip show taskd
```
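
As a sketch of how the two paths combine, the following Python helper assembles the LD_PRELOAD value from the Location reported by pip show taskd. The function name build_ld_preload is hypothetical, and the default libmspti.so path is the one given above; adjust both paths if your installation differs.

```python
import posixpath

DEFAULT_MSPTI = "/usr/local/Ascend/cann/lib64/libmspti.so"


def build_ld_preload(taskd_location: str, mspti_path: str = DEFAULT_MSPTI) -> str:
    """Join the two library paths in the colon-separated form LD_PRELOAD expects.

    taskd_location is the Location value printed by `pip show taskd`.
    """
    libtaskd = posixpath.join(
        taskd_location, "taskd", "python", "cython_api", "libs", "libtaskd.so"
    )
    return f"{mspti_path}:{libtaskd}"


# Location taken from the export example above.
print(build_ld_preload("/usr/local/lib/python3.10/dist-packages"))
```

The printed string can be used as the value of export LD_PRELOAD=... in the training script.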
