Replacing Elastic Agent with TaskD
Elastic Agent has reached end of life (EOL). This section describes how to replace Elastic Agent with TaskD.
Prerequisites
- You have checked the upgrade environment.
- Elastic Agent has been installed in the training image.
Procedure
- Download the latest version of the TaskD installation package by referring to Obtaining Software Packages.
- After the download is complete, go to the directory where the installation package is stored and decompress it.
- Run the ls -l command. The following information is displayed:
-rw-r--r-- 1 root root 6134726 Nov 10 10:32 Ascend-mindxdl-taskd_{version}_linux-aarch64.zip
-r-------- 1 root root 6205642 Nov  5 23:38 taskd-{version}-py3-none-linux_aarch64.whl
- Uninstall Elastic Agent and install TaskD based on the existing training image.
- Run the training image. Example:
docker run -it -v /host/packagepath:/container/packagepath training_image:latest /bin/bash
- Uninstall Elastic Agent.
pip uninstall mindx-elastic -y
If the following information is displayed, the uninstallation is successful:
Successfully uninstalled mindx_elastic-{version}
- Delete the code for enabling Elastic Agent.
sed -i '/mindx_elastic.api/d' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
(Optional) Check whether the Elastic Agent embedding code has been deleted from the corresponding file.
vi $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
- Install TaskD.
pip install taskd-{version}-py3-none-linux_aarch64.whl
The installation is successful if the following information is displayed:
Successfully installed taskd-{version}
- Enable TaskD.
sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
- After the latest version of TaskD is installed, save the container as the new image.
docker ps
Command output:
CONTAINER ID   IMAGE          COMMAND                  CREATED       STATUS       PORTS   NAMES
bb118ca00041   f76142d63d3a   "/bin/bash -c 'sleep…"   2 hours ago   Up 2 hours           k8s_ascend_default-last-test-deepseek2-60b
Commit the container as the new training container image. Use a tag for the new image that differs from the tag of the old image. Example:
docker commit bb118ca00041 newimage:latest
- Check whether Elastic Agent has been replaced with TaskD, and check whether the component status is normal by referring to Check TaskD.
- Modify the training script (for example, train_start.sh) and job YAML file.
- Create a manager.py file as follows and save it to the directory where the training script is called.
from taskd.api import init_taskd_manager, start_taskd_manager
import os

job_id = os.getenv("MINDX_TASK_ID")
node_nums = XX        # Total number of nodes (set by yourself)
proc_per_node = XX    # Number of training processes on each node (set by yourself)

init_taskd_manager({"job_id": job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
start_taskd_manager()
For details about the parameters in the manager.py file, see def init_taskd_manager(config:dict) -> bool:.
- Add the following code to the training script to start TaskD Manager.
export TASKD_PROCESS_ENABLE="on"
# The PyTorch framework is used as an example.
if [[ "${RANK}" == 0 ]]; then
    export MASTER_ADDR=${POD_IP}
    python manager.py &   # Determined by the current path.
fi
torchrun ...
- In the job YAML file, add port 9601 for TaskD communication to all pods.
...
spec:
  ...
  containers:
    ...
    ports:
    - containerPort: 9601
      name: taskd-port
...
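Because port 9601 must be added to all pods, it can help to check a job definition before submitting it. The following is a minimal sketch, not part of TaskD: it assumes the pod spec has already been loaded into a Python dict (for example with a YAML parser), and the function name containers_missing_port and the example spec are hypothetical.

```python
from typing import Any, Dict, List

TASKD_PORT = 9601  # Port number taken from the procedure above.


def containers_missing_port(pod_spec: Dict[str, Any], port: int = TASKD_PORT) -> List[str]:
    """Return the names of containers in a pod spec that do not expose `port`."""
    missing = []
    for container in pod_spec.get("containers") or []:
        # Collect every containerPort declared by this container.
        exposed = {p.get("containerPort") for p in container.get("ports") or []}
        if port not in exposed:
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Hypothetical pod spec, shaped like the "spec" section of a loaded job YAML.
example_spec = {
    "containers": [
        {"name": "trainer", "ports": [{"containerPort": 9601, "name": "taskd-port"}]},
        {"name": "sidecar", "ports": [{"containerPort": 8080, "name": "http"}]},
    ]
}
print(containers_missing_port(example_spec))  # ['sidecar']
```

An empty list means every container in the spec declares the port. Whether every container or only the training container needs the port depends on the job, so treat the result as a hint.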
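The manager.py file in the procedure leaves node_nums and proc_per_node as values that you fill in by hand. If you prefer to keep those values out of the script, one possible approach is to read them from environment variables. The sketch below only builds the configuration dict. MINDX_TASK_ID comes from the procedure, while NODE_NUMS, PROC_PER_NODE, and the helper name build_manager_config are hypothetical names chosen for this example.

```python
import os
from typing import Mapping, Optional


def build_manager_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the dict passed to init_taskd_manager from environment variables.

    NODE_NUMS and PROC_PER_NODE are hypothetical variable names used only
    in this sketch; set them in the job definition if you adopt this pattern.
    """
    env = os.environ if env is None else env
    job_id = env.get("MINDX_TASK_ID")
    if not job_id:
        raise ValueError("MINDX_TASK_ID is not set")
    node_nums = int(env["NODE_NUMS"])          # Total number of nodes.
    proc_per_node = int(env["PROC_PER_NODE"])  # Training processes per node.
    if node_nums < 1 or proc_per_node < 1:
        raise ValueError("node_nums and proc_per_node must be positive")
    return {"job_id": job_id, "node_nums": node_nums, "proc_per_node": proc_per_node}


# Example with made-up values instead of the real environment.
example_env = {"MINDX_TASK_ID": "job-123", "NODE_NUMS": "2", "PROC_PER_NODE": "8"}
print(build_manager_config(example_env))
```

In manager.py, the result of build_manager_config() would then be passed to init_taskd_manager in place of the literal dict, followed by the call to start_taskd_manager().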
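The two sed commands in the procedure edit torch/distributed/run.py: one deletes lines matching mindx_elastic.api, and the other inserts import taskd.python.adaptor.patch. As an alternative to opening the file in vi, the following minimal sketch checks the text of run.py for both conditions. The function name check_run_py and the sample file contents are hypothetical, and the check uses a plain substring match rather than the sed regular expression.

```python
OLD_MARKER = "mindx_elastic.api"                   # Pattern deleted by the first sed command.
NEW_IMPORT = "import taskd.python.adaptor.patch"   # Line inserted by the second sed command.


def check_run_py(source: str) -> dict:
    """Report whether run.py text still references Elastic Agent and whether TaskD is enabled."""
    lines = source.splitlines()
    return {
        "elastic_agent_removed": not any(OLD_MARKER in line for line in lines),
        "taskd_enabled": any(line.strip() == NEW_IMPORT for line in lines),
    }


# Hypothetical file contents for illustration only.
patched = "import taskd.python.adaptor.patch\nimport os\nimport sys\n"
unpatched = "import os\nimport mindx_elastic.api\n"

print(check_run_py(patched))    # {'elastic_agent_removed': True, 'taskd_enabled': True}
print(check_run_py(unpatched))  # {'elastic_agent_removed': False, 'taskd_enabled': False}
```

To use it on a real image, read the run.py file at the path produced by the pip3 show torch command from the procedure and pass its contents to check_run_py.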