Replacing Elastic Agent with TaskD

Elastic Agent has reached end of life (EOL). This section describes how to replace it with TaskD.

Prerequisites

  • You have checked the upgrade environment.
  • Elastic Agent has been installed in the training image.

Procedure

  1. Download the latest TaskD installation package. For details, see Obtaining Software Packages.
  2. After the download is complete, go to the directory where the installation package is stored and decompress the package.
  3. Run the ls -l command. Information similar to the following is displayed:
    -rw-r--r-- 1 root root 6134726 Nov 10 10:32 Ascend-mindxdl-taskd_{version}_linux-aarch64.zip
    -r-------- 1 root root 6205642 Nov  5 23:38 taskd-{version}-py3-none-linux_aarch64.whl
  4. In a container started from the existing training image, uninstall Elastic Agent and install TaskD.
    1. Run the training image. Example:
      docker run -it -v /host/packagepath:/container/packagepath training_image:latest /bin/bash
    2. Uninstall Elastic Agent.
      pip uninstall mindx-elastic -y

      If the following information is displayed, the uninstallation is successful:

      Successfully uninstalled mindx_elastic-{version}
    3. Delete the code for enabling Elastic Agent.
      sed -i '/mindx_elastic.api/d' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py

      (Optional) Open run.py and check that the embedded Elastic Agent code has been deleted.

      vi $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
    4. Install TaskD.
      pip install taskd-{version}-py3-none-linux_aarch64.whl

      The installation is successful if the following information is displayed:

      Successfully installed taskd-{version}

      Enable TaskD by inserting the TaskD patch import before the import os line in run.py:

      sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py
    5. After TaskD is installed, save the container as a new image. On the host, query the ID of the running container:
      docker ps

      Command output:

      CONTAINER ID   IMAGE                  COMMAND                  CREATED        STATUS        PORTS     NAMES 
      bb118ca00041    f76142d63d3a           "/bin/bash -c 'sleep…"   2 hours ago    Up 2 hours              k8s_ascend_default-last-test-deepseek2-60b

      Commit the container as the new training image. Use a tag that differs from that of the old image. Example:

      docker commit bb118ca00041 newimage:latest
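
The run.py edits made in this step can also be checked with a script rather than by opening the file in vi. The following is a minimal sketch, not part of TaskD: the function names check_run_py and check_file are hypothetical, and the two strings it looks for are taken from the sed commands above. It assumes the sed commands leave the patch import on its own line.

```python
from pathlib import Path
from typing import List

# Pattern deleted by the first sed command (assumed to identify all Elastic Agent code).
ELASTIC_MARKER = "mindx_elastic.api"
# Line inserted by the second sed command.
TASKD_IMPORT = "import taskd.python.adaptor.patch"


def check_run_py(source: str) -> List[str]:
    """Return the problems found in the text of run.py (empty list if none)."""
    lines = [line.strip() for line in source.splitlines()]
    problems = []
    if any(ELASTIC_MARKER in line for line in lines):
        problems.append("Elastic Agent code is still present")
    if TASKD_IMPORT not in lines:
        problems.append("TaskD patch import is missing")
    elif "import os" in lines and lines.index(TASKD_IMPORT) > lines.index("import os"):
        problems.append("TaskD patch import comes after 'import os'")
    return problems


def check_file(path: str) -> List[str]:
    """Read the file at path and run the same checks on its content."""
    return check_run_py(Path(path).read_text(encoding="utf-8"))
```

To use it inside the container, before the container is committed, call check_file with the run.py path that the sed commands above operate on. An empty list means both edits are in place.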
  5. Check whether Elastic Agent has been replaced with TaskD and whether the component status is normal. For details, see Check TaskD.
  6. Modify the training script (for example, train_start.sh) and the job YAML file.
    1. Create a manager.py file as follows and save it to the directory where the training script is called.
      import os

      from taskd.api import init_taskd_manager, start_taskd_manager

      job_id = os.getenv("MINDX_TASK_ID")
      node_nums = XX        # Total number of nodes. Replace XX with the actual value.
      proc_per_node = XX    # Number of training processes on each node. Replace XX with the actual value.

      init_taskd_manager({"job_id": job_id, "node_nums": node_nums, "proc_per_node": proc_per_node})
      start_taskd_manager()

      For details about the parameters in the manager.py file, see def init_taskd_manager(config:dict) -> bool:.
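
Because manager.py is started in the background by the training script, a missing MINDX_TASK_ID or an unset node count can easily go unnoticed. The following is a minimal sketch of a manager.py variant that validates the values first. The helper build_manager_config and the main function are hypothetical and not part of the TaskD API. Treating a False return value from init_taskd_manager as a failure is an assumption based only on the bool return type in the reference title above.

```python
import os
from typing import Dict, Optional, Union


def build_manager_config(node_nums: int, proc_per_node: int,
                         job_id: Optional[str] = None) -> Dict[str, Union[str, int]]:
    """Build the dict passed to init_taskd_manager, failing early on bad values."""
    if job_id is None:
        job_id = os.getenv("MINDX_TASK_ID")
    if not job_id:
        raise ValueError("MINDX_TASK_ID is not set")
    for name, value in (("node_nums", node_nums), ("proc_per_node", proc_per_node)):
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return {"job_id": job_id, "node_nums": node_nums, "proc_per_node": proc_per_node}


def main(node_nums: int, proc_per_node: int) -> None:
    # Imported here so that the helper above can be used without TaskD installed.
    from taskd.api import init_taskd_manager, start_taskd_manager

    config = build_manager_config(node_nums, proc_per_node)
    # Assumption: a False return value means that initialization failed.
    if not init_taskd_manager(config):
        raise SystemExit("init_taskd_manager returned False")
    start_taskd_manager()
```

Call main with the total number of nodes and the number of training processes per node for the job, for example from an if __name__ == "__main__": block at the end of the file.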

    2. Add the following code to the training script to start TaskD Manager.
      export TASKD_PROCESS_ENABLE="on" 
      # The PyTorch framework is used as an example.
      if [[ "${RANK}" == 0 ]]; then
          export MASTER_ADDR=${POD_IP} 
          python manager.py &            # Adjust the path to manager.py based on the current working directory.
      fi 
            
      torchrun ...
    3. In the job YAML file, add port 9601 for TaskD communication to all pods.
      ...
              spec:
      ...
                containers:
      ...
                  ports:
                    - containerPort: 9601
                      name: taskd-port
      ...
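
After the job has started, it can be useful to confirm from another pod that port 9601 is reachable on the pod that runs TaskD Manager. The following is a minimal sketch using only the Python standard library. It assumes that TaskD communication on this port uses TCP, which this section does not state. The function name port_is_open is hypothetical.

```python
import socket

TASKD_PORT = 9601  # Port added to the job YAML file above.


def port_is_open(host: str, port: int = TASKD_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, port_is_open(os.environ["MASTER_ADDR"]) could be called on a worker node, given that the training script above exports MASTER_ADDR on the node where manager.py runs. How MASTER_ADDR is set on the other nodes is not covered in this section.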