Upgrade Description

This section describes how to upgrade the MindCluster cluster scheduling components to their latest versions. You can upgrade them in either of the following ways:

  • Full upgrade: Upgrades the binary image file of each component and lets you modify the component configuration files after the upgrade. Cross-version upgrades are supported, for example, from 5.0.x to 7.0.x.
  • Image upgrade: Updates only the binary file of each component. Permissions and startup parameters cannot be changed, and no environment check is required beforehand. The upgrade is limited to components within the same version.
    Table 1 Upgrade methods

    Upgrade Method   Cross-Version Upgrade Supported   Training/Inference Jobs Must Be Stopped   Reference
    Full upgrade     Yes                               Yes                                      Sections 7.1 to 7.5
    Image upgrade    No                                No                                       Section 7.6

    This section does not apply if you have modified the source code (excluding configuration files) of MindCluster cluster scheduling components from a previous version. In such cases, analyze the code differences before performing the upgrade.
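
The choice between the two upgrade modes can be sketched as a small shell helper. This is only an illustration of the rules described above, not part of MindCluster: the function name choose_upgrade_mode and its arguments are hypothetical, and it assumes that "same version" means the first two version fields match (for example, 7.0.1 and 7.0.2), which this section does not define precisely.

```shell
# Hypothetical helper: suggest an upgrade mode from the rules in this section.
# Usage: choose_upgrade_mode CURRENT_VERSION TARGET_VERSION NEEDS_CHANGE
#   NEEDS_CHANGE is "yes" if configuration files, permissions, or startup
#   parameters must be changed as part of the upgrade; otherwise "no".
choose_upgrade_mode() {
    current=$1
    target=$2
    needs_change=${3:-no}

    # Assumption: two versions belong to the "same version" when their
    # first two fields match (5.0.x vs 7.0.x is a cross-version upgrade).
    current_series=$(printf '%s\n' "$current" | cut -d. -f1,2)
    target_series=$(printf '%s\n' "$target" | cut -d. -f1,2)

    if [ "$current_series" != "$target_series" ] || [ "$needs_change" = "yes" ]; then
        # Cross-version upgrades and configuration changes need a full upgrade.
        echo "full"
    else
        # Same version and binary-only update: an image upgrade is enough.
        echo "image"
    fi
}
```

For example, choose_upgrade_mode 5.0.1 7.0.2 no prints full, while choose_upgrade_mode 7.0.1 7.0.2 no prints image.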

Environment Check

Before upgrading, perform the checks below that apply to the components in your installation scenario.

  1. Check whether any job is running. If a job is running, wait until it is complete or stop it before upgrading the MindCluster components.
    1. Run the following command to check whether any job is running:
      kubectl get pods -A
      Command output:
      NAMESPACE        NAME                                       READY   STATUS    RESTARTS         AGE
      default          ubuntu-pod                                 1/1     Running   32 (118m ago)    3d18h ...  
      
    2. Go to the directory where the job YAML file is stored and run the following command to stop the job:
      kubectl delete -f xxx.yaml    # Replace xxx with the name of the job YAML file.
  2. (Optional) Check whether pingmesh UnifiedBus network detection is disabled.
    1. Log in to the environment and go to the directory created when the NodeD package was decompressed.
    2. Run the following command to edit the pingmesh-config file:
      kubectl edit cm -n cluster-system pingmesh-config

      If the following information is displayed, pingmesh UnifiedBus network detection is disabled. In this case, skip the next substep (changing the value of activate).

      Error from server (NotFound): configmaps "pingmesh-config" not found
    3. (Optional) Change the value of activate.
      • If the SuperPoD ID exists in the pingmesh-config file, change the value of activate under the SuperPoD ID to off.
      • If the SuperPoD ID does not exist in the pingmesh-config file, change the value of activate in either of the following ways:
        • Add the SuperPoD information to the configuration file and set activate to off.
        • Delete all SuperPoD information from the pingmesh-config file and set activate under global to off.
  3. Check the installed MindCluster components.
    • (Optional) Check TaskD. Start a container from the training image and check whether TaskD is installed:
      docker run -it {Training image name}:tag /bin/bash
      pip show taskd

      If the following information is displayed, TaskD has been installed in the image.

      Name: taskd
      Version: x.x.x
      Summary: Ascend MindCluster taskd is a new library for training management
      Home-page: UNKNOWN
      Author: 
      Author-email: 
      License: UNKNOWN
      Location: /usr/local/python3/lib/python3.10/site-packages
      Requires: grpcio, protobuf, pyOpenSSL, torch, torch-npu
      Required-by:
    • (Optional) Check other components. Check whether the other components are installed in the cluster by referring to Confirming Component Status.
  4. (Optional) If the MindCluster cluster scheduling components have not been installed, install them by referring to Installation and Deployment. For details about how to install TaskD, see Image Creation.
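
The checks in steps 1 to 3 can be scripted. The following is a minimal sketch, not an official MindCluster tool: the function names list_running_pods, pingmesh_disabled, and taskd_version are hypothetical, and each function only parses command output passed on standard input. Note that list_running_pods reports every pod in the Running state, including system pods, so you still need to decide which of them are training or inference jobs.

```shell
# Print "namespace/name" for every pod whose STATUS column is Running.
# Expects the output of `kubectl get pods -A` (with its header line) on stdin.
list_running_pods() {
    # Columns: NAMESPACE NAME READY STATUS RESTARTS AGE; skip the header row.
    awk 'NR > 1 && $4 == "Running" { print $1 "/" $2 }'
}

# Return 0 if stdin contains the NotFound error for the pingmesh-config
# ConfigMap, which means pingmesh UnifiedBus network detection is disabled.
pingmesh_disabled() {
    grep -q 'configmaps "pingmesh-config" not found'
}

# Print the TaskD version found in `pip show taskd` output on stdin.
# Prints nothing and returns 1 if there is no Version line.
taskd_version() {
    awk -F': ' '$1 == "Version" { print $2; found = 1 } END { exit !found }'
}
```

On a node with cluster access, the functions could be used as follows (the kubectl and pip commands are the ones used in the steps above, except that kubectl get is used instead of kubectl edit for a read-only check):

  kubectl get pods -A | list_running_pods
  kubectl get cm -n cluster-system pingmesh-config 2>&1 | pingmesh_disabled && echo "pingmesh detection is disabled"
  pip show taskd | taskd_version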