Upgrade Description
This section describes how to upgrade the MindCluster cluster scheduling components to their latest versions. You can upgrade them in either of the following ways:
- Full upgrade: This mode upgrades the binary image file of each component and also lets you modify the component configuration files after the upgrade. It supports cross-version upgrades, for example from 5.0.x to 7.0.x.
- Image upgrade: This mode updates only the binary file of each component. It does not support changes to permissions or startup parameters, and no environment check is required before proceeding. It can be used only for upgrades within the same version.
Table 1 Upgrade methods
| Upgrade Method | Cross-Version Upgrade Supported | Whether to Stop Training/Inference Jobs | Reference |
| --- | --- | --- | --- |
| Full upgrade | Yes | Yes | |
| Image upgrade | No | No | Section 7.6 |
This section does not apply if you have modified the source code (excluding configuration files) of MindCluster cluster scheduling components from a previous version. In such cases, analyze the code differences before performing the upgrade.
Environment Check
Before upgrading the components, perform the checks below that apply to your installation scenario.
- Check whether any job is running. If so, wait until the job is complete or stop it in advance, and then upgrade the MindCluster components.
- Run the following command to check whether any job is running:
kubectl get pods -A
Command output:
NAMESPACE NAME READY STATUS RESTARTS AGE
default ubuntu-pod 1/1 Running 32 (118m ago) 3d18h
...
- Go to the directory where the job YAML file is stored and run the following command to stop it:
kubectl delete -f xxx.yaml # xxx indicates the name of the job YAML file. Set it as required.
- (Optional) Check whether pingmesh UnifiedBus network detection is disabled.
- Log in to the environment and go to the directory generated after NodeD decompression.
- Run the following command to edit the pingmesh-config file:
kubectl edit cm -n cluster-system pingmesh-config
If the following information is displayed, pingmesh UnifiedBus network detection is disabled. In this case, skip Step 3.
Error from server (NotFound): configmaps "pingmesh-config" not found
- (Optional) Change the value of activate.
- If the SuperPoD ID exists in the pingmesh-config file, change the value of activate under the SuperPoD ID to off.
- If the SuperPoD ID does not exist in the pingmesh-config file, change the value of activate in either of the following ways:
- Add the SuperPoD information to the configuration file and set activate to off.
- Delete all SuperPoD information from the pingmesh-config file and set activate under global to off.
- Check the installed MindCluster components.
- (Optional) Check TaskD. Access the container and check the TaskD installation status.
docker run -it {Training image name}:tag /bin/bash
pip show taskd
If the following information is displayed, TaskD has been installed in the image.
Name: taskd
Version: x.x.x
Summary: Ascend MindCluster taskd is a new library for training management
Home-page: UNKNOWN
Author:
Author-email:
License: UNKNOWN
Location: /usr/local/python3/lib/python3.10/site-packages
Requires: grpcio, protobuf, pyOpenSSL, torch, torch-npu
Required-by:
- (Optional) Check other components. Check whether other components are installed in a cluster by referring to Confirming Component Status.
- (Optional) If the MindCluster cluster scheduling components have not been installed, install them by referring to Installation and Deployment. For details about how to install TaskD, see Image Creation.
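The checks above can be partly scripted. The following is a minimal sketch, not part of MindCluster: the function names running_pods, pingmesh_config_missing, and taskd_version are hypothetical helpers that only parse the command output formats shown in this section. The usage comments assume that kubectl is configured for the cluster, and they use kubectl get cm instead of kubectl edit cm so that the check stays non-interactive.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the environment check. They read command output
# from stdin and do not call kubectl, docker, or pip themselves.

# Print "namespace/name" for every pod whose STATUS column is Running.
# Input: output of `kubectl get pods -A`. The result also includes system
# pods, so you still need to decide which entries are training/inference jobs.
running_pods() {
    awk 'NR > 1 && $4 == "Running" { print $1 "/" $2 }'
}

# Exit with status 0 if stdin contains the NotFound error that this section
# shows for a missing pingmesh-config ConfigMap, that is, if pingmesh
# UnifiedBus network detection is disabled.
pingmesh_config_missing() {
    grep -q 'configmaps "pingmesh-config" not found'
}

# Print the value of the Version field.
# Input: output of `pip show taskd`, run inside the training image.
taskd_version() {
    awk -F': ' '$1 == "Version" { print $2 }'
}

# Example usage (assumption: kubectl is on PATH and configured):
#   kubectl get pods -A | running_pods
#   kubectl get cm -n cluster-system pingmesh-config 2>&1 \
#       | pingmesh_config_missing && echo "pingmesh detection is disabled"
#   pip show taskd | taskd_version
```

An empty result from running_pods means that no pod is in the Running state. An empty result from taskd_version means that pip did not report a Version field, which is consistent with TaskD not being installed in the image.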