本章节不适用的场景:用户对旧版本MindCluster集群调度组件的源代码(不含配置文件)进行了修改,请分析版本代码差异后再进行升级。
请根据实际安装场景,选择相应的组件进行检查。
kubectl get pods -A
1 2 | NAMESPACE NAME READY STATUS RESTARTS AGE default ubuntu-pod 1/1 Running 32 (118m ago) 3d18h ... |
kubectl delete -f xxx.yaml # xxx表示任务YAML的名称,请根据实际情况填写
kubectl edit cm -n cluster-system pingmesh-config
docker run -it {训练镜像名称}:tag /bin/bash pip show mindx-elastic
回显如下,表示此镜像中已安装Elastic Agent组件。
1 2 3 4 5 6 7 8 9 10 | Name: mindx_elastic Version: 7.0.rc1 Summary: Ascend MindX Elastic is a new library for fault tolerance training. Home-page: Author: Author-email: License: Location: /usr/local/python3/lib/python3.8/site-packages Requires: Required-by: |
docker run -it {训练镜像名称}:tag /bin/bash pip show taskd
回显如下,表示镜像中已安装taskd组件。
Name: taskd Version: 7.0rc1 Summary: Ascend MindCluster taskd is a new library for training management Home-page: UNKNOWN Author: Author-email: License: UNKNOWN Location: /usr/local/python3/lib/python3.8/site-packages Requires: grpcio, protobuf, pyOpenSSL, torch, torch-npu Required-by: