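Upgrading Container Manager
Upgrade the component by replacing the Container Manager binary directly on the physical machine.
- Log in as the root user to the node where the Container Manager component is deployed.
- Upload the Container Manager software package you obtained to any directory on the server (for example, "/tmp/container-manager").
- Go to the "/tmp/container-manager" directory and extract the package.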
unzip Ascend-mindxdl-container-manager_{version}_linux-{arch}.zip
{version} is the version number of the software package; {arch} is the CPU architecture.
- Run the following commands in order to upgrade the Container Manager component.
# Stop the Container Manager system service and delete the existing Container Manager binary
systemctl stop container-manager.service
chattr -i /usr/local/bin/container-manager
rm -f /usr/local/bin/container-manager
# Copy the new binary from the extracted package to replace the old Container Manager binary
cp /tmp/container-manager/container-manager /usr/local/bin
chmod 500 /usr/local/bin/container-manager
# Restart the Container Manager system service
systemctl daemon-reload
systemctl start container-manager.service
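The commands above can also be wrapped in a single function so that a failed step stops the sequence instead of continuing with a half-replaced binary. The following is a minimal sketch, not part of the product: the names `run`, `upgrade_container_manager`, `DRY_RUN`, `SRC_DIR`, and `BIN` are illustrative assumptions, and the paths are the example paths used in this procedure.

```shell
#!/bin/bash
# Sketch only: wraps the documented upgrade commands in one function.
# run, upgrade_container_manager, DRY_RUN, SRC_DIR and BIN are illustrative names.

SRC_DIR="${SRC_DIR:-/tmp/container-manager}"    # directory where the package was extracted
BIN="${BIN:-/usr/local/bin/container-manager}"  # installed binary to be replaced
SERVICE="container-manager.service"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_container_manager() {
    # In a real run, refuse to remove the old binary if the new one is missing.
    if [ "${DRY_RUN:-0}" != "1" ] && [ ! -f "$SRC_DIR/container-manager" ]; then
        echo "new binary not found in $SRC_DIR" >&2
        return 1
    fi
    # Each step runs only if the previous one succeeded.
    run systemctl stop "$SERVICE" &&
    run chattr -i "$BIN" &&
    run rm -f "$BIN" &&
    run cp "$SRC_DIR/container-manager" "$BIN" &&
    run chmod 500 "$BIN" &&
    run systemctl daemon-reload &&
    run systemctl start "$SERVICE"
}
```

Calling `DRY_RUN=1 upgrade_container_manager` prints the seven commands without executing them, which allows the sequence to be reviewed first; calling `upgrade_container_manager` as root performs the upgrade.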
- Verify the upgrade status of the Container Manager component.
- Check the status of the component service. The status should be active (running).
systemctl status container-manager.service
Example output:
● container-manager.service - Ascend container manager
   Loaded: loaded (/etc/systemd/system/container-manager.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2025-11-26 20:56:50 UTC; 16s ago
  Process: 41459 ExecStart=/bin/bash -c container-manager run -ctrStrategy ringRecover -logPath=/var/log/mindx-dl/container-manager/container-manager.log >/dev/null 2>&1 & (code=exited, status=0/SUCCESS)
 Main PID: 41464 (container-manag)
    Tasks: 10 (limit: 629145)
   Memory: 13.3M
   CGroup: /system.slice/container-manager.service
           └─41464 /home/container-manager/container-manager run -ctrStrategy ringRecover
...
- View the component log.
cat /var/log/mindx-dl/container-manager/container-manager.log
The following example output is from an Atlas 800I A3 SuperPoD server:
[INFO] 2025/11/25 22:46:59.007163 1 hwlog/api.go:108 container-manager.log's logger init success
[INFO] 2025/11/25 22:46:59.007288 1 command/run.go:150 init log success
[INFO] 2025/11/25 22:46:59.007506 1 devmanager/devmanager.go:134 get card list from dcmi reset timeout is 60
[INFO] 2025/11/25 22:46:59.250103 1 devmanager/devmanager.go:142 deviceManager get cardList is [0 1 2 3 4 5 6 7], cardList length equal to cardNum: 8
[INFO] 2025/11/25 22:46:59.250267 1 devmanager/devmanager.go:171 the dcmi version is 25.5.0.b030
[INFO] 2025/11/25 22:46:59.250405 1 devmanager/devmanager.go:235 chipName: Ascend910, devType: Ascend910A3
...
If the following messages appear, the component is running normally.
...
[INFO] 2025/11/25 22:46:59.289352 1 devmgr/workflow.go:57 init module <hwDev manager> success
[INFO] 2025/11/25 22:46:59.293773 1 app/config.go:40 load fault config from /home/faultCode.json success
[INFO] 2025/11/25 22:46:59.293866 1 app/workflow.go:50 init module <fault manager> success
[INFO] 2025/11/25 22:46:59.293901 1 app/workflow.go:76 init module <container controller> success
[INFO] 2025/11/25 22:46:59.293930 1 app/workflow.go:64 init module <reset-manager> success
[INFO] 2025/11/25 22:46:59.315101 378 devmgr/hwdevmgr.go:365 subscribe device fault event success
...
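Instead of reading the log by eye, the check for these startup messages can be scripted. The sketch below is an illustration under stated assumptions: `check_startup_log` is a hypothetical helper name, and the marker strings are copied from the example output above, so they may differ on other hardware models or software versions.

```shell
#!/bin/bash
# Sketch only: report which expected startup messages are absent from the log.
# check_startup_log is an illustrative name; markers come from the sample output.

check_startup_log() {
    local log_file="${1:-/var/log/mindx-dl/container-manager/container-manager.log}"
    local missing=0
    local marker

    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi

    for marker in \
        "init module <hwDev manager> success" \
        "init module <fault manager> success" \
        "init module <container controller> success" \
        "init module <reset-manager> success" \
        "subscribe device fault event success"
    do
        # -F treats the marker as a fixed string, so "<" and ">" need no escaping.
        if ! grep -qF -- "$marker" "$log_file"; then
            echo "missing: $marker" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_startup_log` with no argument reads the default log path used in this procedure and returns 0 only if every marker is present. Because the log may also contain entries written before the upgrade, compare the timestamps of the matching lines with the service start time.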