Container Manager
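On the node where the Container Manager component is deployed, perform the following steps to verify that the component is installed correctly.
- Check the status of the component service. The status should be active (running).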
systemctl status container-manager.service
Example output:
● container-manager.service - Ascend container manager
   Loaded: loaded (/etc/systemd/system/container-manager.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2025-11-26 20:56:50 UTC; 16s ago
  Process: 41459 ExecStart=/bin/bash -c container-manager run -ctrStrategy ringRecover -logPath=/var/log/mindx-dl/container-manager/container-manager.log >/dev/null 2>&1 & (code=exited, status=0/SUCCESS)
 Main PID: 41464 (container-manag)
    Tasks: 10 (limit: 629145)
   Memory: 13.3M
   CGroup: /system.slice/container-manager.service
           └─41464 /home/container-manager/container-manager run -ctrStrategy ringRecover ...
- View the component log.
cat /var/log/mindx-dl/container-manager/container-manager.log
The following output uses an Atlas 800I A3 SuperPoD server as an example:
[INFO] 2025/11/25 22:46:59.007163 1 hwlog/api.go:108 container-manager.log's logger init success
[INFO] 2025/11/25 22:46:59.007288 1 command/run.go:150 init log success
[INFO] 2025/11/25 22:46:59.007506 1 devmanager/devmanager.go:134 get card list from dcmi reset timeout is 60
[INFO] 2025/11/25 22:46:59.250103 1 devmanager/devmanager.go:142 deviceManager get cardList is [0 1 2 3 4 5 6 7], cardList length equal to cardNum: 8
[INFO] 2025/11/25 22:46:59.250267 1 devmanager/devmanager.go:171 the dcmi version is 25.5.0.b030
[INFO] 2025/11/25 22:46:59.250405 1 devmanager/devmanager.go:235 chipName: Ascend910, devType: Ascend910A3
...
If the following messages appear in the log, the component is running normally.
...
[INFO] 2025/11/25 22:46:59.289352 1 devmgr/workflow.go:57 init module <hwDev manager> success
[INFO] 2025/11/25 22:46:59.293773 1 app/config.go:40 load fault config from faultCode.json success
[INFO] 2025/11/25 22:46:59.293866 1 app/workflow.go:50 init module <fault manager> success
[INFO] 2025/11/25 22:46:59.293901 1 app/workflow.go:76 init module <container controller> success
[INFO] 2025/11/25 22:46:59.293930 1 app/workflow.go:64 init module <reset-manager> success
[INFO] 2025/11/25 22:46:59.315101 378 devmgr/hwdevmgr.go:365 subscribe device fault event success
...
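The two manual checks above can also be scripted. The following is a minimal sketch, not part of Container Manager: the helper names status_is_running and log_has_init_markers are hypothetical, and the marker strings are copied from the sample log above, so they are an assumption that may not hold for other versions or hardware.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical, not part of Container Manager.

# Succeeds when the `systemctl status` text on stdin reports active (running).
status_is_running() {
    grep -q 'Active: active (running)'
}

# Succeeds when the log file given as $1 contains every startup message
# listed below (strings taken from the sample log; assumed, not guaranteed).
# On failure, prints the first missing message and returns 1.
log_has_init_markers() {
    log_file=$1
    for marker in \
        'init module <hwDev manager> success' \
        'init module <fault manager> success' \
        'init module <container controller> success' \
        'init module <reset-manager> success' \
        'subscribe device fault event success'
    do
        # -F matches the marker as a fixed string, not as a regular expression.
        if ! grep -qF -- "$marker" "$log_file"; then
            echo "missing: $marker"
            return 1
        fi
    done
    return 0
}

# Example use on the node (commented out so that sourcing this file has no side effects):
#   systemctl status container-manager.service | status_is_running && echo "service is running"
#   log_has_init_markers /var/log/mindx-dl/container-manager/container-manager.log && echo "startup messages found"
```

If only the running state is needed, systemctl is-active container-manager.service prints active for a running unit and may be simpler than parsing the status text.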
