Container Manager
Perform the following steps on the node where Container Manager is deployed to verify its installation status:
- Check the component service status, which should be active (running).
systemctl status container-manager.service
Command output:
● container-manager.service - Ascend container manager
   Loaded: loaded (/etc/systemd/system/container-manager.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2025-11-26 20:56:50 UTC; 16s ago
  Process: 41459 ExecStart=/bin/bash -c container-manager run -ctrStrategy ringRecover -logPath=/var/log/mindx-dl/container-manager/container-manager.log >/dev/null 2>&1 & (code=exited, status=0/SUCCESS)
 Main PID: 41464 (container-manag)
    Tasks: 10 (limit: 629145)
   Memory: 13.3M
   CGroup: /system.slice/container-manager.service
           └─41464 /home/container-manager/container-manager run -ctrStrategy ringRecover
...
If a message similar to the following is displayed, you can ignore it. A possible cause is that the IP address and subnet mask of the RoCE NIC are not configured.
[dsmi_common_interface.c:1017][ascend][curpid:244135,244135][drv][dmp][dsmi_get_device_ip_address]devid 0 dsmi_cmd_get_device_ip_address return 1 error!
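For scripted checks, the status step above can be approximated with `systemctl is-active`, which prints a one-word state and returns a non-zero exit code when the unit is not active. The following is a minimal sketch only: the helper names `state_is_active` and `check_unit_active` are hypothetical and are not part of Container Manager.

```shell
#!/bin/sh
# Sketch: report whether a systemd unit is in the expected "active" state.
# Both function names are hypothetical helpers, not Container Manager commands.

# Interpret the one-word state printed by `systemctl is-active`.
state_is_active() {
    [ "$1" = "active" ]
}

check_unit_active() {
    unit="$1"
    # `systemctl is-active` prints the state (active, inactive, failed, ...)
    # and exits non-zero when the unit is not active, hence the `|| true`.
    state="$(systemctl is-active "$unit" 2>/dev/null)" || true
    if state_is_active "$state"; then
        echo "$unit: active"
    else
        echo "$unit: ${state:-unknown}" >&2
        return 1
    fi
}

# Usage (run on the node where Container Manager is deployed):
#   check_unit_active container-manager.service
```

The function returns 0 only when the unit is active, so it can be used directly in an `if` statement or a deployment script.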
- View component logs.
cat /var/log/mindx-dl/container-manager/container-manager.log
Command output (using an Atlas 800I A3 SuperPoD server as an example):
[INFO] 2025/11/25 22:46:59.007163 1 hwlog/api.go:108 container-manager.log's logger init success
[INFO] 2025/11/25 22:46:59.007288 1 command/run.go:150 init log success
[INFO] 2025/11/25 22:46:59.007506 1 devmanager/devmanager.go:134 get card list from dcmi reset timeout is 60
[INFO] 2025/11/25 22:46:59.250103 1 devmanager/devmanager.go:142 deviceManager get cardList is [0 1 2 3 4 5 6 7], cardList length equal to cardNum: 8
[INFO] 2025/11/25 22:46:59.250267 1 devmanager/devmanager.go:171 the dcmi version is 25.5.0.b030
[INFO] 2025/11/25 22:46:59.250405 1 devmanager/devmanager.go:235 chipName: Ascend910, devType: Ascend910A3
...
If the following information is displayed, the component is running properly:
...
[INFO] 2025/11/25 22:46:59.289352 1 devmgr/workflow.go:57 init module <hwDev manager> success
[INFO] 2025/11/25 22:46:59.293773 1 app/config.go:40 load fault config from faultCode.json success
[INFO] 2025/11/25 22:46:59.293866 1 app/workflow.go:50 init module <fault manager> success
[INFO] 2025/11/25 22:46:59.293901 1 app/workflow.go:76 init module <container controller> success
[INFO] 2025/11/25 22:46:59.293930 1 app/workflow.go:64 init module <reset-manager> success
[INFO] 2025/11/25 22:46:59.315101 378 devmgr/hwdevmgr.go:365 subscribe device fault event success
...
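The log check can also be scripted by searching the log file for the startup messages shown in the sample output. The sketch below makes two assumptions: the helper name `check_startup_markers` is hypothetical, and the marker strings are copied from the example above, so they may differ on other versions or hardware models.

```shell
#!/bin/sh
# Sketch: confirm that the startup messages from the sample output
# appear in a given log file. check_startup_markers is a hypothetical helper.

check_startup_markers() {
    log_file="$1"
    missing=0
    # Read one expected marker per line from the here-document below.
    while IFS= read -r marker; do
        # -F: match as a fixed string; -q: quiet, only the exit status matters.
        if ! grep -Fq -- "$marker" "$log_file"; then
            echo "missing: $marker" >&2
            missing=1
        fi
    done <<'EOF'
init module <hwDev manager> success
init module <fault manager> success
init module <container controller> success
init module <reset-manager> success
subscribe device fault event success
EOF
    return "$missing"
}

# Usage:
#   check_startup_markers /var/log/mindx-dl/container-manager/container-manager.log
```

The function returns 0 when every marker is found and 1 otherwise, printing each missing marker to standard error so that a failed check shows which module did not report success.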