Container Manager

Perform the following steps on the node where Container Manager is deployed to verify its installation status:

  1. Check the component service status, which should be active (running).
    systemctl status container-manager.service

    Command output:

     container-manager.service - Ascend container manager
         Loaded: loaded (/etc/systemd/system/container-manager.service; disabled; vendor preset: enabled)
         Active: active (running) since Wed 2025-11-26 20:56:50 UTC; 16s ago
        Process: 41459 ExecStart=/bin/bash -c container-manager run  -ctrStrategy ringRecover -logPath=/var/log/mindx-dl/container-manager/container-manager.log >/dev/null 2>&1 & (code=exited, status=0/SUCCESS)
       Main PID: 41464 (container-manag)
          Tasks: 10 (limit: 629145)
         Memory: 13.3M
         CGroup: /system.slice/container-manager.service
                 └─41464 /home/container-manager/container-manager run -ctrStrategy ringRecover
    ...
    

    A message similar to the following can be ignored. A possible cause is that the IP address and subnet mask of the RoCE NIC are not configured.

    [dsmi_common_interface.c:1017][ascend][curpid:244135,244135][drv][dmp][dsmi_get_device_ip_address]devid 0 dsmi_cmd_get_device_ip_address return 1 error!
  2. View component logs.
    cat /var/log/mindx-dl/container-manager/container-manager.log

    Command output (using the Atlas 800I A3 SuperPoD Server as an example):

    [INFO]     2025/11/25 22:46:59.007163 1       hwlog/api.go:108    container-manager.log's logger init success
    [INFO]     2025/11/25 22:46:59.007288 1       command/run.go:150    init log success
    [INFO]     2025/11/25 22:46:59.007506 1       devmanager/devmanager.go:134    get card list from dcmi reset timeout is 60
    [INFO]     2025/11/25 22:46:59.250103 1       devmanager/devmanager.go:142    deviceManager get cardList is [0 1 2 3 4 5 6 7], cardList length equal to cardNum: 8
    [INFO]     2025/11/25 22:46:59.250267 1       devmanager/devmanager.go:171    the dcmi version is 25.5.0.b030
    [INFO]     2025/11/25 22:46:59.250405 1       devmanager/devmanager.go:235    chipName: Ascend910, devType: Ascend910A3
    ...
    

    If the following information is displayed, the component is running properly:

    ...
    [INFO]     2025/11/25 22:46:59.289352 1       devmgr/workflow.go:57    init module <hwDev manager> success
    [INFO]     2025/11/25 22:46:59.293773 1       app/config.go:40    load fault config from faultCode.json success
    [INFO]     2025/11/25 22:46:59.293866 1       app/workflow.go:50    init module <fault manager> success
    [INFO]     2025/11/25 22:46:59.293901 1       app/workflow.go:76    init module <container controller> success
    [INFO]     2025/11/25 22:46:59.293930 1       app/workflow.go:64    init module <reset-manager> success
    [INFO]     2025/11/25 22:46:59.315101 378     devmgr/hwdevmgr.go:365    subscribe device fault event success
    ...
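
The two checks above can also be scripted. The following is a minimal sketch, not part of the product: the function names check_service and check_log are hypothetical, and the marker strings are copied from the sample log shown above, so they may differ in other versions.

```shell
#!/bin/sh
# Sketch only: automates the two manual verification steps.
# check_service and check_log are hypothetical helper names.

# Step 1: return 0 if systemd reports the unit as active.
check_service() {
    systemctl is-active --quiet "${1:-container-manager.service}"
}

# Step 2: return 0 only if every expected marker appears in the log file.
# Assumption: the markers match the sample log; adjust them for your version.
check_log() {
    log_file="$1"
    missing=0
    for marker in \
        "init module <hwDev manager> success" \
        "init module <fault manager> success" \
        "init module <container controller> success" \
        "init module <reset-manager> success" \
        "subscribe device fault event success"
    do
        # -F treats the marker as a fixed string, not a regular expression.
        if ! grep -qF -- "$marker" "$log_file"; then
            echo "missing: $marker" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, run check_service && check_log /var/log/mindx-dl/container-manager/container-manager.log on the node where Container Manager is deployed. A non-zero exit status means that one of the checks failed, and the missing log markers are printed to standard error.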