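Upgrading Container Manager

Upgrade the component by directly replacing the Container Manager binary on the physical machine.

  1. Log in as the root user to the node where the Container Manager component is deployed.
  2. Upload the obtained Container Manager software package to any directory on the server (for example, "/tmp/container-manager").
  3. Go to the "/tmp/container-manager" directory and decompress the package.
    unzip Ascend-mindxdl-container-manager_{version}_linux-{arch}.zip

    {version} is the version number of the software package; {arch} is the CPU architecture.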

  4. Run the following commands in sequence to upgrade the Container Manager component.
    # Stop the Container Manager system service and delete the existing Container Manager binary
    systemctl stop container-manager.service
    chattr -i /usr/local/bin/container-manager
    rm -f /usr/local/bin/container-manager
    
    # Copy the new binary from the extracted files to replace the old Container Manager binary
    cp /tmp/container-manager/container-manager /usr/local/bin
    chmod 500 /usr/local/bin/container-manager
    
    # Restart the Container Manager system service
    systemctl daemon-reload
    systemctl start container-manager.service
  5. Verify the upgrade status of the Container Manager component.
    1. Check the status of the component service. The status should be active (running).
      systemctl status container-manager.service

      Example output:

       container-manager.service - Ascend container manager
           Loaded: loaded (/etc/systemd/system/container-manager.service; disabled; vendor preset: enabled)
           Active: active (running) since Wed 2025-11-26 20:56:50 UTC; 16s ago
          Process: 41459 ExecStart=/bin/bash -c container-manager run  -ctrStrategy ringRecover -logPath=/var/log/mindx-dl/container-manager/container-manager.log >/dev/null 2>&1 & (code=exited, status=0/SUCCESS)
         Main PID: 41464 (container-manag)
            Tasks: 10 (limit: 629145)
           Memory: 13.3M
           CGroup: /system.slice/container-manager.service
                   └─41464 /home/container-manager/container-manager run -ctrStrategy ringRecover
      ...
      
    2. Check the component log.
      cat /var/log/mindx-dl/container-manager/container-manager.log

      Example output, using an Atlas 800I A3 SuperPoD server as an example:

      [INFO]     2025/11/25 22:46:59.007163 1       hwlog/api.go:108    container-manager.log's logger init success
      [INFO]     2025/11/25 22:46:59.007288 1       command/run.go:150    init log success
      [INFO]     2025/11/25 22:46:59.007506 1       devmanager/devmanager.go:134    get card list from dcmi reset timeout is 60
      [INFO]     2025/11/25 22:46:59.250103 1       devmanager/devmanager.go:142    deviceManager get cardList is [0 1 2 3 4 5 6 7], cardList length equal to cardNum: 8
      [INFO]     2025/11/25 22:46:59.250267 1       devmanager/devmanager.go:171    the dcmi version is 25.5.0.b030
      [INFO]     2025/11/25 22:46:59.250405 1       devmanager/devmanager.go:235    chipName: Ascend910, devType: Ascend910A3
      ...
      

      If the following log messages appear, the component is running normally.

      ...
      [INFO]     2025/11/25 22:46:59.289352 1       devmgr/workflow.go:57    init module <hwDev manager> success
      [INFO]     2025/11/25 22:46:59.293773 1       app/config.go:40    load fault config from /home/faultCode.json success
      [INFO]     2025/11/25 22:46:59.293866 1       app/workflow.go:50    init module <fault manager> success
      [INFO]     2025/11/25 22:46:59.293901 1       app/workflow.go:76    init module <container controller> success
      [INFO]     2025/11/25 22:46:59.293930 1       app/workflow.go:64    init module <reset-manager> success
      [INFO]     2025/11/25 22:46:59.315101 378     devmgr/hwdevmgr.go:365    subscribe device fault event success
      ...
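
The log check in the verification step can also be scripted rather than read by eye. The following is a minimal sketch, not part of the product: the function name check_cm_log is hypothetical, and the list of expected messages is copied from the example output above, which is elided with "..." and may differ between versions or hardware models.

```shell
#!/bin/sh
# Sketch only: check_cm_log is a hypothetical helper, not a Container Manager tool.
# It reports whether a log file contains the success messages shown in the
# example output of the verification step.

# Usage: check_cm_log LOGFILE
# Returns 0 if every expected message is found, 1 otherwise.
check_cm_log() {
    log_file="$1"
    missing=0
    # Assumption: these messages are taken verbatim from the example output;
    # the real log may contain more or differently worded lines.
    for marker in \
        "init module <hwDev manager> success" \
        "init module <fault manager> success" \
        "init module <container controller> success" \
        "init module <reset-manager> success" \
        "subscribe device fault event success"
    do
        # -F matches the marker as a fixed string, -q suppresses output.
        if ! grep -qF -- "$marker" "$log_file"; then
            echo "missing: $marker"
            missing=1
        fi
    done
    return "$missing"
}
```

A possible way to combine it with the service status check, assuming the log path used in this section:

    systemctl is-active --quiet container-manager.service && check_cm_log /var/log/mindx-dl/container-manager/container-manager.log

The function only searches for fixed strings, so it does not distinguish messages from the current start of the service from those left by earlier runs in the same log file.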