Container Manager

Container Manager is deployed as a binary that runs directly on a physical machine.

  1. Log in to the server as the root user.
  2. Upload the obtained Container Manager package to any directory on the server (for example, /home/container-manager).
  3. Go to the /home/container-manager directory and decompress the package.
    unzip Ascend-mindxdl-container-manager_{version}_linux-{arch}.zip

    {version} indicates the package version, and {arch} indicates the CPU architecture.

  4. (Optional) Create a custom fault code configuration file to customize the fault processing level. For details, see "(Optional) Processor Fault Level Configuration". The following steps assume that no such file is used.
  5. Create and edit the container-manager.service file.
    1. Run the following command to create container-manager.service:
      vi container-manager.service
    2. Write the following content to container-manager.service. In the ExecStart field, the command passed to /bin/bash -c is the startup command. For details about the startup parameters, see Table 1. You can modify the parameters as required.

      [Unit]
      Description=Ascend container manager
      Documentation=hiascend.com
      
      [Service]
      ExecStart=/bin/bash -c "container-manager run -ctrStrategy ringRecover -logPath=/var/log/mindx-dl/container-manager/container-manager.log >/dev/null  2>&1 &"
      Restart=always
      RestartSec=2
      KillMode=process
      Environment="GOGC=50"
      Environment="GOMAXPROCS=2"
      Environment="GODEBUG=madvdontneed=1"
      Type=forking
      User=root
      Group=root
      
      [Install]
      WantedBy=multi-user.target
    3. Press Esc and enter :wq! to save the settings and exit.
  6. Create and edit the container-manager.timer file. A timer that delays the start of Container Manager helps ensure that the NPU is ready when Container Manager starts.
    1. Run the following command to create container-manager.timer:
      vi container-manager.timer
    2. Write the following information into container-manager.timer.
      [Unit]
      Description=Timer for container manager Service
      
      [Timer]
      # Set a delay for starting Container Manager. Adjust the time as required.
      OnBootSec=60s 
      Unit=container-manager.service
      
      [Install]
      WantedBy=timers.target
    3. Press Esc and enter :wq! to save the settings and exit.
  7. Run the following commands to install and start the Container Manager service:
    # Install the Container Manager binary file.
    cp container-manager /usr/local/bin
    chmod 500 /usr/local/bin/container-manager
    
    # Install the Container Manager service and timer files.
    cp container-manager.service /etc/systemd/system
    cp container-manager.timer /etc/systemd/system      
    
    # Start the Container Manager service.
    systemctl enable container-manager.service 
    systemctl enable container-manager.timer 
    systemctl start container-manager.service
    systemctl start container-manager.timer
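
Before the two files are copied to /etc/systemd/system in step 7, it can help to confirm that they contain the intended values. The following is a minimal sketch and is not part of the Container Manager package: unit_value is a hypothetical helper that prints the value of one key in one section of a unit file. It assumes a POSIX shell with awk and does not handle the full systemd syntax (for example, line continuations).

```shell
#!/bin/sh
# unit_value FILE SECTION KEY
# Hypothetical helper: print the value(s) of KEY inside [SECTION] of a
# systemd unit file. Comment lines and other sections are skipped.
unit_value() {
    awk -v section="[$2]" -v key="$3" '
        {
            line = $0
            gsub(/^[ \t]+|[ \t]+$/, "", line)      # trim surrounding blanks
        }
        line ~ /^[#;]/ { next }                     # skip comment lines
        line ~ /^\[.*\]$/ { in_section = (line == section); next }
        in_section && index(line, key "=") == 1 {
            print substr(line, length(key) + 2)     # text after "KEY="
        }
    ' "$1"
}

# Example usage, run in the directory that holds the files from steps 5 and 6:
#   unit_value container-manager.timer Timer OnBootSec       # start delay
#   unit_value container-manager.service Service ExecStart   # startup command
```

A key that appears several times in a section, such as Environment in the service file, is printed once per occurrence.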

Parameter Description

Table 1 Container Manager startup parameters

Command | Parameter | Type | Default Value | Description

help | - | - | - | Help information.

version | - | - | - | Container Manager version information.

run | -logPath | String | /var/log/mindx-dl/container-manager/container-manager.log | Log file.
  NOTE:
  If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. The name of the dumped file is in the format "container-manager-{dump triggering time}.log", for example, container-manager-2025-11-07T03-38-24.402.log.

run | -logLevel | Integer | 0 | Log level.
  • -1: debug
  • 0: info
  • 1: warning
  • 2: error
  • 3: critical

run | -maxAge | Integer | 7 | Log backup duration, in days. The value range is [7, 700].

run | -maxBackups | Integer | 30 | Maximum number of log files that can be retained after dump. The value range is (0, 30].

run | -ctrStrategy | String | never | Faulty container startup/stopping strategy:
  • never: The container is not started or stopped.
  • singleRecover: Only the container with a faulty processor mounted is started or stopped. When a fault occurs, the container is stopped. After the fault is rectified, the container is started again.
  • ringRecover: Only the container that has all processors associated with the faulty processor mounted is started or stopped. When a fault occurs, the container is stopped. After the fault is rectified, the container is started again.
  NOTE:
  • Container Manager starts or stops the container only when detecting that the processor is in the RestartRequest, RestartBusiness, FreeRestartNPU, or RestartNPU status. For details about the fault types, see Fault Code Level Description.
  • If singleRecover or ringRecover is configured, the container cannot be automatically restarted via container runtime.
  • If the container is stopped manually, the memory data of Container Manager may be disordered, and the container status may be abnormal.

run | -sockPath | String | /run/docker.sock | Socket file of the container runtime. The path cannot be a soft link.

run | -runtimeType | String | docker | Container runtime type.
  • docker
  • containerd
  NOTE:
  • Container Manager can manage only containers started by one container runtime.
  • When containerd is used as the runtime, only containers not in the moby namespace can be managed. Containers with identical names across namespaces may cause abnormal management behavior.

run | -faultConfigPath | String | "" | Path of the custom fault configuration file. If it is not configured, the default fault code configuration is used. For more details, see Fault Level Configuration.
  NOTE:
  • The path cannot be a soft link.
  • The file permission cannot be higher than 640.

status | - | - | - | Container recovery progress, including the container ID, status, time when the status starts, and status description. For details about the container status definition and change rules, see Container Recovery.
  NOTE:
  If the container information queried by status is incorrect, check whether the run service has been stopped or whether more than one Container Manager is started in the environment.
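
The value ranges in Table 1 can be checked before a value is written into the ExecStart field. The following is a minimal sketch under stated assumptions: is_int, in_range, and check_run_param are hypothetical helper names, not part of Container Manager, and the ranges are copied from Table 1. Parameters without a documented range, such as -logPath, are accepted without checks.

```shell
#!/bin/sh
# Hypothetical pre-check helpers; all ranges are copied from Table 1.

# is_int VALUE: succeed if VALUE is a decimal integer with an optional
# leading minus sign.
is_int() {
    case "$1" in
        ''|-)      return 1 ;;   # empty string or a lone minus sign
        *[!0-9-]*) return 1 ;;   # any character other than a digit or "-"
        ?*-*)      return 1 ;;   # a minus sign anywhere but the front
        *)         return 0 ;;
    esac
}

# in_range VALUE MIN MAX: succeed if MIN <= VALUE <= MAX.
in_range() {
    is_int "$1" && [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_run_param NAME VALUE: validate one parameter of the run command.
check_run_param() {
    case "$1" in
        -logLevel)   in_range "$2" -1 3 ;;
        -maxAge)     in_range "$2" 7 700 ;;
        -maxBackups) in_range "$2" 1 30 ;;    # (0, 30] for integers
        -ctrStrategy)
            case "$2" in
                never|singleRecover|ringRecover) return 0 ;;
                *) return 1 ;;
            esac ;;
        -runtimeType)
            case "$2" in
                docker|containerd) return 0 ;;
                *) return 1 ;;
            esac ;;
        *) return 0 ;;   # no documented value range, not checked here
    esac
}

# Example usage:
#   check_run_param -maxAge 3 || echo "-maxAge must be in [7, 700]"
```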

To change the startup parameters after Container Manager has been started, edit the ExecStart field in the service file (/etc/systemd/system/container-manager.service in the preceding steps), and then run the following command to reload the configuration and restart Container Manager:

systemctl daemon-reload && systemctl restart container-manager
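
When the changed parameters include -sockPath or -faultConfigPath, the path rules in Table 1 can be checked before the restart. The following is a minimal sketch, not an official tool: not_symlink, mode_within_640, and check_fault_config are hypothetical helper names. Reading "cannot be higher than 640" as "no permission bit outside rw-r-----" is an assumption, and the sketch relies on GNU stat (stat -c %a).

```shell
#!/bin/sh
# Hypothetical pre-restart checks for the path rules in Table 1.

# not_symlink PATH: succeed if PATH exists and is not a soft link.
not_symlink() {
    [ -e "$1" ] && [ ! -L "$1" ]
}

# mode_within_640 FILE: succeed if FILE has no permission bit outside 640.
# Assumption: "not higher than 640" is treated as a bit mask.
mode_within_640() {
    mode=$(stat -c %a "$1") || return 1
    # The leading 0 makes the shell read the mode as an octal number.
    [ $(( 0$mode & ~0640 & 07777 )) -eq 0 ]
}

# check_fault_config FILE: apply both -faultConfigPath rules.
check_fault_config() {
    not_symlink "$1" && mode_within_640 "$1"
}

# Example usage (FAULT_CONFIG is a placeholder for your own file path):
#   not_symlink /run/docker.sock       || echo "-sockPath must not be a soft link"
#   check_fault_config "$FAULT_CONFIG" || echo "check -faultConfigPath"
```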