Container Manager
Application Scenario
When Kubernetes is not available, Volcano and Ascend Device Plugin cannot be used to stop and reschedule service containers, isolate faulty nodes, or reset NPUs if an inference or training process becomes abnormal. Container Manager provides container management and NPU reset for such non-Kubernetes deployments.
Component Function
- Subscribe to processor fault information from the driver, and store the processor status and fault information in the cache for subsequent container management and processor reset.
- Configure fault handling levels.
- Perform a hot reset on a faulty processor that is idle. The processor can recover after it restarts.
- If the faulty processor is being used by a container, stop the container according to the user's startup configuration. After the faulty processor is reset, restart the container.
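
The handling rules in the list above can be sketched as a small decision function. This is a minimal illustration only, not Container Manager's actual implementation: the names `Processor`, `Action`, `plan_action`, and `plan_steps` are hypothetical, and the sketch leaves out fault handling levels and assumes the user's startup configuration allows the container to be stopped.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Action(Enum):
    NONE = "none"                              # processor is healthy, nothing to do
    RESET_ONLY = "reset_only"                  # faulty and idle: hot reset directly
    STOP_RESET_RESTART = "stop_reset_restart"  # faulty and in use by a container


@dataclass
class Processor:
    # Hypothetical cache entry; field names are assumptions for illustration.
    device_id: int
    faulty: bool = False
    used_by: List[str] = field(default_factory=list)  # containers mounting this processor


def plan_action(proc: Processor) -> Action:
    """Choose how to handle one processor based on its cached state."""
    if not proc.faulty:
        return Action.NONE
    if not proc.used_by:
        return Action.RESET_ONLY
    return Action.STOP_RESET_RESTART


def plan_steps(proc: Processor) -> List[str]:
    """Expand the chosen action into an ordered list of human-readable steps."""
    action = plan_action(proc)
    if action is Action.NONE:
        return []
    steps: List[str] = []
    if action is Action.STOP_RESET_RESTART:
        steps.extend(f"stop container {c}" for c in proc.used_by)
    steps.append(f"hot reset processor {proc.device_id}")
    if action is Action.STOP_RESET_RESTART:
        steps.extend(f"start container {c}" for c in proc.used_by)
    return steps


if __name__ == "__main__":
    for step in plan_steps(Processor(2, faulty=True, used_by=["job-a"])):
        print(step)
```

Keeping the decision separate from the step expansion means the ordering (stop, then reset, then restart) is stated in one place and can be checked without touching a real device or container.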
Upstream and Downstream Dependencies
Figure 1 Upstream and downstream dependencies


- Obtain the processor type, quantity, and health status from the DCMI.
- Deliver a processor reset command to the DCMI.
- Obtain information about the running containers and the processors mounted to them from the container runtime (Docker or containerd).
- Deliver the commands for stopping and starting containers to the container runtime.
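
The four interactions listed above can be pictured as two narrow interfaces, one toward the DCMI and one toward the container runtime. The sketch below is an assumption-laden illustration: `ProcessorDriver`, `ContainerRuntime`, their method names, and the in-memory `FakeDriver` and `FakeRuntime` classes are all hypothetical and do not correspond to real DCMI, Docker, or containerd calls.

```python
from typing import Dict, List, Protocol, Set, Tuple


class ProcessorDriver(Protocol):
    """Stand-in for the DCMI side; method names are hypothetical."""
    def list_health(self) -> Dict[int, bool]: ...   # device ID -> healthy?
    def reset(self, device_id: int) -> None: ...


class ContainerRuntime(Protocol):
    """Stand-in for Docker or containerd; method names are hypothetical."""
    def mounted_processors(self) -> Dict[str, List[int]]: ...  # container -> device IDs
    def stop(self, container_id: str) -> None: ...
    def start(self, container_id: str) -> None: ...


class FakeDriver:
    """In-memory driver used only to exercise the sketch."""
    def __init__(self, health: Dict[int, bool]) -> None:
        self._health = dict(health)

    def list_health(self) -> Dict[int, bool]:
        return dict(self._health)

    def reset(self, device_id: int) -> None:
        self._health[device_id] = True  # assume the reset restores health


class FakeRuntime:
    """In-memory runtime used only to exercise the sketch."""
    def __init__(self, mounts: Dict[str, List[int]]) -> None:
        self._mounts = {c: list(d) for c, d in mounts.items()}
        self.running: Set[str] = set(mounts)

    def mounted_processors(self) -> Dict[str, List[int]]:
        return {c: list(d) for c, d in self._mounts.items()}

    def stop(self, container_id: str) -> None:
        self.running.discard(container_id)

    def start(self, container_id: str) -> None:
        self.running.add(container_id)


def recover(driver: ProcessorDriver, runtime: ContainerRuntime) -> List[Tuple[str, object]]:
    """One pass: query both dependencies, then issue stop, reset, and start commands."""
    log: List[Tuple[str, object]] = []
    mounts = runtime.mounted_processors()
    for device_id, healthy in sorted(driver.list_health().items()):
        if healthy:
            continue
        users = [c for c, devs in sorted(mounts.items()) if device_id in devs]
        for c in users:
            runtime.stop(c)
            log.append(("stop", c))
        driver.reset(device_id)
        log.append(("reset", device_id))
        for c in users:
            runtime.start(c)
            log.append(("start", c))
    return log


if __name__ == "__main__":
    print(recover(FakeDriver({0: True, 1: False}), FakeRuntime({"job-a": [1], "job-b": [0]})))
```

Expressing the dependencies as protocols keeps the recovery logic independent of whether the runtime is Docker or containerd, which matches the way the list treats them as interchangeable.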