Container Recovery
When Container Manager detects a fault whose handling level is RestartRequest, RestartBusiness, FreeRestartNPU, or RestartNPU, it stops and then recovers the container according to the policy specified by -ctrStrategy in the run command. For details, see Table 1.
The container status changes as follows while the container is being stopped and recovered.
- When the container is being stopped, the container status is pausing. When this status lasts for more than 30s, the container description queried by the status command is "Container pause may fail. Please manually delete the container".
- After the container is stopped, the container status changes to paused. When this status lasts for more than 400s, the container description queried by the status command is "Device hot reset may fail. Please check of device status and recovery are required".
- When the container is being resumed, the container status is resuming. When this status lasts for more than 30s, the container description queried by the status command is "The device has been recovered, but the container failed to be resumed. Please manually pull up the container".
- In other cases, the container status is running with the description "normal". The start time queried by the status command corresponds either to when Container Manager detects the container startup or to when the container is recovered.
- Container Manager recovers only containers that it stopped itself.
- The preceding container statuses are defined by Container Manager. They are not official statuses of the container runtime.
- If containerd is used and the container task does not exist, the container fails to be stopped.
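The mapping from status and elapsed time to the description listed above can be sketched as follows. This is a minimal illustration only, not Container Manager's implementation: the function name describe, the table _TIMEOUTS, and the parameter seconds_in_status are hypothetical, and the thresholds and message strings are copied verbatim from the list. The source does not state which description is shown before a threshold is exceeded, so the sketch returns None in that case.

```python
from typing import Optional

# Thresholds (seconds) and messages copied from the status list above.
# The structure and all names here are hypothetical, for illustration only.
_TIMEOUTS = {
    "pausing": (
        30,
        "Container pause may fail. Please manually delete the container",
    ),
    "paused": (
        400,
        "Device hot reset may fail. Please check of device status and recovery are required",
    ),
    "resuming": (
        30,
        "The device has been recovered, but the container failed to be resumed. "
        "Please manually pull up the container",
    ),
}


def describe(status: str, seconds_in_status: float) -> Optional[str]:
    """Return the description expected for a status held for the given time."""
    if status == "running":
        return "normal"
    if status not in _TIMEOUTS:
        raise ValueError(f"unknown container status: {status}")
    limit, warning = _TIMEOUTS[status]
    if seconds_in_status > limit:
        return warning
    # Assumption: the source does not say what is shown before the timeout.
    return None


if __name__ == "__main__":
    for status, elapsed in [("pausing", 45), ("paused", 120), ("running", 0)]:
        print(status, elapsed, describe(status, elapsed))
```

A table keyed by status keeps the three timeout rules in one place, so a monitoring script that polls the status command could compare its own elapsed-time measurement against the same thresholds.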