Container Recovery

Function Highlights

In the scenario where Kubernetes is not deployed, if the training or inference process is abnormal, you can configure the container recovery function to recover containers.

  • Fault detection: Container Manager is used to detect job faults.
  • Fault rectification: After a fault occurs, the faulty device can be automatically recovered without manual intervention.
  • Container recovery: When a fault occurs, the container is stopped. After the fault is rectified, the container is started again.

Required Component

Container Manager

Instructions

  1. Refer to Installation and Deployment for component installation.
  2. Refer to Appliance Feature Guide for feature usage.