Feature Description
If Kubernetes is not deployed, abnormal training or inference processes cannot be effectively recovered. The container recovery feature addresses this limitation. To use it, install Container Manager. For installation details, see Installation and Deployment.
| Function | Description | Principles and Configuration Procedure |
|---|---|---|
| Fault detection | More than 350 hardware faults can be detected in real time. | |
| Fault handling | If a fault at the RestartRequestCodes, RestartBusinessCodes, FreeRestartNPUCodes, or RestartNPUCodes level occurs, the faulty device is recovered automatically, without manual intervention. | |
| Container recovery | You can configure container start and stop policies. If a fault at the RestartRequestCodes, RestartBusinessCodes, FreeRestartNPUCodes, or RestartNPUCodes level occurs, the container is stopped and then restarted after the fault is rectified. | |
This feature does not apply to computing power virtualization and does not support shared devices or mixed insertion mode.
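
The container recovery behavior described above (stop the container on a fault at one of the four listed levels, restart it once the fault is rectified) can be illustrated with a small decision function. This is only a conceptual sketch, not Container Manager's implementation: the names `ContainerState`, `Action`, and `next_action` are hypothetical, and representing a healthy device as `None` is an assumption made for the example. Only the four fault level names come from this document.

```python
from dataclasses import dataclass
from enum import Enum

# Fault levels that this document lists as triggering stop-and-restart.
RECOVERABLE_LEVELS = frozenset({
    "RestartRequestCodes",
    "RestartBusinessCodes",
    "FreeRestartNPUCodes",
    "RestartNPUCodes",
})


class Action(Enum):
    """Hypothetical actions a recovery loop could take."""
    NONE = "none"
    STOP = "stop"
    START = "start"


@dataclass
class ContainerState:
    """Hypothetical bookkeeping for one container."""
    running: bool = True
    stopped_by_fault: bool = False


def next_action(state, fault_level):
    """Decide what to do with a container given the device's fault level.

    fault_level is the current fault level name, or None when the device
    is healthy (an assumption of this sketch).
    """
    if fault_level in RECOVERABLE_LEVELS:
        if state.running:
            # A recoverable fault occurred: stop the container.
            state.running = False
            state.stopped_by_fault = True
            return Action.STOP
        # Already stopped; wait for the fault to be rectified.
        return Action.NONE
    if fault_level is None and state.stopped_by_fault:
        # Fault rectified: restart the container that was stopped earlier.
        state.running = True
        state.stopped_by_fault = False
        return Action.START
    # Healthy and running, or a fault level outside the listed set.
    return Action.NONE
```

A polling loop could call `next_action` with the latest fault level for each device and apply the returned action through whatever container runtime is in use. Fault levels outside the four listed ones are deliberately left unhandled here, because this document does not describe their behavior.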