Feature Description

If Kubernetes is not deployed, abnormal training or inference processes cannot be effectively recovered. To address this limitation, enable the container recovery feature. Therefore, you need to install Container Manager. For details about how to install Container Manager, see Installation and Deployment.

Function

Description

Principles and Configuration Procedure

Fault detection

More than 350 hardware faults can be detected in real time.

Fault Detection

Fault handling

The faulty device can be automatically recovered without manual intervention if a fault at the RestartRequestCodes, RestartBusinessCodes, FreeRestartNPUCodes, or RestartNPUCodes level occurs.

Fault Handling

Container recovery

You are allowed to configure container start and stopping policies. If a fault at the RestartRequestCodes, RestartBusinessCodes, FreeRestartNPUCodes, or RestartNPUCodes level occurs, the container is stopped and then restarted after the fault is rectified.

Container Recovery

This feature does not apply to computing power virtualization and does not support shared devices or mixed insertion mode.