Recovery of Inference Card Faults
Function Highlights
If an inference NPU managed by the cluster scheduling components becomes faulty, a hot reset is performed on that NPU to restore it.
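The recovery flow described above can be pictured with a small, purely illustrative simulation. It is not the implementation used by the cluster scheduling components: the names `Npu`, `Health`, `hot_reset`, and `recover_faulty_npus` are hypothetical, and the reset here only flips an in-memory flag instead of touching real hardware.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Health(Enum):
    HEALTHY = "healthy"
    FAULTY = "faulty"


@dataclass
class Npu:
    """Hypothetical model of one inference NPU tracked by the scheduler."""
    device_id: int
    health: Health = Health.HEALTHY


def hot_reset(npu: Npu) -> bool:
    """Hypothetical stand-in for a hot reset; it always succeeds in this sketch."""
    npu.health = Health.HEALTHY
    return True


def recover_faulty_npus(npus: List[Npu]) -> List[int]:
    """Hot-reset every faulty NPU and return the IDs that were restored."""
    restored = []
    for npu in npus:
        # Only faulty devices are reset; healthy ones are left untouched.
        if npu.health is Health.FAULTY and hot_reset(npu):
            restored.append(npu.device_id)
    return restored


# Three simulated cards, one of which is marked faulty.
cards = [Npu(0), Npu(1, Health.FAULTY), Npu(2)]
restored_ids = recover_faulty_npus(cards)
```

In this sketch only the faulty card is reset, which mirrors the point that the hot reset targets the faulty resource rather than every NPU on the node.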
Required Components
- Volcano or other schedulers
- Ascend Device Plugin
- Ascend Docker Runtime
- ClusterD
- NodeD
Instructions
- Refer to Installation and Deployment for component installation.
- Refer to Recovery of Inference Card Faults for feature usage.