Recovery of Inference Card Faults

Function Highlights

If an inference NPU resource managed by the cluster scheduling components is faulty, a hot reset is performed on the faulty resource to restore the NPU.

Required Components

  • Volcano or other schedulers
  • Ascend Device Plugin
  • Ascend Docker Runtime
  • ClusterD
  • NodeD

Instructions