Rescheduling Upon Inference Card Faults

Function Highlights

If an inference NPU resource managed by the cluster scheduling components becomes faulty, the faulty resource is isolated and the workloads that were using it are automatically rescheduled.
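The isolate-then-reschedule behavior can be pictured with a small toy model. This is only an illustrative sketch, not how the cluster scheduling components are implemented: the `Npu` and `Cluster` classes, their methods, and the one-task-per-NPU placement rule are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Npu:
    """A hypothetical inference NPU tracked by the toy scheduler."""
    name: str
    healthy: bool = True
    isolated: bool = False


@dataclass
class Cluster:
    """Toy cluster state: known NPUs plus a task -> NPU placement map."""
    npus: dict = field(default_factory=dict)
    placement: dict = field(default_factory=dict)

    def add_npu(self, name):
        self.npus[name] = Npu(name)

    def schedule(self, task):
        """Place a task on the first NPU that is healthy, not isolated, and free."""
        busy = set(self.placement.values())
        for npu in self.npus.values():
            if npu.healthy and not npu.isolated and npu.name not in busy:
                self.placement[task] = npu.name
                return npu.name
        return None  # no capacity left; the task stays pending

    def report_fault(self, name):
        """Isolate the faulty NPU, then reschedule every task that was on it."""
        npu = self.npus[name]
        npu.healthy = False
        npu.isolated = True
        displaced = [t for t, n in self.placement.items() if n == name]
        for task in displaced:
            del self.placement[task]
        # Map each displaced task to its new NPU (or None if none is available).
        return {task: self.schedule(task) for task in displaced}
```

In this model, a task placed on `npu-0` moves to `npu-1` once a fault is reported on `npu-0`, and `npu-0` is excluded from all later placements. If no healthy NPU remains, the task is left pending.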

Required Components

  • Ascend Device Plugin
  • Ascend Docker Runtime
  • Ascend Operator
  • Volcano
  • ClusterD
  • NodeD

Instructions