(Optional) Viewing the Fault Recovery Result of an Inference Card

When an NPU becomes faulty, Volcano automatically reschedules the inference jobs running on that NPU to other nodes. (Other schedulers do not support this function; users must implement it themselves.) Ascend Device Plugin then resets the NPU to restore it to a healthy state. To check the result, run the npu-smi info command to view NPU information. If the Health field of the previously faulty NPU is OK, the NPU has recovered.
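
The check can be wrapped in a small shell helper. This is a minimal sketch, assuming npu-smi is installed on the node and available on PATH. The function name check_npu is hypothetical and used only for illustration. It prints the npu-smi info table as is and leaves reading the Health column to the operator.

```shell
# check_npu (hypothetical helper name): print NPU status with npu-smi.
# Returns 1 if npu-smi cannot be found on PATH.
check_npu() {
    if command -v npu-smi >/dev/null 2>&1; then
        # Show the status table; inspect the Health column of the affected NPU.
        npu-smi info
    else
        echo "npu-smi not found on PATH" >&2
        return 1
    fi
}
```

Run check_npu on the node that hosts the NPU after the reset has had time to complete, and confirm that the Health value of that NPU reads OK.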

Before Ascend Device Plugin resets an NPU, ensure that either no inference job exists on the faulty NPU or the inference jobs on it have already been rescheduled to other nodes. If you use another scheduler that does not support rescheduling, manually delete the inference jobs on the NPU.
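
Manual deletion could look like the following sketch. It assumes the inference job was created with kubectl from a YAML manifest. The function name delete_inference_job, the DRY_RUN switch, and the manifest path are hypothetical and not part of Ascend Device Plugin or Volcano. Deleting by manifest removes the whole workload, so a controller does not recreate the pod on the same NPU.

```shell
# delete_inference_job MANIFEST (hypothetical helper name)
# Deletes the workload that was created from MANIFEST, the path to your job YAML.
# Set DRY_RUN=1 to print the command instead of running it.
delete_inference_job() {
    manifest="${1:-}"
    if [ -z "$manifest" ]; then
        echo "usage: delete_inference_job <manifest.yaml>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: only show what would be executed.
        echo "kubectl delete -f $manifest"
    else
        kubectl delete -f "$manifest"
    fi
}
```

For example, DRY_RUN=1 delete_inference_job infer.yaml prints the command for review, where infer.yaml stands in for your own manifest file.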