Rescheduling Cannot Be Triggered by NPU Processor Faults Even When the Configuration Is Correct

Symptom

Even though the rescheduling feature is correctly configured, rescheduling is occasionally not triggered when an NPU processor fault occurs.

Cause Analysis

This issue is caused by the open-source software. Because a faulty NPU cannot be allocated, the number of NPUs used by the pod to be restarted on the node is greater than the number of allocatable NPUs on the node. volcano-scheduler therefore sets the node to notReady in its cache and does not pass the node to ascend-volcano-plugin, so the rescheduling feature cannot be triggered.
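The condition described above can be illustrated with a simplified model. This is only a sketch of the comparison, not the actual volcano-scheduler code; the NodeView class, its fields, and is_ready_in_cache are hypothetical names introduced for illustration, and the 8-NPU node is an assumed example.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """Hypothetical, simplified view of a node as seen by the scheduler cache."""
    total_npus: int   # NPUs physically present on the node (assumed value)
    faulty_npus: int  # NPUs currently reported as faulty
    used_npus: int    # NPUs held by pods still bound to the node

    @property
    def allocatable_npus(self) -> int:
        # A faulty NPU cannot be allocated, so it is excluded here.
        return self.total_npus - self.faulty_npus


def is_ready_in_cache(node: NodeView) -> bool:
    """Return False when used NPUs exceed allocatable NPUs.

    In that state the node is treated as notReady and, per the cause
    analysis, is never handed to the rescheduling plugin.
    """
    return node.used_npus <= node.allocatable_npus


# Healthy node: the pod uses all 8 NPUs and all 8 are allocatable.
healthy = NodeView(total_npus=8, faulty_npus=0, used_npus=8)

# One NPU fails: the pod still holds 8, but only 7 are allocatable.
after_fault = NodeView(total_npus=8, faulty_npus=1, used_npus=8)
```

In this model, the node stays in the inconsistent state until the pod that still holds the faulty NPU is removed, which is why the solution is to delete the pod or deliver the job again.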

Solution

This problem occurs with low probability. If it occurs, use either of the following methods:

Method 1: Manually delete the pod. Log in to the system background and run the kubectl delete pod -n <namespace> <pod-name> command, replacing <namespace> and <pod-name> with the namespace and name of the affected pod.
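If the deletion in Method 1 needs to be scripted, a minimal Python sketch might look like the following. It assumes kubectl is installed and configured on the host where the script runs; the function names, and the namespace and pod name in the usage comment, are hypothetical placeholders.

```python
import subprocess
from typing import List


def build_delete_pod_command(namespace: str, pod_name: str) -> List[str]:
    """Build the argument list for: kubectl delete pod -n <namespace> <pod-name>."""
    if not namespace or not pod_name:
        raise ValueError("namespace and pod_name must be non-empty")
    return ["kubectl", "delete", "pod", "-n", namespace, pod_name]


def delete_pod(namespace: str, pod_name: str) -> None:
    """Run the delete command; raises CalledProcessError if kubectl fails."""
    subprocess.run(build_delete_pod_command(namespace, pod_name), check=True)


# Example usage (placeholder values, not taken from a real cluster):
#   delete_pod("my-namespace", "my-job-worker-0")
```

Passing the arguments as a list rather than a single shell string avoids shell quoting problems with unusual pod names.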

Method 2: Deliver the job again.
