Rescheduling Cannot Be Triggered by Processor Faults Even Though the Configuration Is Correct

Symptom

Even when the rescheduling feature is correctly configured, a processor fault sometimes fails to trigger rescheduling.

Possible Causes

This problem is caused by an issue in the open-source software. After a processor becomes faulty, it is subtracted from the number of allocatable NPUs on the node. The pod to be restarted then uses more NPUs than the node has allocatable, and allocation fails. As a result, volcano-scheduler sets the node to notReady in its cache and does not pass the node to Ascend Volcano Plugin, so the rescheduling feature cannot be triggered.
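The condition described above can be illustrated with a minimal sketch. It assumes the node and pod objects were exported as JSON (for example with kubectl get node and kubectl get pod with the -o json option) and that the NPU resource is named huawei.com/Ascend910; the resource name depends on the processor model, and the sample values below are hypothetical, not taken from a real cluster.

```python
# Assumption: the NPU resource name depends on the processor model;
# "huawei.com/Ascend910" is only an illustrative default.
NPU_RESOURCE = "huawei.com/Ascend910"


def requested_npus(pod, resource=NPU_RESOURCE):
    """Sum the NPU requests of all containers in a pod object."""
    total = 0
    for container in pod.get("spec", {}).get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        total += int(requests.get(resource, 0))
    return total


def allocatable_npus(node, resource=NPU_RESOURCE):
    """Read the number of allocatable NPUs from a node object."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(resource, 0))


def exceeds_allocatable(pod, node, resource=NPU_RESOURCE):
    """Return True if the pod uses more NPUs than the node can allocate."""
    return requested_npus(pod, resource) > allocatable_npus(node, resource)


# Hypothetical sample: the pod uses 8 NPUs, but one processor is faulty,
# so the node reports only 7 allocatable NPUs.
sample_pod = {
    "spec": {
        "containers": [
            {"resources": {"requests": {NPU_RESOURCE: "8"}}}
        ]
    }
}
sample_node = {"status": {"allocatable": {NPU_RESOURCE: "7"}}}

if exceeds_allocatable(sample_pod, sample_node):
    print("Pod uses more NPUs than the node has allocatable.")
```

If the check returns True for a pod that should have been rescheduled, the situation matches the cause described in this section.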

Solution

This problem rarely occurs. If it does, use either of the following methods:

Method 1: Manually delete the pod. Log in to the system backend and run kubectl delete pod pod_name -n namespace to delete the pod, where pod_name is the name of the pod and namespace is the namespace it belongs to.

Method 2: Submit the job again.
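Method 1 can also be scripted. The following is a minimal sketch, assuming kubectl is installed and configured on the host where the script runs. The helper names delete_pod_command and delete_pod, and the pod and namespace names in the usage line, are hypothetical. By default the helper only prints the command instead of running it.

```python
import subprocess


def delete_pod_command(pod_name, namespace):
    """Build the kubectl argument list that deletes one pod in a namespace."""
    return ["kubectl", "delete", "pod", pod_name, "-n", namespace]


def delete_pod(pod_name, namespace, dry_run=True):
    """Print the delete command, and run it only if dry_run is False."""
    cmd = delete_pod_command(pod_name, namespace)
    if dry_run:
        # Show what would be executed without touching the cluster.
        print(" ".join(cmd))
    else:
        # check=True raises CalledProcessError if kubectl exits with an error.
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical names; replace them with the actual pod and namespace.
delete_pod("example-pod", "example-namespace")
# prints: kubectl delete pod example-pod -n example-namespace
```

Call delete_pod with dry_run=False only after confirming that the printed command targets the intended pod.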
