A Job Is Pending Due to Insufficient Volcano Resources

Symptom

When Volcano schedules a job that requests more resources than are currently available in the environment, the job is not scheduled and remains in the Pending state. If the request exceeds the total resources of the cluster, the job stays pending indefinitely.

The following example uses the job mindx-dls-npu-16p in the custom namespace mindx-test. The job requests 16 NPUs, but the cluster has only 8.

Run the following command to view the job details:

kubectl describe vcjob -n mindx-test mindx-dls-npu-16p

root@ubuntu:/home/yaml# kubectl describe vcjob -n mindx-test   mindx-dls-npu-16p
Name:         mindx-dls-npu-16p
Namespace:    mindx-test
Labels:       ring-controller.atlas=ascend-910
...
  Min Available:  2
  ...
    Replicas:  2
    ...
          Resources:
            Limits:
              Cpu:                   10
              huawei.com/Ascend910:  8
              Memory:                20Gi
            Requests:
              Cpu:                   10
              huawei.com/Ascend910:  8
              Memory:                20Gi
...
Status:
  Controlled Resources:
    Plugin - Env:  env
    Plugin - Ssh:  ssh
    Plugin - Svc:  svc
  Min Available:   2
  State:
    Last Transition Time:  2021-02-09T07:38:04Z
    Phase:                 Pending
Events:                    <none>

The Phase field under Status shows that the job is in the Pending state.
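The shortfall can be read from the job details: 2 replicas each request 8 huawei.com/Ascend910 devices, which is 16 in total, while the cluster in this example has only 8 NPUs. The following is a minimal shell sketch of that arithmetic. The values are copied by hand from this example, and the variable names are illustrative only and not part of Volcano.

```shell
#!/bin/sh
# Values taken from the example above; adjust them for your own job and cluster.
replicas=2            # Replicas in the job spec
npus_per_replica=8    # huawei.com/Ascend910 request per replica
cluster_npus=8        # NPUs in the cluster, as stated in this example

# Total NPUs the job needs when all replicas run at the same time.
total_requested=$((replicas * npus_per_replica))

if [ "$total_requested" -gt "$cluster_npus" ]; then
  verdict="exceeds cluster capacity"
else
  verdict="fits within cluster capacity"
fi
echo "requested=$total_requested available=$cluster_npus: $verdict"
```

With these values, the script reports a request of 16 against 8 available NPUs. Because Min Available is 2, both replicas are expected to be placed together, so the job cannot start with only one of them.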

Possible Causes

The cluster does not have enough resources for the job, and volcano-scheduler does not stop trying to schedule it, so the job stays in the Pending state instead of being terminated.

Solution

  1. Before submitting a job, ensure that the cluster has sufficient resources for it.
  2. If the job has already been submitted and is in the Pending state, run the kubectl get vcjob -n namespace command to find it.
    root@ubuntu:/home/yaml# kubectl get vcjob -n mindx-test
    NAME                AGE
    mindx-dls-npu-16p   6m10s
  3. Run the kubectl delete vcjob mindx-dls-npu-16p -n namespace command to delete the pending vcjob.
    root@ubuntu:/home/yaml# kubectl delete vcjob mindx-dls-npu-16p -n mindx-test
    job.batch.volcano.sh "mindx-dls-npu-16p" deleted
    • mindx-dls-npu-16p is the name of the vcjob to delete.
    • mindx-test is the namespace to which the job belongs.
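
For step 1, one possible way to check capacity before submitting a job is to add up the allocatable huawei.com/Ascend910 count reported by each node. The following sketch is an illustration under stated assumptions, not an official procedure. sum_npus is a hypothetical helper introduced here. The allocatable value is the total number of NPUs a node can schedule, not the number currently free, so NPUs held by running jobs still need to be subtracted.

```shell
#!/bin/sh
# sum_npus (hypothetical helper): adds up integer NPU counts read from stdin,
# one value per line. Non-numeric lines, such as the empty output produced for
# nodes without NPUs, are skipped.
sum_npus() {
  awk '$1 ~ /^[0-9]+$/ { total += $1 } END { print total + 0 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.status.allocatable.huawei\.com/Ascend910}{"\n"}{end}' \
#     | sum_npus
```

If the printed total is smaller than the number of NPUs the job requests, the job cannot be scheduled on this cluster regardless of what else is running.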