A Job Is Pending Due to Insufficient Resources

Symptom

When Volcano schedules jobs, a job that requests more resources than are currently available in the environment is not scheduled and stays in the Pending status. If the requested resources exceed the total resources of the cluster, the job stays in the Pending status indefinitely.

The following uses the job mindx-dls-npu-16p in the custom namespace mindx-test as an example. (This job requests 16 NPUs, that is, 2 replicas with 8 NPUs each, but the cluster has only 8 NPUs.)

Run the following command to view the job details:
kubectl describe vcjob -n mindx-test mindx-dls-npu-16p

Command output:

Name:         mindx-dls-npu-16p
Namespace:    mindx-test
Labels:       ring-controller.atlas=ascend-910
...
  Min Available:  2
  ...
    Replicas:  2
    ...
          Resources:
            Limits:
              Cpu:                   10
              huawei.com/Ascend910:  8
              Memory:                20Gi
            Requests:
              Cpu:                   10
              huawei.com/Ascend910:  8
              Memory:                20Gi
...
Status:
  Controlled Resources:
    Plugin - Env:  env
    Plugin - Ssh:  ssh
    Plugin - Svc:  svc
  Min Available:   2
  State:
    Last Transition Time:  2021-02-09T07:38:04Z
    Phase:                 Pending
Events:                    <none>

The Phase field in the preceding output shows that the job is in the Pending status, and no events have been recorded for it.
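
To read only the phase instead of scanning the full describe output, the field can be queried directly. The following is a minimal sketch: the function name vcjob_phase is hypothetical, and the field path .status.state.phase is inferred from the Status section of the output above (Status, State, Phase).

```shell
#!/bin/sh
# Hypothetical helper: prints the phase of a Volcano job (for example, Pending).
# Arguments: namespace, job name.
vcjob_phase() {
    # The JSONPath mirrors Status > State > Phase in the describe output.
    kubectl get vcjob "$2" -n "$1" -o jsonpath='{.status.state.phase}'
}

# Example call for the job in this section:
#   vcjob_phase mindx-test mindx-dls-npu-16p
```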

Cause Analysis

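The cluster does not have enough resources to run the job, and volcano-scheduler does not terminate a job that cannot be scheduled. The job therefore stays in the Pending status until enough resources become available or the job is deleted.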
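
The shortfall in this example can be checked by hand: the job requests 2 replicas x 8 NPUs = 16 NPUs, while the cluster offers 8. The sketch below shows that comparison in shell. The helper name npu_shortfall is hypothetical, and the kubectl query in the closing comment assumes that nodes report their NPUs under the allocatable resource name huawei.com/Ascend910, as the job's resource requests suggest.

```shell
#!/bin/sh
# Hypothetical helper: prints how many NPUs a job lacks (0 if it fits).
# Arguments: replicas, NPUs requested per replica, NPUs allocatable in the cluster.
npu_shortfall() {
    replicas=$1
    per_replica=$2
    allocatable=$3
    requested=$((replicas * per_replica))
    if [ "$requested" -gt "$allocatable" ]; then
        echo $((requested - allocatable))
    else
        echo 0
    fi
}

# Values from the example job: 2 replicas x 8 NPUs, 8 NPUs in the cluster.
npu_shortfall 2 8 8   # prints 8

# To list the allocatable NPUs per node on a live cluster (assumes the
# resource name huawei.com/Ascend910; the dot in the key must be escaped):
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,NPU:.status.allocatable.huawei\.com/Ascend910'
```

A non-zero result means the job can never be scheduled on the cluster as it stands, which matches the permanent Pending status described in the symptom.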

Solution

  1. Before submitting a job, ensure that the cluster has enough resources to satisfy the job's resource requests.
  2. If the job has already been submitted and is in the Pending status, run the kubectl get vcjob -n namespace command to find it.
    root@ubuntu:/home/yaml# kubectl get vcjob -n mindx-test
    NAME                AGE
    mindx-dls-npu-16p   6m10s
  3. Run the kubectl delete vcjob job-name -n namespace command to delete the pending job.
    root@ubuntu:/home/yaml# kubectl delete vcjob mindx-dls-npu-16p -n mindx-test
    job.batch.volcano.sh "mindx-dls-npu-16p" deleted
    • mindx-dls-npu-16p is the name of the vcjob to delete.
    • mindx-test is the namespace to which the job belongs.
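
The kubectl get vcjob output in step 2 shows only NAME and AGE, so in a namespace with many jobs it does not reveal which ones are pending. The sketch below filters the list by phase. The function name pending_vcjobs is hypothetical, and the field path .status.state.phase is assumed from the describe output in the Symptom section.

```shell
#!/bin/sh
# Hypothetical helper: prints the names of all vcjobs in a namespace
# whose phase is Pending.
# Argument: namespace.
pending_vcjobs() {
    # Emit "name<TAB>phase" per job, then keep only the Pending ones.
    kubectl get vcjob -n "$1" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state.phase}{"\n"}{end}' |
        awk -F'\t' '$2 == "Pending" { print $1 }'
}

# Example: review the list first, then delete each job as in step 3.
#   pending_vcjobs mindx-test
#   kubectl delete vcjob mindx-dls-npu-16p -n mindx-test
```

The helper only lists jobs and leaves deletion as a separate manual step. A job may be pending for a short time simply because resources are temporarily busy, and such a job should not be removed.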