A Job Is Pending Due to Insufficient Volcano Resources
Symptom
When Volcano schedules jobs, a job that requests more resources than are currently available in the environment is not scheduled and stays in the Pending state. If the request exceeds the total resources of the cluster, the job remains in the Pending state indefinitely.
The following example uses the job mindx-dls-npu-16p in the custom namespace mindx-test. The job requests 16 NPUs (2 replicas with 8 NPUs each), but the cluster has only 8 NPUs.
Run the following command to view the job details:
kubectl describe vcjob -n mindx-test mindx-dls-npu-16p
root@ubuntu:/home/yaml# kubectl describe vcjob -n mindx-test mindx-dls-npu-16p
Name: mindx-dls-npu-16p
Namespace: mindx-test
Labels: ring-controller.atlas=ascend-910
...
Min Available: 2
...
Replicas: 2
...
Resources:
  Limits:
    Cpu: 10
    huawei.com/Ascend910: 8
    Memory: 20Gi
  Requests:
    Cpu: 10
    huawei.com/Ascend910: 8
    Memory: 20Gi
...
Status:
  Controlled Resources:
    Plugin - Env: env
    Plugin - Ssh: ssh
    Plugin - Svc: svc
  Min Available: 2
  State:
    Last Transition Time: 2021-02-09T07:38:04Z
    Phase: Pending
Events: <none>
The Phase field under Status shows that the job is in the Pending state.
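The mismatch in this example is simple arithmetic: 2 replicas with 8 NPUs each is 16 NPUs, which is more than the 8 NPUs in the cluster. The following is a minimal sketch of that check, not part of the product. The function names total_requested, fits_cluster, and cluster_allocatable_npus are hypothetical, and cluster_allocatable_npus assumes that the nodes report the huawei.com/Ascend910 resource under status.allocatable. Note that allocatable is the node total, not the amount currently free, so a job that passes this check can still be pending while other jobs hold the NPUs.

```shell
#!/usr/bin/env bash
# Sketch: compare the NPUs a job requests with what the cluster can allocate.
# All function names here are illustrative, not part of Volcano or MindX DL.

# Total NPUs requested = replicas x NPUs per replica.
total_requested() {
  local replicas=$1 per_replica=$2
  echo $(( replicas * per_replica ))
}

# Succeeds (exit status 0) only if the request fits within the allocatable count.
fits_cluster() {
  local requested=$1 allocatable=$2
  [ "$requested" -le "$allocatable" ]
}

# Sums huawei.com/Ascend910 over status.allocatable of every node.
# Assumption: nodes expose the resource under this name.
# Requires kubectl access to the cluster; nothing calls this automatically.
cluster_allocatable_npus() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.allocatable.huawei\.com/Ascend910}{"\n"}{end}' \
    | awk '{ sum += $1 } END { print sum + 0 }'
}

# Example with the values from the job above (2 replicas x 8 NPUs, 8 in the cluster):
#   requested=$(total_requested 2 8)
#   fits_cluster "$requested" 8 || echo "job would stay Pending: needs $requested NPUs"
```

With the values from this example, total_requested 2 8 prints 16 and fits_cluster 16 8 fails, which matches the Pending state shown above.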
Possible Causes
The job requests more resources than are available, and volcano-scheduler does not terminate scheduling for the job, so the job stays in the Pending state.
Solution
- Before submitting a job, ensure that the cluster has sufficient resources for it.
- If the job has already been submitted and is in the Pending state, run the kubectl get vcjob -n namespace command to find it.
root@ubuntu:/home/yaml# kubectl get vcjob -n mindx-test
NAME                AGE
mindx-dls-npu-16p   6m10s
- Run the kubectl delete vcjob mindx-dls-npu-16p -n mindx-test command to delete the job.
root@ubuntu:/home/yaml# kubectl delete vcjob mindx-dls-npu-16p -n mindx-test
job.batch.volcano.sh "mindx-dls-npu-16p" deleted
  - mindx-dls-npu-16p is the name of the vcjob.
  - mindx-test is the namespace to which the job belongs.
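When a namespace holds many jobs, the pending ones can be picked out by phase instead of being read from the list by eye. The following is a minimal sketch under stated assumptions: the function names filter_pending and pending_vcjobs are hypothetical, and the phase is assumed to sit at status.state.phase, which corresponds to the Status, State, Phase fields in the describe output above.

```shell
#!/usr/bin/env bash
# Sketch: list the vcjobs in a namespace that are still in the Pending phase.
# Function names are illustrative, not part of Volcano or MindX DL.

# Reads "name<TAB>phase" lines on stdin and prints the names whose phase is Pending.
filter_pending() {
  awk -F'\t' '$2 == "Pending" { print $1 }'
}

# Prints the names of Pending vcjobs in the given namespace.
# Assumption: the job phase is exposed at .status.state.phase.
# Requires kubectl access to the cluster; nothing calls this automatically.
pending_vcjobs() {
  local namespace=$1
  kubectl get vcjob -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state.phase}{"\n"}{end}' \
    | filter_pending
}

# Example:
#   pending_vcjobs mindx-test
#   kubectl delete vcjob mindx-dls-npu-16p -n mindx-test
```

The sketch only lists job names. Deletion is left as a separate, explicit command so that each job is reviewed before it is removed.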