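When Volcano schedules a task, if the requested resources exceed what is currently available in the environment, the task is not scheduled and its status is set to Pending. If the requested resources exceed the total resources of the cluster, the task stays in the Pending state indefinitely.
The following uses the task "mindx-dls-npu-16p" in the custom namespace mindx-test as an example (the task requests 16 NPUs in a cluster that has only 8 NPUs).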
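The two cases described above can be sketched as a simple comparison. This is only an illustration of the rule, not how volcano-scheduler is implemented; the function name `classify_request` and its three arguments (requested, currently free, and total NPUs) are hypothetical.

```shell
# Hypothetical helper: classify an NPU request against the cluster.
# Arguments: requested NPUs, currently free NPUs, total NPUs in the cluster.
classify_request() {
    requested=$1; free=$2; total=$3
    if [ "$requested" -gt "$total" ]; then
        echo "pending-indefinitely"   # exceeds the cluster's total capacity
    elif [ "$requested" -gt "$free" ]; then
        echo "pending"                # exceeds what is currently available
    else
        echo "schedulable"
    fi
}

# The example task: 16 NPUs requested, 8 free, 8 in total.
classify_request 16 8 8
```

For the example task the function prints `pending-indefinitely`, because the request of 16 NPUs is larger than the 8 NPUs the cluster owns.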
View the task details:
kubectl describe vcjob -n mindx-test mindx-dls-npu-16p
root@ubuntu:/home/yaml# kubectl describe vcjob -n mindx-test mindx-dls-npu-16p
Name:         mindx-dls-npu-16p
Namespace:    mindx-test
Labels:       ring-controller.atlas=ascend-910
...
  Min Available:  2
...
    Replicas:  2
...
        Resources:
          Limits:
            Cpu:                   10
            huawei.com/Ascend910:  8
            Memory:                20Gi
          Requests:
            Cpu:                   10
            huawei.com/Ascend910:  8
            Memory:                20Gi
...
Status:
  Controlled Resources:
    Plugin - Env:  env
    Plugin - Ssh:  ssh
    Plugin - Svc:  svc
  Min Available:   2
  State:
    Last Transition Time:  2021-02-09T07:38:04Z
    Phase:                 Pending
Events:                    <none>
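The output shows that the task is in the Pending state.
The resources are insufficient, but volcano-scheduler does not terminate the task. It remains in the job list and can be deleted manually: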
root@ubuntu:/home/yaml# kubectl get vcjob -n mindx-test
NAME                AGE
mindx-dls-npu-16p   6m10s
root@ubuntu:/home/yaml# kubectl delete vcjob mindx-dls-npu-16p -n mindx-test
job.batch.volcano.sh "mindx-dls-npu-16p" deleted
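Before resubmitting the task, it can help to check how many NPUs the cluster actually offers. The following is a minimal sketch: the helper name `sum_npus` is hypothetical, and it assumes input lines of the form "node-name count", such as those produced by the `kubectl get nodes` command shown in the comment. The resource name huawei.com/Ascend910 is taken from the task details above.

```shell
# Hypothetical helper: sum the NPU column of "node-name count" lines.
# Non-numeric counts (for example "<none>" on nodes without NPUs) are skipped.
sum_npus() {
    awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Assumed usage against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NODE:.metadata.name,NPU:.status.allocatable.huawei\.com/Ascend910' \
#     | sum_npus
```

If the printed total is smaller than the number of NPUs the task requests, the task cannot be scheduled on this cluster. Note that the total reflects allocatable NPUs, not NPUs that are currently free.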