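Training jobs created through Volcano may not meet the requirements for resources, labels, models, and so on. When that happens, the vcjob or its pods remain in the Pending or Failed state.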
kubectl get vcjob --all-namespaces
root@ubuntu:~# kubectl get vcjob --all-namespaces
NAMESPACE   NAME                AGE
xxx-test    mindx-dls-npu-16p   42s
kubectl describe vcjob -n xxx-test mindx-dls-npu-16p
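Where xxx-test is the namespace of the vcjob and mindx-dls-npu-16p is its name, both taken from the output of the previous command: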
root@ubuntu:~# kubectl describe vcjob -n xxx-test mindx-dls-npu-16p
Name:         mindx-dls-npu-16p
Namespace:    xxx-test
Labels:       ring-controller.atlas=ascend-910
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"ring-controller.atlas":"ascend-910"},"name"...
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
...
Status:
  Min Available:  2
  State:
    Last Transition Time:  2021-03-09T02:43:04Z
    Phase:                 Failed
    Reason:                all nodes are unavailable: 1 node(s) resource fit failed, 1 task(mindx-dls-npu-16p-default-1p-1) in node(ubuntu-65-157):no matching label on this node key[host-arch] : task(huawei-x86) node(huawei-arm) conf(huawei-arm|huawei-x86).
  Version:                 3
Events:                    <none>
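As shown above, the vcjob (mindx-dls-npu-16p) has failed. The Reason field describes the cause of the failure. In this example, the task requests host-arch=huawei-x86, but node ubuntu-65-157 is labeled huawei-arm.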
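When several jobs need checking, the Phase and Reason fields can be filtered out of the describe output. The following is a minimal sketch, not part of kubectl or Volcano: the function name vcjob_failure_reason is hypothetical, and it assumes the Status section is laid out as in the output above.

```shell
# vcjob_failure_reason: hypothetical helper, not part of kubectl or Volcano.
# Reads `kubectl describe vcjob` output on stdin and prints only the
# Phase and Reason lines found inside the top-level Status section.
vcjob_failure_reason() {
    awk '
        /^Status:/ { in_status = 1; next }        # Status section starts here
        in_status && /^[^ \t]/ { in_status = 0 }  # next top-level key ends it
        in_status && /^[ \t]*(Phase|Reason):/ {
            sub(/^[ \t]+/, "")                    # strip the indentation
            print
        }
    '
}
```

It would be used as: kubectl describe vcjob -n xxx-test mindx-dls-npu-16p | vcjob_failure_reason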
kubectl get pod --all-namespaces
root@ubuntu:~# kubectl get pod --all-namespaces
NAMESPACE        NAME                                   READY   STATUS    RESTARTS   AGE
xxx-test         mindx-dls-npu-1-1p-default-1p-0        0/1     Pending   0          2m18s
xxx-test         mindx-dls-npu-1p-default-1p-0          1/1     Running   0          2m40s
xxx-test         mindx-dls-npu-2p-default-1p-0          1/1     Running   0          2m42s
xxx-test         mindx-dls-npu-4p-default-1p-0          1/1     Running   0          2m45s
xxx-test         mindx-dls-npu-8p-default-1p-0          1/1     Running   0          2m49s
...
npu-exporter     npu-exporter-jwq5l                     1/1     Running   0          9h
vcjob            mindx-dls-test-default-test-0          1/1     Running   0          5m3s
volcano-system   volcano-controllers-68769b787f-2wgw7   1/1     Running   0          15h
volcano-system   volcano-scheduler-768ddcd774-f9w5w     1/1     Running   0          15h
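In a long listing, the pods that need attention can be picked out by filtering on the STATUS column. The following is a rough sketch under stated assumptions: pods_not_running is a hypothetical helper name, and the column order is assumed to be the one produced by kubectl get pod --all-namespaces (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
# pods_not_running: hypothetical helper, not a kubectl subcommand.
# Reads `kubectl get pod --all-namespaces` output on stdin and prints
# "namespace name status" for every pod that is neither Running nor Completed.
# Assumes STATUS is the fourth column; the header line and short lines
# (such as an elided "...") are skipped.
pods_not_running() {
    awk 'NR > 1 && NF >= 4 && $4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}
```

It would be used as: kubectl get pod --all-namespaces | pods_not_running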
kubectl describe pod -n xxx-test mindx-dls-npu-1-1p-default-1p-0
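Where xxx-test is the namespace of the pod and mindx-dls-npu-1-1p-default-1p-0 is its name, both taken from the output of the previous command: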
root@ubuntu:~# kubectl describe pod -n xxx-test mindx-dls-npu-1-1p-default-1p-0
Name:         mindx-dls-npu-1-1p-default-1p-0
Namespace:    xxx-test
Priority:     0
Node:         <none>
Labels:       app=tf
              ring-controller.atlas=ascend-910
              volcano.sh/job-name=mindx-dls-npu-1-1p
              volcano.sh/job-namespace=xxx-test
Annotations:  scheduling.k8s.io/group-name: mindx-dls-npu-1-1p
              volcano.sh/job-name: mindx-dls-npu-1-1p
              volcano.sh/job-version: 0
              volcano.sh/task-spec: default-1p
Status:       Pending
...
Node-Selectors:  host-arch=huawei-x86
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  103s  volcano  all nodes are unavailable: 2 node(s) resource fit failed.
As shown above, the pod (mindx-dls-npu-1-1p-default-1p-0) is in the Pending state. The Events section describes why it is pending.
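The Node-Selectors line in the describe output shows that the pod requires a node labeled host-arch=huawei-x86, so one possible follow-up check is whether any node carries that label. The following is a sketch only: nodes_with_arch is a hypothetical helper name, and it assumes the label value appears as the sixth column of kubectl get nodes -L host-arch (NAME STATUS ROLES AGE VERSION HOST-ARCH).

```shell
# nodes_with_arch: hypothetical helper, not a kubectl subcommand.
# Reads `kubectl get nodes -L host-arch` output on stdin and prints the
# names of the nodes whose host-arch label equals the value passed as $1.
# Assumes the label value is the sixth column; nodes without the label
# leave that column empty and are therefore never printed.
nodes_with_arch() {
    awk -v want="$1" 'NR > 1 && $6 == want { print $1 }'
}
```

It would be used as: kubectl get nodes -L host-arch | nodes_with_arch huawei-x86. Empty output would suggest that no node satisfies the selector, which matches a label-mismatch scheduling failure; resource shortages have to be checked separately.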