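Volcano is mainly responsible for:
The overall analysis procedure for this problem is as follows:
In the figure, the problem is handled according to the following logic: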
kubectl get pod --all-namespaces
Example output:
root@ubuntu:/home/yaml# kubectl get pod --all-namespaces
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
...
kube-system   kube-scheduler-ubuntu       1/1     Running   184        12d
vcjob         mindx-dls-2p-default-2p-0   0/1     Pending   0          37s
...
Find the name of the Pod in the output of step 1. In this example it is mindx-dls-2p-default-2p-0, and its status is Pending.
...
vcjob         mindx-dls-2p-default-2p-0   0/1     Pending   0          3m43s
...
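On a cluster with many Pods, picking out the Pending rows by eye is error-prone. The following is a minimal sketch of a filter for the output of the command above. The helper name pending_pods is hypothetical, and it assumes the column layout shown in the example (STATUS is the fourth column when --all-namespaces is used).

```shell
# pending_pods (hypothetical helper): reads the table printed by
# `kubectl get pod --all-namespaces` on stdin and prints
# "<namespace>/<pod-name>" for every row whose STATUS column is Pending.
# Assumes the column order NAMESPACE NAME READY STATUS RESTARTS AGE.
pending_pods() {
  awk '$4 == "Pending" { print $1 "/" $2 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pod --all-namespaces | pending_pods
```

For the example output above, the filter would print the single entry vcjob/mindx-dls-2p-default-2p-0.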
kubectl get vcjob --all-namespaces
Example output is shown below. The vcjob in this example is named "mindx-dls-2p":
root@ubuntu:/home/yaml# kubectl get vcjob --all-namespaces
NAMESPACE   NAME           AGE
vcjob       mindx-dls-2p   77s
Using the vcjob name found in step 3 ("mindx-dls-2p" in this example), run the following command to view the details of the vcjob:
kubectl describe vcjob mindx-dls-2p -n vcjob
An example of the Events section of the vcjob is shown below.
root@ubuntu:/home/yaml# kubectl describe vcjob mindx-dls-2p -n vcjob
Name:         mindx-dls-2p
Namespace:    vcjob
Labels:       ring-controller.atlas=ascend-910
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"ring-controller.atlas":"ascend-910"},"name"...
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
...
Status:
  Controlled Resources:
    Plugin - Env:  env
    Plugin - Ssh:  ssh
    Plugin - Svc:  svc
  Min Available:   1
  Pending:         1
  State:
    Last Transition Time:  2020-12-23T21:22:04Z
    Phase:                 Pending
Events:                    <none>
If the Events section is empty, go to step 5 and check the details of the Pod that belongs to the vcjob.
Find the name of the Pod in the output of step 1. In this example it is mindx-dls-2p-default-2p-0, and its status is Pending.
kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob
root@ubuntu:/home/yaml# kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob
Name:           mindx-dls-2p-default-2p-0
Namespace:      vcjob
Priority:       0
Node:           <none>
Labels:         app=tf
                ring-controller.atlas=ascend-910
                volcano.sh/job-name=mindx-dls-2p
                volcano.sh/job-namespace=vcjob
Annotations:    scheduling.k8s.io/group-name: mindx-dls-2p
                volcano.sh/job-name: mindx-dls-2p
                volcano.sh/job-version: 0
                volcano.sh/task-spec: default-2p
Status:         Pending
IP:
IPs:            <none>
Controlled By:  Job/mindx-dls-2p
...
Node-Selectors:  host-arch=huawei-arm
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  5m3s  volcano  all nodes are unavailable: 2 selector(host-arch) not equal: task(huawei-arm) node(huawei-x86) conf(huawei-arm|huawei-x86) .
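If the problem still cannot be identified, go to step 6 and check the logs to find the cause.
In this example, the event already states the problem: the node selector of the job does not match the configuration, so the job cannot be scheduled.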
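The Events table is the part of the describe output that matters for this diagnosis. The following is a rough sketch of a filter that keeps only the Warning rows. The helper name warning_events is hypothetical, and the filter assumes the layout shown in the example output (an "Events:" line followed by rows whose first field is the event type).

```shell
# warning_events (hypothetical helper): reads `kubectl describe` output on
# stdin and prints only the Warning rows of the Events table, with leading
# whitespace removed. Prints nothing when the output shows "Events: <none>".
warning_events() {
  awk '
    $1 == "Events:"               { in_events = 1; next }   # start of the Events table
    in_events && $1 == "Warning"  { sub(/^[ \t]+/, ""); print }
  '
}

# Usage against a live cluster (requires kubectl access):
#   kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob | warning_events
#
# To compare the Pod's node selector with the labels actually set on the
# nodes, the host-arch label can be shown as an extra column:
#   kubectl get nodes -L host-arch
```

In the example above, the Pod requests host-arch=huawei-arm while the event message reports nodes labeled huawei-x86, so listing the host-arch label for each node is a quick way to confirm the mismatch.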
kubectl get pod --all-namespaces -o wide
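Find the node on which volcano-scheduler is running and inspect its log output to troubleshoot further. The section "Training Job Is in the Pending State, Cause: nodes are unavailable" describes how to handle some common problems.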
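Locating the scheduler Pod in the wide listing can be scripted. The sketch below is illustrative only: the helper name scheduler_location is hypothetical, and it assumes that the Pod name starts with "volcano-scheduler" and that NODE is the eighth column of the `-o wide` output, which may not hold for every kubectl version.

```shell
# scheduler_location (hypothetical helper): reads the table printed by
# `kubectl get pod --all-namespaces -o wide` on stdin and prints
# "<namespace> <pod-name> <node>" for each volcano-scheduler Pod.
# Assumes the column order NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE ...
scheduler_location() {
  awk '$2 ~ /^volcano-scheduler/ { print $1, $2, $8 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pod --all-namespaces -o wide | scheduler_location
#
# With the namespace and Pod name printed above, the scheduler log can be
# read with (placeholders must be filled in by hand):
#   kubectl logs -n <namespace> <pod-name>
```

Reading the log through kubectl logs avoids having to log in to the node itself, although the node name is still useful when the log has to be collected from the host.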