Fault Locating
Volcano is responsible for:
- Managing the life cycle of vcjob and the corresponding pod.
- Allocating pod resources and scheduling pods.
The following figure shows how to locate the fault.

The fault handling logic shown in the figure is as follows:
1. Check whether the pod exists.
kubectl get pod --all-namespaces
- If the pod does not exist, go to step 3 to check whether the vcjob exists.
- If the pod exists, go to step 2 to check the pod status.
The following is an example:
root@ubuntu:/home/yaml# kubectl get pod --all-namespaces
NAMESPACE     NAME                        READY   STATUS    RESTARTS   AGE
...
kube-system   kube-scheduler-ubuntu       1/1     Running   184        12d
vcjob         mindx-dls-2p-default-2p-0   0/1     Pending   0          37s
...
2. Check the pod status.
Find the pod in the output of step 1. In this example, the pod name is mindx-dls-2p-default-2p-0 and its status is Pending.
...
vcjob         mindx-dls-2p-default-2p-0   0/1     Pending   0          3m43s
...
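When the cluster runs many pods, scanning the full listing for Pending entries is tedious. The following is a minimal sketch of a filter for that listing. The function name pending_pods is hypothetical, and the sketch assumes the default column layout shown above, where STATUS is the fourth column.

```shell
# Hypothetical helper, not part of Volcano or kubectl.
# Reads the output of `kubectl get pod --all-namespaces` on stdin and prints
# "namespace/name status" for every pod whose STATUS column is Pending.
# Assumes the default table layout: NAMESPACE NAME READY STATUS RESTARTS AGE.
pending_pods() {
  awk 'NR > 1 && $4 == "Pending" { print $1 "/" $2, $4 }'
}

# Usage (requires access to the cluster):
#   kubectl get pod --all-namespaces | pending_pods
```

kubectl can apply a similar filter itself with --field-selector=status.phase=Pending, although that option filters on the pod phase rather than on the text of the STATUS column.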
3. Run the following command to check whether the vcjob exists:
kubectl get vcjob --all-namespaces
In this example, the vcjob name is mindx-dls-2p.
root@ubuntu:/home/yaml# kubectl get vcjob --all-namespaces
NAMESPACE   NAME           AGE
vcjob       mindx-dls-2p   77s
- If the vcjob does not exist, the job has not been created. In this case the problem is not caused by Volcano; check whether the job was delivered correctly.
- If the vcjob exists, go to step 4 to view the vcjob details.
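The branch above can also be checked from a script. The sketch below relies on kubectl get returning a non-zero exit status when the named resource is not found. The function name vcjob_exists is hypothetical, and the namespace and job name in the usage comment are the ones from this example.

```shell
# Hypothetical helper, not part of Volcano or kubectl.
# Succeeds (exit status 0) if the named vcjob exists in the given namespace.
#   $1 = namespace, $2 = vcjob name
vcjob_exists() {
  kubectl get vcjob "$2" -n "$1" >/dev/null 2>&1
}

# Usage (requires access to the cluster):
#   if vcjob_exists vcjob mindx-dls-2p; then
#     echo "vcjob exists: view its details"
#   else
#     echo "vcjob not created: check job delivery"
#   fi
```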
4. View the vcjob details.
Using the vcjob name found in step 3 (mindx-dls-2p in this example), run the following command to view the details of the vcjob:
kubectl describe vcjob mindx-dls-2p -n vcjob
The following example shows the output, including the Events section of the vcjob:
root@ubuntu:/home/yaml# kubectl describe vcjob mindx-dls-2p -n vcjob
Name:         mindx-dls-2p
Namespace:    vcjob
Labels:       ring-controller.atlas=ascend-910
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"ring-controller.atlas":"ascend-910"},"name"...
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
...
Status:
  Controlled Resources:
    Plugin - Env:  env
    Plugin - Ssh:  ssh
    Plugin - Svc:  svc
  Min Available:   1
  Pending:         1
  State:
    Last Transition Time:  2020-12-23T21:22:04Z
    Phase:                 Pending
Events:                    <none>
If no event details are displayed, go to step 5 to view the details of the pod that corresponds to the vcjob.
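Because the Events section sits at the end of a long describe output, it can help to print only that part. The following is a minimal sketch; the function name events_section is hypothetical, and it assumes that the section starts with a line beginning with "Events:", as in the output above.

```shell
# Hypothetical helper, not part of Volcano or kubectl.
# Reads `kubectl describe ...` output on stdin and prints everything from the
# line that starts with "Events:" to the end of the output.
events_section() {
  awk '/^Events:/ { found = 1 } found'
}

# Usage (requires access to the cluster):
#   kubectl describe vcjob mindx-dls-2p -n vcjob | events_section
```

The same filter works for the output of kubectl describe pod.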
5. View the pod details.
Using the pod name found in step 1 (mindx-dls-2p-default-2p-0 in this example, in the Pending state), run the following command:
kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob
The following example shows the output, including the Events section of the pod that corresponds to the vcjob:
root@ubuntu:/home/yaml# kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob
Name:           mindx-dls-2p-default-2p-0
Namespace:      vcjob
Priority:       0
Node:           <none>
Labels:         app=tf
                ring-controller.atlas=ascend-910
                volcano.sh/job-name=mindx-dls-2p
                volcano.sh/job-namespace=vcjob
Annotations:    scheduling.k8s.io/group-name: mindx-dls-2p
                volcano.sh/job-name: mindx-dls-2p
                volcano.sh/job-version: 0
                volcano.sh/task-spec: default-2p
Status:         Pending
IP:
IPs:            <none>
Controlled By:  Job/mindx-dls-2p
...
Node-Selectors:  host-arch=huawei-arm
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  5m3s  volcano  all nodes are unavailable: 2 selector(host-arch) not equal: task(huawei-arm) node(huawei-x86) conf(huawei-arm|huawei-x86) .
In this example, the event indicates that the selector of the job (host-arch=huawei-arm) is inconsistent with the node configuration (huawei-x86). As a result, the allocation fails.
If the fault still cannot be located, go to step 6 to view the logs.
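For a selector mismatch like the one in this example, comparing the host-arch label on each node with the value the job requests can confirm the cause. The following is a minimal sketch. The function name nodes_without_arch is hypothetical, and it assumes the listing comes from kubectl get nodes -L host-arch, which appends the label value as the last column.

```shell
# Hypothetical helper, not part of Volcano or kubectl.
# Reads `kubectl get nodes -L host-arch` output on stdin and prints the name of
# every node whose last column (the host-arch label) differs from $1.
# A node without the label is also printed, because its last column is then
# the VERSION value, which does not match the expected label value.
nodes_without_arch() {
  awk -v want="$1" 'NR > 1 && $NF != want { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get nodes -L host-arch | nodes_without_arch huawei-arm
```

Any node printed by this filter cannot satisfy a job whose node selector is host-arch=huawei-arm.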
6. View the Volcano logs.
Run the following command to find the node on which volcano-scheduler is running:
kubectl get pod --all-namespaces -o wide
Then view the information printed by volcano-scheduler on that node to further locate the fault. For details about how to troubleshoot common faults, see Training Job Is in the Pending State Because "nodes are unavailable".
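Finding the scheduler in a long wide listing can be scripted. The following is a minimal sketch. The function name scheduler_location is hypothetical, and it assumes that the pod name starts with volcano-scheduler and that the listing uses the default -o wide layout, in which NODE is the eighth column. If the RESTARTS column contains extra text, the column positions shift and the sketch needs adjusting.

```shell
# Hypothetical helper, not part of Volcano or kubectl.
# Reads `kubectl get pod --all-namespaces -o wide` output on stdin and prints
# "namespace pod-name node" for each pod whose name starts with
# volcano-scheduler.
# Assumes columns: NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE ...
scheduler_location() {
  awk 'NR > 1 && $2 ~ /^volcano-scheduler/ { print $1, $2, $8 }'
}

# Usage (requires access to the cluster):
#   kubectl get pod --all-namespaces -o wide | scheduler_location
```

With the namespace and pod name from this output, kubectl logs -n <namespace> <pod-name> prints the scheduler logs.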