Fault Locating

Volcano is responsible for:

- Managing the life cycle of a vcjob and its corresponding pods.
- Allocating pod resources and scheduling pods.

The following figure shows how to locate the fault.

In this figure, the problem handling logic is as follows:

1. Check whether the pod exists.

kubectl get pod --all-namespaces

- If the pod does not exist, go to step 3 to check whether the vcjob exists.
- If the pod exists, go to step 2 to check the pod status.

The following is an example:

root@ubuntu:/home/yaml# kubectl get pod --all-namespaces
NAMESPACE        NAME                                       READY   STATUS      RESTARTS   AGE
...
kube-system      kube-scheduler-ubuntu                      1/1     Running     184        12d
vcjob            mindx-dls-2p-default-2p-0                  0/1     Pending     0          37s
...
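
On a busy cluster the listing can be long, so it may help to filter it down to the pods that need attention. The following is a minimal sketch, not part of Volcano or kubectl: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout shown above (STATUS in the fourth column).

```shell
# Print namespace, pod name, and status for every pod whose status
# is neither Running nor Completed.
# Input (stdin): the table printed by `kubectl get pod --all-namespaces`.
# Assumption: default columns NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
    # NR > 1 skips the header row; NF >= 4 skips blank or truncated lines.
    awk 'NR > 1 && NF >= 4 && $4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}
```

For example, `kubectl get pod --all-namespaces | unhealthy_pods` would print one line per pod that is not Running or Completed.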

2. Check the pod status.
Find the pod in the output of step 1. In this example, the pod name is mindx-dls-2p-default-2p-0 and its status is Pending.
```
...
vcjob            mindx-dls-2p-default-2p-0                  0/1     Pending     0          3m43s
...
```
- If the pod status is Pending, resources or scheduling conditions are generally not met. Go to step 6 to view the detailed cause in the Volcano logs, or go to step 5 to view the pod details.
3. Check whether the vcjob exists by running the following command:
```
kubectl get vcjob --all-namespaces
```
In the following example, the vcjob name is mindx-dls-2p.
```
root@ubuntu:/home/yaml# kubectl get vcjob --all-namespaces
NAMESPACE   NAME           AGE
vcjob       mindx-dls-2p   77s
```
- If the vcjob does not exist, the job has not been created. This problem is not caused by Volcano. Check whether the job was delivered correctly.
- If the vcjob exists, go to step 4 to view the vcjob details.
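
The existence check above can also be scripted. The following is a small sketch under stated assumptions: `vcjob_exists` is a hypothetical helper name, and it assumes the three-column layout (NAMESPACE, NAME, AGE) shown in the example.

```shell
# Exit with status 0 if a vcjob with the given name appears in the listing,
# and with status 1 otherwise.
# Input (stdin): the table printed by `kubectl get vcjob --all-namespaces`.
# Argument $1: the vcjob name to look for (exact match on the NAME column).
vcjob_exists() {
    awk -v name="$1" 'NR > 1 && $2 == name { found = 1 } END { exit !found }'
}
```

For example, `kubectl get vcjob --all-namespaces | vcjob_exists mindx-dls-2p && echo "vcjob found"` prints the message only when the job is listed.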

4. View the vcjob details.

Based on the vcjob name found in step 3 (mindx-dls-2p in this example), run the following command to view the vcjob details:

kubectl describe vcjob mindx-dls-2p -n vcjob

The following example shows the command output, including the Events part of the vcjob:

root@ubuntu:/home/yaml# kubectl describe vcjob mindx-dls-2p -n vcjob
Name:         mindx-dls-2p
Namespace:    vcjob
Labels:       ring-controller.atlas=ascend-910
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"ring-controller.atlas":"ascend-910"},"name"...
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
...
Status:
  Controlled Resources:
    Plugin - Env:  env
    Plugin - Ssh:  ssh
    Plugin - Svc:  svc
  Min Available:   1
  Pending:         1
  State:
    Last Transition Time:  2020-12-23T21:22:04Z
    Phase:                 Pending
Events:                    <none>

If no events are displayed, go to step 5 to view details about the pod that corresponds to the vcjob.
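
The two fields that matter in this step, the phase and whether any events were recorded, can be pulled out of the describe output with a short filter. This is an illustrative sketch only: `vcjob_summary` is a hypothetical helper name, and it assumes the field names `Phase:` and `Events:` appear as in the example output above.

```shell
# Summarize the phase of a vcjob and whether events were recorded.
# Input (stdin): the text printed by `kubectl describe vcjob <name> -n <namespace>`.
# Assumption: the first "Phase:" line is the job state, and an empty event
# list is printed as "Events: <none>".
vcjob_summary() {
    awk '
        $1 == "Phase:" && phase == "" { phase = $2 }
        $1 == "Events:" { events = ($2 == "<none>") ? "none" : "present" }
        END { printf "phase=%s events=%s\n", phase, events }
    '
}
```

For example, `kubectl describe vcjob mindx-dls-2p -n vcjob | vcjob_summary` prints a single summary line. If the output contains no `Events:` line at all, the `events=` value is left empty.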

5. View the pod details.

Find the pod name in the output of step 1. In this example, the pod name is mindx-dls-2p-default-2p-0 and its status is Pending. Run the following command to view the pod details:

kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob

The following example shows the command output, including the Events part of the pod that belongs to the vcjob:

root@ubuntu:/home/yaml# kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob
Name:           mindx-dls-2p-default-2p-0
Namespace:      vcjob
Priority:       0
Node:           <none>
Labels:         app=tf
                ring-controller.atlas=ascend-910
                volcano.sh/job-name=mindx-dls-2p
                volcano.sh/job-namespace=vcjob
Annotations:    scheduling.k8s.io/group-name: mindx-dls-2p
                volcano.sh/job-name: mindx-dls-2p
                volcano.sh/job-version: 0
                volcano.sh/task-spec: default-2p
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  Job/mindx-dls-2p
...
Node-Selectors:  host-arch=huawei-arm
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  5m3s  volcano  all nodes are unavailable: 2 selector(host-arch) not equal: task(huawei-arm) node(huawei-x86) conf(huawei-arm|huawei-x86) .

In this example, the event indicates that the host-arch selector of the task (huawei-arm) does not match the value on the node (huawei-x86). As a result, no node is available and the allocation fails.

If the fault still cannot be located, go to step 6 to view the Volcano logs.
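
Because the scheduling failure is reported as a Warning event, it can be convenient to print only those lines from the describe output. The following sketch assumes the Events table layout shown in the example above. `warning_events` is a hypothetical helper name.

```shell
# Print the Warning rows of the Events table, with leading spaces removed.
# Input (stdin): the text printed by `kubectl describe pod <name> -n <namespace>`.
# Assumption: events follow a line that starts with "Events:" and each row
# begins with its type (Normal or Warning).
warning_events() {
    awk '
        $1 == "Events:" { in_events = 1; next }
        in_events && $1 == "Warning" { sub(/^[ \t]+/, ""); print }
    '
}
```

For example, `kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob | warning_events` would show only the FailedScheduling row from the output above.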

6. View Volcano logs.
Run the following command to list all pods together with the nodes they run on:
```
kubectl get pod --all-namespaces -o wide
```
Locate the node on which volcano-scheduler runs and view its log output to further locate the fault. For details about how to troubleshoot common faults, see Training Job Is in the Pending State Because "nodes are unavailable".
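
Finding the scheduler pod and its node in the wide listing can be sketched as follows. `scheduler_location` is a hypothetical helper name. The sketch assumes that the scheduler pod name starts with `volcano-scheduler` and that the RESTARTS column holds a plain number (newer kubectl versions may append text such as a restart time, which would shift the columns).

```shell
# Print namespace, pod name, and node of each volcano-scheduler pod.
# Input (stdin): the table printed by `kubectl get pod --all-namespaces -o wide`.
# The NODE column is located from the header row rather than hard-coded.
scheduler_location() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "NODE" && !col) { col = i }
            }
            next
        }
        index($2, "volcano-scheduler") == 1 { print $1, $2, (col ? $col : "") }
    '
}
```

For example, `kubectl get pod --all-namespaces -o wide | scheduler_location` prints the namespace, pod name, and node. The logs can then be read with `kubectl logs <pod name> -n <namespace>`, using the values printed by the helper.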
