Fault Locating

Volcano is responsible for:

  • Managing the life cycle of vcjob and the corresponding pod.
  • Allocating pod resources and scheduling pods.

The following figure shows how to locate the fault.

In this figure, the problem handling logic is as follows:

  1. Check whether the pod exists.
    kubectl get pod --all-namespaces
    • If the pod does not exist, go to step 3 to check whether the vcjob exists.
    • If the pod exists, go to step 2 to check the pod status.

    The following is an example:

    root@ubuntu:/home/yaml# kubectl get pod --all-namespaces
    NAMESPACE        NAME                                       READY   STATUS      RESTARTS   AGE
    ...
    kube-system      kube-scheduler-ubuntu                      1/1     Running     184        12d
    vcjob            mindx-dls-2p-default-2p-0                  0/1     Pending     0          37s
    ...
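
    On a busy cluster the full pod list can be long. The following is a minimal sketch of a filter that keeps only pods whose STATUS column is neither Running nor Completed. The function name filter_unhealthy_pods is hypothetical, and the sketch assumes the default column layout shown in the example above.

```shell
# Hypothetical helper: reads `kubectl get pod --all-namespaces` output on stdin
# and prints "namespace/name status" for each pod that is not Running or Completed.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
filter_unhealthy_pods() {
    # NR > 1 skips the header line; NF >= 6 skips blank or truncated lines.
    awk 'NR > 1 && NF >= 6 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}
```

    Usage (requires cluster access): kubectl get pod --all-namespaces | filter_unhealthy_pods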
  2. Check the pod status.

    Find the pod name in the output of step 1. In this example, the pod name is mindx-dls-2p-default-2p-0 and its status is Pending.

    ...
    vcjob            mindx-dls-2p-default-2p-0                  0/1     Pending     0          3m43s
    ...
    • If the pod status is Pending, resources or scheduling conditions are generally not met. Go to step 5 to view the pod details, or go to step 6 to view the detailed cause in the Volcano logs.
  3. Run the following command to check whether vcjob exists:
    kubectl get vcjob --all-namespaces

    In the following example, the vcjob name is mindx-dls-2p.

    root@ubuntu:/home/yaml# kubectl get vcjob --all-namespaces
    NAMESPACE   NAME           AGE
    vcjob       mindx-dls-2p   77s
    • If the vcjob does not exist, the job was not created. In this case the problem is not caused by Volcano; check whether the job was delivered correctly.
    • If the vcjob exists, go to step 4 to view the vcjob details.
  4. View the vcjob details.

    Using the vcjob name found in step 3 (mindx-dls-2p in this example), run the following command to view the vcjob details:

    kubectl describe vcjob mindx-dls-2p -n vcjob

    The following example output includes the Events section of the vcjob:

    root@ubuntu:/home/yaml# kubectl describe vcjob mindx-dls-2p -n vcjob
    Name:         mindx-dls-2p
    Namespace:    vcjob
    Labels:       ring-controller.atlas=ascend-910
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"ring-controller.atlas":"ascend-910"},"name"...
    API Version:  batch.volcano.sh/v1alpha1
    Kind:         Job
    ...
    Status:
      Controlled Resources:
        Plugin - Env:  env
        Plugin - Ssh:  ssh
        Plugin - Svc:  svc
      Min Available:   1
      Pending:         1
      State:
        Last Transition Time:  2020-12-23T21:22:04Z
        Phase:                 Pending
    Events:                    <none>

    If no event details are displayed, go to step 5 to view details about the pod that corresponds to the vcjob.
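
    When only the job phase is of interest, it can be read directly instead of scanning the full describe output. The following is a minimal sketch: the function name vcjob_phase is hypothetical, and the JSONPath .status.state.phase is assumed to correspond to the Status, State, Phase fields shown in the example above.

```shell
# Hypothetical helper: print the phase of a vcjob (for example, Pending).
# $1 = vcjob name, $2 = namespace.
# Assumption: the phase is stored at .status.state.phase, matching the
# Status > State > Phase fields in the `kubectl describe vcjob` output.
vcjob_phase() {
    kubectl get vcjob "$1" -n "$2" -o jsonpath='{.status.state.phase}'
}
```

    Usage (requires cluster access): vcjob_phase mindx-dls-2p vcjob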

  5. View pod details.

    Find the pod name in the output of step 1. In this example, the pod name is mindx-dls-2p-default-2p-0 and its status is Pending.

    kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob
    The following example output includes the Events section of the pod that belongs to the vcjob:
    root@ubuntu:/home/yaml# kubectl describe pod mindx-dls-2p-default-2p-0 -n vcjob
    Name:           mindx-dls-2p-default-2p-0
    Namespace:      vcjob
    Priority:       0
    Node:           <none>
    Labels:         app=tf
                    ring-controller.atlas=ascend-910
                    volcano.sh/job-name=mindx-dls-2p
                    volcano.sh/job-namespace=vcjob
    Annotations:    scheduling.k8s.io/group-name: mindx-dls-2p
                    volcano.sh/job-name: mindx-dls-2p
                    volcano.sh/job-version: 0
                    volcano.sh/task-spec: default-2p
    Status:         Pending
    IP:             
    IPs:            <none>
    Controlled By:  Job/mindx-dls-2p
    ...
    Node-Selectors:  host-arch=huawei-arm
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
      Type     Reason            Age   From     Message
      ----     ------            ----  ----     -------
      Warning  FailedScheduling  5m3s  volcano  all nodes are unavailable: 2 selector(host-arch) not equal: task(huawei-arm) node(huawei-x86) conf(huawei-arm|huawei-x86) .

    In this example, the event indicates that the host-arch selector of the task (huawei-arm) does not match the label of the node (huawei-x86). As a result, no node can be allocated and scheduling fails.

    If the fault cannot be located from the events, go to step 6 to view the Volcano logs.
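
    The Warning events of a pod can also be listed on their own, without the rest of the describe output. The following is a minimal sketch using the involvedObject.name and type field selectors of kubectl get events; the function name pod_warning_events is hypothetical.

```shell
# Hypothetical helper: list only Warning events that refer to one pod.
# $1 = pod name, $2 = namespace.
pod_warning_events() {
    kubectl get events -n "$2" \
        --field-selector "involvedObject.name=$1,type=Warning"
}
```

    Usage (requires cluster access): pod_warning_events mindx-dls-2p-default-2p-0 vcjob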

  6. View Volcano logs.
    kubectl get pod --all-namespaces -o wide

    In the output, find the volcano-scheduler pod and the node it runs on, then view the logs of that pod to further locate the fault. For details about how to troubleshoot common faults, see Training Job Is in the Pending State Because "nodes are unavailable".
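
    The log lookup in this step can be sketched as a single helper. This is an illustration under stated assumptions, not the documented procedure: the function name volcano_scheduler_logs is hypothetical, the scheduler pod name is assumed to start with volcano-scheduler, and the 200-line tail is an arbitrary choice.

```shell
# Hypothetical helper: find the volcano-scheduler pod in any namespace
# and print the last lines of its log.
# The body runs in a subshell so its variables do not leak into the caller.
volcano_scheduler_logs() (
    # Assumption: the scheduler pod name starts with "volcano-scheduler".
    ref=$(kubectl get pod --all-namespaces --no-headers |
        awk '$2 ~ /^volcano-scheduler/ { print $1, $2; exit }')
    if [ -z "$ref" ]; then
        echo "no volcano-scheduler pod found" >&2
        exit 1
    fi
    ns=${ref%% *}    # first field: namespace
    pod=${ref#* }    # second field: pod name
    # --tail=200 is an arbitrary limit chosen for this sketch.
    kubectl logs "$pod" -n "$ns" --tail=200
)
```

    Usage (requires cluster access): volcano_scheduler_logs | grep mindx-dls-2p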