Pod Status Cannot Be Queried When Volcano v1.7.0 Is Used

Symptom

When Volcano v1.7.0 is used and the environment does not have enough resources, the pod of a job is not listed when you run the kubectl get pod --all-namespaces -o wide command, so its status cannot be queried.

Cause Analysis

When Volcano v1.7.0 is used and resources are insufficient, the pod is not created, so there is no pod whose status can be queried. Check the pod group of the job instead.

Solution

  1. Run the following command to query all pod group information and find the pod group corresponding to the job:
    kubectl get pg -A
    Command output:
    NAMESPACE   NAME                                                  STATUS    MINMEMBER   RUNNINGS   AGE
    vcjob       mindx-xxx-16-p-4bf232e4-bd48-438d-9089-02bfef354fce   Inqueue   1                      5m32s
    vcjob       mindx-xxx-2-p-8bf7f0f6-8a7e-4621-a0d0-cafa56785914    Pending   1                      5m15s
    • If the value of STATUS is Inqueue, the pod has been created and its status can be queried.
    • If the value of STATUS is Pending, the pod has not been created. In this case, go to step 2 to locate the fault.
  2. Run the following command to query the details of the Pending pod group:
    kubectl describe pg -n <namespace> <podgroup-name>

    Replace <namespace> and <podgroup-name> with the actual namespace and pod group name.

    Example:
    kubectl describe pg -n vcjob mindx-xxx-2-p-8bf7f0f6-8a7e-4621-a0d0-cafa56785914
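    Information similar to the following is displayed. The Reason field under Status (NotEnoughResources) and the last entry under Events (queue resource quota insufficient) indicate that the queue resource quota is insufficient.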
    Name:         mindx-xxx-2-p-8bf7f0f6-8a7e-4621-a0d0-cafa56785914
    Namespace:    vcjob
    Labels:       fault-scheduling=force
                  ring-controller.atlas=ascend-{xxx}b
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"fault-scheduling":"force","ring-controller....
    API Version:  scheduling.volcano.sh/v1beta1
    Kind:         PodGroup
    Metadata:
      Creation Timestamp:  2023-07-05T09:00:02Z
      Generation:          7
      Owner References:
        API Version:           batch.volcano.sh/v1alpha1
        Block Owner Deletion:  true
        Controller:            true
        Kind:                  Job
        Name:                  mindx-xxx-2-p
        UID:                   8bf7f0f6-8a7e-4621-a0d0-cafa56785914
      Resource Version:        17544644
      Self Link:               /apis/scheduling.volcano.sh/v1beta1/namespaces/vcjob/podgroups/mindx-xxx-2-p-8bf7f0f6-8a7e-4621-a0d0-cafa56785914
      UID:                     277cc974-5eec-455f-a860-25d7d19e8335
    Spec:
      Min Member:  1
      Min Resources:
        count/pods:                     1
        huawei.com/Ascend910:           2
        Pods:                           1
        requests.huawei.com/Ascend910:  2
      Min Task Member:
        Default - Test:  1
      Queue:             default
    Status:
      Conditions:
        Last Transition Time:  2023-07-05T09:05:46Z
        Message:               1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
        Reason:                NotEnoughResources
        Status:                True
        Transition ID:         33585c5e-d3ad-4bc4-be0c-c09bea59520e
        Type:                  Unschedulable
      Phase:                   Pending
    Events:
      Type     Reason         Age                     From     Message
      ----     ------         ----                    ----     -------
      Warning  Unschedulable  6m22s (x12 over 6m34s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
      Normal   Unschedulable  93s (x280 over 6m34s)   volcano  queue resource quota insufficient   # Insufficient queue resource quota
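
The two steps above can be combined into a small shell helper. This is a minimal sketch, not part of Volcano: the function names pending_podgroups and describe_pending_podgroups are hypothetical, and the filter assumes that STATUS is the third column of the kubectl get pg -A output, as in the sample output in step 1.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pg -A` output on stdin and prints
# "<namespace> <name>" for every pod group whose STATUS column is Pending.
# Assumes STATUS is the third column and the first line is the header.
pending_podgroups() {
    awk 'NR > 1 && $3 == "Pending" { print $1, $2 }'
}

# Hypothetical helper: describes every Pending pod group in the cluster.
# Requires kubectl access to a cluster where Volcano is installed.
describe_pending_podgroups() {
    kubectl get pg -A | pending_podgroups | while read -r ns name; do
        # Redirect stdin so kubectl cannot consume the list being read.
        kubectl describe pg -n "$ns" "$name" < /dev/null
    done
}
```

After sourcing the file, run describe_pending_podgroups and check the Events section of each pod group for a message such as queue resource quota insufficient.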