vcjob Is Not Started Properly

Symptom

When you run the kubectl get pod -n vcjob command to obtain the pod information in the cluster, the system displays "No resources found in vcjob namespace", indicating that no pod has been created for the job. When you then run the kubectl get event -n vcjob command to obtain the event information in the cluster, the message "0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable" is displayed.

Cause Analysis

After a job is created, if its pod is not in the Running state, locate the cause based on the pod state:

  • Pod not created: Check the acjob or vcjob description and the volcano-controller or Ascend Operator logs to determine whether non-compliant fields in the job YAML definition prevented the pod from being created.
  • Pending pod: Check the pod group description to identify the cause. Typically, pods remain in the Pending state because of nodeSelector, tor-affinity, or NotEnoughResources. If the issue is related to nodeSelector or tor-affinity, review the affinity settings. If the issue is caused by NotEnoughResources, follow the solution below.
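
The two cases above can be told apart from the output of kubectl get pod. The following is a minimal sketch, not part of the product tooling: the function name triage_pods is hypothetical, and it assumes the default kubectl table layout (a header line followed by rows whose third column is STATUS).

```shell
# Hypothetical helper: classify the output of `kubectl get pod -n <namespace>`
# read from stdin. Assumes the default table layout with a header line
# (NAME READY STATUS RESTARTS AGE); do not combine with --no-headers.
# Prints one of: not-created, pending, other.
triage_pods() {
    awk '
        # Skip the first line (header or "No resources found" message)
        # and count the remaining rows that look like pod entries.
        NR > 1 && NF >= 3 { rows++; if ($3 == "Pending") pending++ }
        END {
            if (rows == 0)        print "not-created"
            else if (pending > 0) print "pending"
            else                  print "other"
        }
    '
}

# Example usage (requires access to the cluster):
#   kubectl get pod -n vcjob 2>/dev/null | triage_pods
```

A result of not-created points to the job description and controller logs, and a result of pending points to the pod group description.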

Solution

  1. Query the status of all pod groups. A pod group that has not started is in the Inqueue or Pending state.
    kubectl get pg -n vcjob
    Command output:
    NAME                                                  STATUS    MINMEMBER   RUNNINGS   AGE
    mindx-xxx-16-p-4bf232e4-bd48-438d-9089-02bfef354fce   Inqueue   1                      37d
    mindx-dl-deviceinfo-worker-1                          Pending   2                      88m
  2. Query details about a pod group.
    kubectl describe pg -n <namespace> <podgroup-name>

    Replace <namespace> and <podgroup-name> with the actual namespace and pod group name.

    The following is an example command, where mindx-dl-deviceinfo-worker-1 is the pod group name queried in Step 1.
    kubectl describe pg -n vcjob mindx-dl-deviceinfo-worker-1
    The following is an example of the command output, where the last event (marked with a comment) indicates insufficient queue resource quotas.
    Name:         mindx-dl-deviceinfo-worker-1
    Namespace:    vcjob
    Labels:       fault-scheduling=force
                  ring-controller.atlas=ascend-{xxx}b
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"fault-scheduling":"force","ring-controller....
    API Version:  scheduling.volcano.sh/v1beta1
    Kind:         PodGroup
    Metadata:
      Creation Timestamp:  2023-07-05T09:00:02Z
      Generation:          7
      Owner References:
        API Version:           batch.volcano.sh/v1alpha1
        Block Owner Deletion:  true
        Controller:            true
        Kind:                  Job
        Name:                  mindx-xxx-2-p
        UID:                   worker-1
      Resource Version:        17544644
      Self Link:               /apis/scheduling.volcano.sh/v1beta1/namespaces/vcjob/mindx-dl-deviceinfo-worker-1
      UID:                     277cc974-5eec-455f-a860-25d7d19e8335
    Spec:
      Min Member:  1
      Min Resources:
        count/pods:                     1
        huawei.com/Ascend910:           2
        Pods:                           1
        requests.huawei.com/Ascend910:  2
      Min Task Member:
        Default - Test:  1
      Queue:             default
    Status:
      Conditions:
        Last Transition Time:  2023-07-05T09:05:46Z
        Message:               1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
        Reason:                NotEnoughResources
        Status:                True
        Transition ID:         33585c5e-d3ad-4bc4-be0c-c09bea59520e
        Type:                  Unschedulable
      Phase:                   Pending
    Events:
      Type     Reason         Age                     From     Message
      ----     ------         ----                    ----     -------
      Warning  Unschedulable  6m22s (x12 over 6m34s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
      Normal   Unschedulable  93s (x280 over 6m34s)   volcano  queue resource quota insufficient   # Insufficient queue resource quotas
  3. View details about nodes in the Kubernetes cluster. If the Capacity and Allocatable fields in the node details contain information about Ascend AI processors, Ascend Device Plugin has reported the processor information to Kubernetes and is operating normally.
    kubectl describe node {Node_name_in_a_Kubernetes_cluster}

    In this example, the value of huawei.com/Ascend910 under Allocatable in the command output is 7 for the node.

  4. Run the npu-smi info command to confirm that no running task is displayed.

  5. List the ConfigMaps in all namespaces of the cluster.
    kubectl get cm -A

    Command output:

  6. Query details about the mindx-dl-deviceinfo-worker-1 ConfigMap.
    kubectl describe cm -n kube-system mindx-dl-deviceinfo-worker-1

    The information "NPU device-4 CardUnhealthy" is displayed, accompanied by the error code 0x80CD8008.

  7. Query the error code details by referring to the following table.
    Table 1 Fault details

    EventID:                           0x80CD8008
    Level-1 Module:                    Processor fault
    Level-2 Module:                    L2BUFF
    Notification Type:                 Event
    Event Name:                        L2BUFF multi-bit ECC error
    Fault Description/Possible Cause:  Soft errors occur on the chip SRAM, causing the L2BUFF multi-bit error.
    Impact:                            The system stops responding. Data is incorrect or a consistency error occurs.
    System Action:                     1. Report the event to the device.
                                       2. Record the error log.
  8. Restart the server.
  9. Check whether the server runs properly after the restart. If it does not (the ConfigMap status shows "unhealthy Ascend910-4"), check whether the card is functional.
    • If the card is normal, edit the ConfigMap and delete the abnormal information recorded under ManuallySeparateNPU.
      kubectl edit cm -n kube-system mindx-dl-deviceinfo-worker-1

      Deleting the abnormal information resolves the problem.

    • If the card is faulty, contact Huawei technical support.
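
Steps 1, 2, and 6 rely on reading command output by eye. The following sketch shows one way to pull out the relevant values with standard text tools. The function names pending_pgs, pg_reasons, and fault_codes are hypothetical. The sketch assumes the output layouts shown in the examples above and assumes that fault codes appear as 0x followed by eight hexadecimal digits, as in 0x80CD8008.

```shell
# Hypothetical helpers; each one reads command output from stdin.

# Step 1: print the names of pod groups whose STATUS column is Pending.
# Assumes the `kubectl get pg` table layout shown in Step 1 (NAME, STATUS, ...).
pending_pgs() {
    awk 'NR > 1 && $2 == "Pending" { print $1 }'
}

# Step 2: print the value of every "Reason:" field in the output of
# `kubectl describe pg`, for example NotEnoughResources.
pg_reasons() {
    awk '$1 == "Reason:" { print $2 }'
}

# Step 6: print the distinct fault codes found in the ConfigMap description.
# Assumption: codes are written as 0x plus eight hex digits.
fault_codes() {
    grep -oE '0x[0-9A-Fa-f]{8}' | sort -u
}

# Example usage (requires access to the cluster):
#   kubectl get pg -n vcjob | pending_pgs
#   kubectl describe pg -n vcjob mindx-dl-deviceinfo-worker-1 | pg_reasons
#   kubectl describe cm -n kube-system mindx-dl-deviceinfo-worker-1 | fault_codes
```

The codes printed by fault_codes can then be looked up in Table 1.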