vcjob Is Not Started Properly

Symptom

When you run the kubectl get pod -n vcjob command to obtain the pod information in the cluster, the system displays "No resources found in vcjob namespace", indicating that the pod is not started. Then, when you run the kubectl get event -n vcjob command to obtain the event information in the cluster, the message "0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable" is displayed.

Cause Analysis

After a job is created, if its pod is not in the running state, perform the following operations:

- Pod not created: Check the acjob or vcjob description and the volcano-controller or Ascend Operator logs to determine whether pod creation failed because of non-compliant fields in the job YAML definition.
- Pending pod: Check the pod group description to identify the cause. Pods typically remain in the Pending state because of nodeSelector, tor-affinity, or NotEnoughResources. If the cause is nodeSelector or tor-affinity, review the affinity settings. If the cause is NotEnoughResources, follow the solution below.

Solution

Query the status of all pod groups. The pod group of a job whose pods have not started is in the Inqueue or Pending state.

```
kubectl get pg -n vcjob
```

Command output:

```
NAME                                                  STATUS    MINMEMBER   RUNNINGS   AGE
mindx-xxx-16-p-4bf232e4-bd48-438d-9089-02bfef354fce   Inqueue   1                      37d
mindx-dl-deviceinfo-worker-1                          Pending   2                      88m
```
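When the namespace contains many pod groups, the listing can be filtered down to the ones that are stuck. The following is a minimal sketch, not part of the product tooling: the function name `stuck_podgroups` is hypothetical, and it assumes the column layout shown above, where the first column is NAME and the second is STATUS.

```shell
#!/bin/sh
# Hypothetical helper: print the name and status of every pod group
# whose STATUS column is Inqueue or Pending.
# Input: the table printed by `kubectl get pg`, read from stdin.
stuck_podgroups() {
  awk 'NR > 1 && ($2 == "Inqueue" || $2 == "Pending") { print $1 "\t" $2 }'
}

# Example (requires access to the cluster):
#   kubectl get pg -n vcjob | stuck_podgroups
```

Each printed name can then be passed to kubectl describe pg to find out why the pod group is not scheduled.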

Query details about a pod group.

```
kubectl describe pg -n <namespace> <podgroup-name>
```

Replace <namespace> and <podgroup-name> with the actual namespace and pod group name.

The following is an example command, where mindx-dl-deviceinfo-worker-1 is the pod group name returned by the kubectl get pg command above.

```
kubectl describe pg -n vcjob mindx-dl-deviceinfo-worker-1
```

The following is an example of the command output. The last event, marked with a comment, indicates insufficient queue resource quotas.

```
Name:         mindx-dl-deviceinfo-worker-1
Namespace:    vcjob
Labels:       fault-scheduling=force
              ring-controller.atlas=ascend-{xxx}b
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"fault-scheduling":"force","ring-controller....
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2023-07-05T09:00:02Z
  Generation:          7
  Owner References:
    API Version:           batch.volcano.sh/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  mindx-xxx-2-p
    UID:                   worker-1
  Resource Version:        17544644
  Self Link:               /apis/scheduling.volcano.sh/v1beta1/namespaces/vcjob/mindx-dl-deviceinfo-worker-1
  UID:                     277cc974-5eec-455f-a860-25d7d19e8335
Spec:
  Min Member:  1
  Min Resources:
    count/pods:                     1
    huawei.com/Ascend910:           2
    Pods:                           1
    requests.huawei.com/Ascend910:  2
  Min Task Member:
    Default - Test:  1
  Queue:             default
Status:
  Conditions:
    Last Transition Time:  2023-07-05T09:05:46Z
    Message:               1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         33585c5e-d3ad-4bc4-be0c-c09bea59520e
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                     From     Message
  ----     ------         ----                    ----     -------
  Warning  Unschedulable  6m22s (x12 over 6m34s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
  Normal   Unschedulable  93s (x280 over 6m34s)   volcano  queue resource quota insufficient   # Insufficient queue resource quotas
```
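The two details that matter in this output are the condition reason and the quota event. The sketch below pulls them out of the description text. It is only an illustration under stated assumptions: the function names `pg_condition` and `quota_insufficient` are hypothetical, and the parsing assumes the field layout shown in the example above, where Message is printed before Reason within each condition.

```shell
#!/bin/sh
# Hypothetical helper: print "<Reason>: <Message>" for each condition
# found in `kubectl describe pg` output read from stdin.
pg_condition() {
  awk '
    $1 == "Message:" { sub(/^[ \t]*Message:[ \t]*/, ""); msg = $0 }
    $1 == "Reason:"  { print $2 ": " msg }
  '
}

# Hypothetical helper: succeed (exit status 0) if the description on stdin
# contains the event text for an insufficient queue resource quota.
quota_insufficient() {
  grep -q 'queue resource quota insufficient'
}

# Example (requires access to the cluster):
#   kubectl describe pg -n vcjob mindx-dl-deviceinfo-worker-1 | pg_condition
```

For the example output, `pg_condition` would report the NotEnoughResources reason together with its message, which is the case handled by the remaining steps.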

View details about the nodes in the Kubernetes cluster. If the Capacity and Allocatable fields in the node details contain information about Ascend AI processors, Ascend Device Plugin is running properly and is reporting processor information to Kubernetes.
```
kubectl describe node {Node_name_in_a_Kubernetes_cluster}
```
In this example, the command output shows that the value of huawei.com/Ascend910 under Allocatable is 7 for the node.
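To compare the Capacity and Allocatable values without reading the whole node description, the output can be filtered. The following is a minimal sketch: the function name `npu_counts` is hypothetical, and it assumes the usual kubectl describe node layout, in which Capacity: and Allocatable: are unindented section headers followed by indented resource lines.

```shell
#!/bin/sh
# Hypothetical helper: print the Capacity and Allocatable values of one
# resource from `kubectl describe node` output read from stdin.
# $1 is the resource name; it defaults to huawei.com/Ascend910.
npu_counts() {
  awk -v res="${1:-huawei.com/Ascend910}" '
    /^Capacity:/    { section = "Capacity"; next }
    /^Allocatable:/ { section = "Allocatable"; next }
    /^[^ ]/         { section = "" }   # any other top-level field ends the section
    section != "" && $1 == (res ":") { print section "=" $2 }
  '
}

# Example (requires access to the cluster; replace the placeholder):
#   kubectl describe node {Node_name_in_a_Kubernetes_cluster} | npu_counts
```

An Allocatable value lower than the Capacity value suggests that at least one processor is not available for scheduling.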
Run the npu-smi info command on the node and confirm that no task is running on the processors.
List the ConfigMaps in all namespaces of the cluster.
```
kubectl get cm -A
```
View the details of the device information ConfigMap of the node.
```
kubectl describe cm -n kube-system mindx-dl-deviceinfo-worker-1
```
The information "NPU device-4 CardUnhealthy" is displayed, accompanied by the error code 0x80CD8008.
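If the ConfigMap description is long, the error codes can be extracted before they are looked up. This is a sketch under assumptions: the function name `fault_codes` is hypothetical, the codes are assumed to appear in the 0x-prefixed, eight-digit hexadecimal form shown above, and `grep -o` is assumed to be available (it is in GNU, BSD, and BusyBox grep).

```shell
#!/bin/sh
# Hypothetical helper: print each distinct 0x-prefixed 8-digit hexadecimal
# code found in the text read from stdin.
fault_codes() {
  grep -oE '0x[0-9A-Fa-f]{8}' | sort -u
}

# Example (requires access to the cluster):
#   kubectl describe cm -n kube-system mindx-dl-deviceinfo-worker-1 | fault_codes
```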

Query the error code details by referring to the following table.

**Table 1** Fault details

| EventID | Level-1 Module | Level-2 Module | Notification Type | Event Name | Fault Description/Possible Cause | Impact | System Action |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0x80CD8008 | Processor fault | L2BUFF | Event | L2BUFF multi-bit ECC error | Soft errors occur on the chip SRAM, causing the L2BUFF multi-bit error. | The system stops responding. Data is incorrect or a consistency error occurs. | Report the event to the device. Record the error log. |

Restart the server.
Check whether the server runs properly after the restart. If it does not (the ConfigMap status shows "unhealthy Ascend910-4"), check whether the card is functional.
- If the card is normal, edit the ConfigMap and delete the abnormal information ManuallySeparateNPU.
  ```
  kubectl edit cm -n kube-system mindx-dl-deviceinfo-worker-1
  ```
  After the abnormal information is deleted, the problem can be resolved.
- If the card is faulty, contact Huawei technical support.
