vcjob Is Not Started Properly
Symptom
When you run the kubectl get pod -n vcjob command to obtain the pod information in the cluster, the system displays "No resources found in vcjob namespace", indicating that the pod is not started. Then, when you run the kubectl get event -n vcjob command to obtain the event information in the cluster, the message "0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable" is displayed.
Cause Analysis
After a job is created, if its pod is not in the running state, perform the following operations:
- Pod not created: Check the acjob or vcjob description and the volcano-controller or Ascend Operator logs to determine whether pod creation failed because of non-compliant fields in the job YAML definition.
- Pending pod: Check the pod group description to identify the cause. Typically, pods remain in the pending state because of nodeSelector, tor-affinity, or NotEnoughResources. If the issue is related to nodeSelector or tor-affinity, evaluate the affinity settings. For cases caused by NotEnoughResources, follow the provided solution.
Solution
- Query the status of all pod groups, which can be Inqueue or Pending.
kubectl get pg -n vcjob
Command output:
NAME                                                  STATUS    MINMEMBER   RUNNINGS   AGE
mindx-xxx-16-p-4bf232e4-bd48-438d-9089-02bfef354fce   Inqueue   1                      37d
mindx-dl-deviceinfo-worker-1                          Pending   2                      88m
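To pick out the pod groups that need a closer look, the listing above can be filtered by its STATUS column. The following is a minimal sketch rather than part of the product tooling: the function name pending_podgroups is hypothetical, and it assumes the column layout shown in the example output (NAME first, STATUS second).

```shell
#!/bin/sh
# Hypothetical helper: print the names of pod groups whose STATUS is Pending.
# Reads `kubectl get pg` output on stdin.
# Assumption: NAME is column 1 and STATUS is column 2, as in the example output.
pending_podgroups() {
    # NR > 1 skips the header row.
    awk 'NR > 1 && $2 == "Pending" { print $1 }'
}

# Example usage (requires access to the cluster):
#   kubectl get pg -n vcjob | pending_podgroups
```

Each name printed this way can then be passed to kubectl describe pg in the next step.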
- Query details about a pod group.
kubectl describe pg -n <namespace> <podgroup-name>
Replace <namespace> and <podgroup-name> with the actual namespace and pod group name.
The following is an example command, where mindx-dl-deviceinfo-worker-1 is the pod group name queried in Step 1.
kubectl describe pg -n vcjob mindx-dl-deviceinfo-worker-1
The following is an example of the command output, where the last event line (marked with a comment) indicates insufficient queue resource quotas.
Name:         mindx-dl-deviceinfo-worker-1
Namespace:    vcjob
Labels:       fault-scheduling=force
              ring-controller.atlas=ascend-{xxx}b
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"fault-scheduling":"force","ring-controller....
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2023-07-05T09:00:02Z
  Generation:          7
  Owner References:
    API Version:           batch.volcano.sh/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  mindx-xxx-2-p
    UID:                   worker-1
  Resource Version:        17544644
  Self Link:               /apis/scheduling.volcano.sh/v1beta1/namespaces/vcjob/mindx-dl-deviceinfo-worker-1
  UID:                     277cc974-5eec-455f-a860-25d7d19e8335
Spec:
  Min Member:  1
  Min Resources:
    count/pods:                     1
    huawei.com/Ascend910:           2
    Pods:                           1
    requests.huawei.com/Ascend910:  2
  Min Task Member:
    Default - Test:  1
  Queue:             default
Status:
  Conditions:
    Last Transition Time:  2023-07-05T09:05:46Z
    Message:               1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         33585c5e-d3ad-4bc4-be0c-c09bea59520e
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                     From     Message
  ----     ------         ----                    ----     -------
  Warning  Unschedulable  6m22s (x12 over 6m34s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
  Normal   Unschedulable  93s (x280 over 6m34s)   volcano  queue resource quota insufficient    # Insufficient queue resource quotas
- View details about nodes in a Kubernetes cluster. If the Capacity and Allocatable fields in the node details contain information about Ascend AI processors, Ascend Device Plugin is reporting processor information to Kubernetes and is operating normally.
kubectl describe node {Node_name_in_a_Kubernetes_cluster}
In the example output, the value of huawei.com/Ascend910 under Allocatable is 7 for a node.
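The Allocatable value can also be read out of the node description with a small filter. This is a sketch under stated assumptions: the function name allocatable_ascend910 is hypothetical, and it assumes the usual kubectl describe layout in which Allocatable: is a top-level key followed by indented resource lines.

```shell
#!/bin/sh
# Hypothetical helper: print the huawei.com/Ascend910 count listed under
# "Allocatable:" in `kubectl describe node` output read from stdin.
# Assumption: top-level keys start in column 1 and their entries are indented.
allocatable_ascend910() {
    awk '
        /^Allocatable:/ { in_alloc = 1; next }   # enter the Allocatable section
        /^[^ \t]/       { in_alloc = 0 }         # any other top-level key ends it
        in_alloc && $1 == "huawei.com/Ascend910:" { print $2 }
    '
}

# Example usage (requires access to the cluster):
#   kubectl describe node {Node_name_in_a_Kubernetes_cluster} | allocatable_ascend910
```

If the printed value is lower than the number of processors installed on the node, as with the value 7 in the example, one processor may not be available for scheduling, which the following steps investigate.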

- Run the npu-smi info command to confirm that no running task is displayed.

- List the ConfigMaps in all namespaces of the cluster.
kubectl get cm -A
Command output:

- View the details of the device information ConfigMap (mindx-dl-deviceinfo-worker-1 in this example).
kubectl describe cm -n kube-system mindx-dl-deviceinfo-worker-1
The information "NPU device-4 CardUnhealthy" is displayed, accompanied by the error code 0x80CD8008.
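When the ConfigMap description is long, the error codes can be pulled out with a text filter before looking them up. The sketch below is illustrative only: the function name extract_fault_codes is hypothetical, and it assumes the codes appear in the 0x-prefixed, 8-digit hexadecimal form quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct error code of the form 0xXXXXXXXX
# (for example 0x80CD8008) found in the text read from stdin.
# Assumption: codes are written with a 0x prefix followed by 8 hex digits.
extract_fault_codes() {
    grep -oE '0x[0-9A-Fa-f]{8}' | sort -u
}

# Example usage (requires access to the cluster):
#   kubectl describe cm -n kube-system mindx-dl-deviceinfo-worker-1 | extract_fault_codes
```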

- Query the error code details by referring to the following table.
Table 1 Fault details
EventID: 0x80CD8008
Level-1 Module: Processor fault
Level-2 Module: L2BUFF
Notification Type: Event
Event Name: L2BUFF multi-bit ECC error
Fault Description/Possible Cause: Soft errors occur on the chip SRAM, causing the L2BUFF multi-bit error.
Impact: The system stops responding. Data is incorrect or a consistency error occurs.
System Action: Report the event to the device. Record the error log.
- Restart the server.
- Check whether the server runs properly after the restart. If it does not (the ConfigMap status shows "unhealthy Ascend910-4"), check whether the card is functional.
- If the card is normal, edit the ConfigMap and delete the abnormal information ManuallySeparateNPU.
kubectl edit cm -n kube-system mindx-dl-deviceinfo-worker-1
After the abnormal information is deleted, the problem can be resolved.

- If the card is faulty, contact Huawei technical support.