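When kubectl get pod -n vcjob is run to query the Pods in the cluster, no Pod has been started and the output is: No resources found in vcjob namespace. Running kubectl get event -n vcjob to query the cluster events then reports: 0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
After a job is created, if the job's Pod is not in the Running state, troubleshoot according to the following 2 cases.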
kubectl get pg -n vcjob
NAME                                                  STATUS    MINMEMBER   RUNNINGS   AGE
mindx-xxx-16-p-4bf232e4-bd48-438d-9089-02bfef354fce   Inqueue   1                      37d
mindx-dl-deviceinfo-worker-1                          Pending   2                      88m
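When many PodGroups are listed, picking out the ones in a given state by eye is error-prone. The following is a minimal sketch, not part of the original procedure: podgroups_with_status is a hypothetical helper name, and it assumes the column layout shown above (NAME first, STATUS second).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl or Volcano).
# Reads `kubectl get pg` output on stdin and prints the names of the
# PodGroups whose STATUS column equals the status given as $1.
# Assumes NAME is column 1 and STATUS is column 2, as in the output above.
podgroups_with_status() {
  local status="$1"
  # NR > 1 skips the header row.
  awk -v s="$status" 'NR > 1 && $2 == s { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get pg -n vcjob | podgroups_with_status Pending
```

Each name printed can then be passed to kubectl describe pg for the detailed check.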
kubectl describe pg -n <namespace> <podgroup-name>
Replace <namespace> and <podgroup-name> with the actual namespace and PodGroup name.
kubectl describe pg -n vcjob mindx-dl-deviceinfo-worker-1
Name:         mindx-dl-deviceinfo-worker-1
Namespace:    vcjob
Labels:       fault-scheduling=force
              ring-controller.atlas=ascend-{xxx}b
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"batch.volcano.sh/v1alpha1","kind":"Job","metadata":{"annotations":{},"labels":{"fault-scheduling":"force","ring-controller....
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2023-07-05T09:00:02Z
  Generation:          7
  Owner References:
    API Version:           batch.volcano.sh/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  mindx-xxx-2-p
    UID:                   worker-1
  Resource Version:        17544644
  Self Link:               /apis/scheduling.volcano.sh/v1beta1/namespaces/vcjob/mindx-dl-deviceinfo-worker-1
  UID:                     277cc974-5eec-455f-a860-25d7d19e8335
Spec:
  Min Member:  1
  Min Resources:
    count/pods:                     1
    huawei.com/Ascend910:           2
    Pods:                           1
    requests.huawei.com/Ascend910:  2
  Min Task Member:
    Default - Test:  1
  Queue:             default
Status:
  Conditions:
    Last Transition Time:  2023-07-05T09:05:46Z
    Message:               1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         33585c5e-d3ad-4bc4-be0c-c09bea59520e
    Type:                  Unschedulable
  Phase:                   Pending
Events:
  Type     Reason         Age                     From     Message
  ----     ------         ----                    ----     -------
  Warning  Unschedulable  6m22s (x12 over 6m34s)  volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
  Normal   Unschedulable  93s (x280 over 6m34s)   volcano  queue resource quota insufficient    # the queue's resource quota is insufficient
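The lines that matter in this output are the Unschedulable entries in the Events section. As a rough sketch, they can be extracted with a small filter. unschedulable_events is a hypothetical helper name, and the filter assumes the usual kubectl describe layout, in which Events: starts a line and the second column of each event row is the Reason.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl or Volcano).
# Reads `kubectl describe pg` output on stdin and prints only the rows of
# the Events section whose Reason column is "Unschedulable".
unschedulable_events() {
  awk '
    /^Events:/                          { in_events = 1; next }  # the Events section starts here
    in_events && $2 == "Unschedulable"  { print }                # keep matching event rows
  '
}

# Usage (requires access to the cluster):
#   kubectl describe pg -n vcjob mindx-dl-deviceinfo-worker-1 | unschedulable_events
```

A message such as queue resource quota insufficient in the filtered rows points to a resource shortage, which is what the node check that follows investigates.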
kubectl describe node <node-name>
Replace <node-name> with the name of the node in K8s.
The command output indicates that on one node the Allocatable value of huawei.com/Ascend910 is 7.
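To see whether a node is offering fewer NPUs than it physically has, the Capacity and Allocatable values of the resource can be read side by side. The sketch below assumes the standard kubectl describe node layout, with unindented Capacity: and Allocatable: section headers followed by indented resource lines. resource_capacity_vs_allocatable is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of kubectl).
# Reads `kubectl describe node` output on stdin and prints the Capacity and
# Allocatable values of one resource (default: huawei.com/Ascend910).
resource_capacity_vs_allocatable() {
  local resource="${1:-huawei.com/Ascend910}"
  awk -v res="${resource}:" '
    /^Capacity:/    { section = "Capacity";    next }
    /^Allocatable:/ { section = "Allocatable"; next }
    /^[^ ]/         { section = "" }            # any other top-level header ends the section
    section != "" && $1 == res { print section "=" $2 }
  '
}

# Usage (requires access to the cluster; <node-name> is a placeholder):
#   kubectl describe node <node-name> | resource_capacity_vs_allocatable
```

If the Allocatable value is lower than the Capacity value, some devices on that node are not available for scheduling, and the device information reported for the node is worth inspecting.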
kubectl get cm -A
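The command output lists the ConfigMaps in all namespaces; locate the device information ConfigMap of the node (mindx-dl-deviceinfo-worker-1 in this example).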
kubectl describe cm -n kube-system mindx-dl-deviceinfo-worker-1
The information shows that NPU device-4 is in the CardUnhealthy state, with error code 0x80CD8008.
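EventID | Level-1 module | Level-2 module | Notification type | Fault event name | Fault description/possible cause | Fault impact | Fault self-handling mode |
---|---|---|---|---|---|---|---|
0x80CD8008 | Chip fault | L2BUFF | Fault event | L2BUFF multi-bit ECC error | A soft failure of the on-chip SRAM causes an L2BUFF multi-bit error. | The system stops responding, or data errors or consistency errors may occur. | 1. Report the fault event to the device 2. Record the error log |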
kubectl edit cm -n kube-system mindx-dl-deviceinfo-worker-1
Delete the abnormal information from the ConfigMap. After it is deleted, the problem is resolved.