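An 8-Card Job Stays Pending Although Allocatable.huawei.com/Ascend910 in the Node Information Shows 8 Chips
Symptom
Running kubectl describe node {node name} to view the node information shows that Allocatable.huawei.com/Ascend910 reports 8 chips. However, after an 8-card job is delivered, the job stays in the Pending state.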
Capacity:
  cpu:                   72
  ephemeral-storage:     1843598940Ki
  huawei.com/Ascend910:  8
  hugepages-1Gi:         0
  hugepages-2Mi:         0
  memory:                659447564Ki
  pods:                  110
Allocatable:
  cpu:                   72
  ephemeral-storage:     1699060780291
  huawei.com/Ascend910:  8
  hugepages-1Gi:         0
  hugepages-2Mi:         0
  memory:                659345164Ki
  pods:                  110
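The same value can also be read programmatically. The following is a minimal Python sketch, assuming the Node object has been exported as JSON (for example with kubectl get node {node name} -o json). The function name allocatable_count and the sample data are illustrative only and are not part of any Ascend component.

```python
import json


def allocatable_count(node: dict, resource: str = "huawei.com/Ascend910") -> int:
    """Return the allocatable quantity of an extended resource from a Node object."""
    # Kubernetes reports quantities as strings under status.allocatable, e.g. "8".
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(resource, "0"))


# Trimmed stand-in for the JSON form of the Node object (hypothetical sample).
sample_node = json.loads(
    '{"status": {"allocatable": {"cpu": "72", "huawei.com/Ascend910": "8"}}}'
)
print(allocatable_count(sample_node))
```

Note that this number only reflects what the node reports to Kubernetes. As this topic shows, it can be larger than the number of chips that are actually schedulable.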
Possible Cause
A public fault that Ascend Device Plugin is not aware of may exist on the node.
Solution
- Run the following command to query the ConfigMaps.
kubectl get cm -A | grep cluster-info
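Example output:
kube-public   cluster-info            1   19d
mindx-dl      cluster-info-device-0   1   19h
mindx-dl      cluster-info-node-cm    1   19h
- Run the following command to view the details of the cluster-info-device-0 ConfigMap and obtain the chips available on the node.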
kubectl describe cm -n mindx-dl cluster-info-device-0
Example output:
Name:         cluster-info-device-0
Namespace:    mindx-dl
Labels:       mx-consumer-volcano=true
Annotations:  <none>
Data
====
cluster-info-device-0:
----
{"mindx-dl-deviceinfo-localhost.localdomain":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-0\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-1\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-2\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}}]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":"Ascend910-0,Ascend910-1,Ascend910-2"},"UpdateTime":1759214666,"CmName":"mindx-dl-deviceinfo-localhost.localdomain","SuperPodID":-2,"ServerIndex":-2},"mindx-dl-deviceinfo-node173":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1759202968,"CmName":"mindx-dl-deviceinfo-node173","SuperPodID":-2,"ServerIndex":-2}}
Events:  <none>
The output shows that the available chips on this node (node name: localhost.localdomain) are Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7. Ascend910-0, Ascend910-1, and Ascend910-2 are listed under huawei.com/Ascend910-Unhealthy, each with a PublicFault entry (fault code 220001001, fault level SeparateNPU). Only five chips are therefore available on the node, so an 8-card job cannot be scheduled to it and stays in the Pending state.
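Reading the single-line JSON by eye is error-prone, so the check can be sketched in Python. This is a minimal sketch, not an official tool: it assumes the value stored under the cluster-info-device-0 data key has the structure shown in the example output, and the names summarize_devices and can_fit, as well as the trimmed sample data, are hypothetical.

```python
import json

RESOURCE = "huawei.com/Ascend910"
PREFIX = "mindx-dl-deviceinfo-"  # per-node key prefix seen in the example output


def summarize_devices(raw: str, resource: str = RESOURCE) -> dict:
    """Map each node to its available chips, unhealthy chips, and reported faults."""
    summary = {}
    for cm_name, info in json.loads(raw).items():
        devices = info.get("DeviceList", {})
        # Chip lists are comma-separated strings; an empty string means no chips.
        available = [d for d in devices.get(resource, "").split(",") if d]
        unhealthy = [d for d in devices.get(resource + "-Unhealthy", "").split(",") if d]
        # The fault list is itself a JSON document stored as a string.
        faults = json.loads(devices.get(resource + "-Fault") or "[]")
        node = cm_name[len(PREFIX):] if cm_name.startswith(PREFIX) else cm_name
        summary[node] = {
            "available": available,
            "unhealthy": unhealthy,
            "faults": [
                (f.get("npu_name"), f.get("fault_type"), f.get("fault_code"))
                for f in faults
            ],
        }
    return summary


def can_fit(summary: dict, node: str, requested: int) -> bool:
    """Return True if the node has at least `requested` available chips."""
    return len(summary[node]["available"]) >= requested


# Trimmed, hypothetical sample modelled on the example output above.
sample = json.dumps({
    PREFIX + "localhost.localdomain": {
        "DeviceList": {
            RESOURCE: "Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7",
            RESOURCE + "-Fault": json.dumps([
                {"fault_type": "PublicFault", "npu_name": "Ascend910-0",
                 "fault_code": "220001001"},
            ]),
            RESOURCE + "-Unhealthy": "Ascend910-0,Ascend910-1,Ascend910-2",
        }
    }
})

result = summarize_devices(sample)
print(result["localhost.localdomain"]["available"])
print(can_fit(result, "localhost.localdomain", 8))
```

In practice, the raw string could be taken from the data field of the ConfigMap exported with kubectl get cm -n mindx-dl cluster-info-device-0 -o json, assuming the data key is cluster-info-device-0 as in the example output.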