
Allocatable.huawei.com/Ascend910 in the Node information shows 8 chips, but an 8-card job stays in the Pending state

Symptom

Running kubectl describe node {node name} shows that Allocatable.huawei.com/Ascend910 reports 8 chips, yet a job that requests 8 cards remains in the Pending state after it is submitted.

Capacity:
  cpu:                   72
  ephemeral-storage:     1843598940Ki
  huawei.com/Ascend910:  8
  hugepages-1Gi:         0
  hugepages-2Mi:         0
  memory:                659447564Ki
  pods:                  110
Allocatable:
  cpu:                   72
  ephemeral-storage:     1699060780291
  huawei.com/Ascend910:  8
  hugepages-1Gi:         0
  hugepages-2Mi:         0
  memory:                659345164Ki
  pods:                  110
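
When several nodes have to be checked, the Allocatable value can be pulled out of the describe output with a small script. The following is a minimal sketch, not part of the original procedure: the function name allocatable_count is hypothetical, and it assumes the output keeps the layout shown above (section headers at column 0, entries indented beneath them).

```python
from typing import Optional


def allocatable_count(describe_output: str,
                      resource: str = "huawei.com/Ascend910") -> Optional[int]:
    """Return the Allocatable value of `resource`, or None if it is absent."""
    in_allocatable = False
    for line in describe_output.splitlines():
        # Unindented lines (and blank lines) start or end a section.
        if not line.startswith(" "):
            in_allocatable = line.strip() == "Allocatable:"
            continue
        if in_allocatable:
            key, _, value = line.strip().partition(":")
            if key == resource:
                return int(value.strip())
    return None


# Trimmed copy of the output shown above.
sample_describe = """Capacity:
  cpu:                   72
  huawei.com/Ascend910:  8
Allocatable:
  cpu:                   72
  huawei.com/Ascend910:  8
  pods:                  110
"""

print(allocatable_count(sample_describe))
```

The value returned here is what the node reports as allocatable; it does not say whether every chip is actually usable for scheduling.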

Cause Analysis

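The node may have a public fault that Ascend Device Plugin is not aware of. In that case the Node information can still report 8 allocatable chips even though some of them are marked unhealthy and are not available for scheduling.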

Solution

  1. Run the following command to query the ConfigMaps.
    kubectl get cm -A | grep cluster-info

    Example output:

    kube-public            cluster-info                                           1      19d
    mindx-dl               cluster-info-device-0                                  1      19h
    mindx-dl               cluster-info-node-cm                                   1      19h
  2. Run the following command to view the details of the ConfigMap and obtain the chips that are available on the node.
    kubectl describe cm -n mindx-dl cluster-info-device-0

    Example output:

    Name:         cluster-info-device-0
    Namespace:    mindx-dl
    Labels:       mx-consumer-volcano=true
    Annotations:  <none>
    Data
    ====
    cluster-info-device-0:
    ----
    {"mindx-dl-deviceinfo-localhost.localdomain":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-0\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-1\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-2\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}}]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":"Ascend910-0,Ascend910-1,Ascend910-2"},"UpdateTime":1759214666,"CmName":"mindx-dl-deviceinfo-localhost.localdomain","SuperPodID":-2,"ServerIndex":-2},"mindx-dl-deviceinfo-node173":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1759202968,"CmName":"mindx-dl-deviceinfo-node173","SuperPodID":-2,"ServerIndex":-2}}
    Events:  <none>

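    The output shows that the available chips on this node (node name localhost.localdomain) are Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7. Ascend910-0, Ascend910-1, and Ascend910-2 are listed under huawei.com/Ascend910-Unhealthy with fault_type PublicFault, so only five chips can be scheduled and an 8-card job cannot be placed on the node.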
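
The single-line JSON in the ConfigMap is hard to read by eye. The following is a minimal sketch that summarizes it per node, assuming the data has the structure shown in the example output above. The function names are hypothetical, the sample is a trimmed copy of that output, and the fault list is parsed a second time because it is stored as a JSON string inside the outer JSON.

```python
import json

RESOURCE = "huawei.com/Ascend910"  # resource name taken from the output above


def _split(csv: str) -> list:
    """Split a comma-separated chip list; an empty string means no chips."""
    return [name for name in csv.split(",") if name]


def summarize_device_info(cm_data: str, resource: str = RESOURCE) -> dict:
    """Map each device-info entry to its available chips, unhealthy chips, and faults."""
    summary = {}
    for cm_name, info in json.loads(cm_data).items():
        devices = info.get("DeviceList", {})
        # The "-Fault" value is itself a JSON-encoded list.
        faults = json.loads(devices.get(resource + "-Fault") or "[]")
        summary[cm_name] = {
            "available": _split(devices.get(resource, "")),
            "unhealthy": _split(devices.get(resource + "-Unhealthy", "")),
            "faults": [(f.get("npu_name"), f.get("fault_type"), f.get("fault_code"))
                       for f in faults],
        }
    return summary


# Trimmed sample built from the example output (only the fields used here).
sample_cm_data = json.dumps({
    "mindx-dl-deviceinfo-localhost.localdomain": {
        "DeviceList": {
            RESOURCE: "Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7",
            RESOURCE + "-Fault": json.dumps([
                {"fault_type": "PublicFault", "npu_name": "Ascend910-0", "fault_code": "220001001"},
                {"fault_type": "PublicFault", "npu_name": "Ascend910-1", "fault_code": "220001001"},
                {"fault_type": "PublicFault", "npu_name": "Ascend910-2", "fault_code": "220001001"},
            ]),
            RESOURCE + "-Unhealthy": "Ascend910-0,Ascend910-1,Ascend910-2",
        }
    }
})

result = summarize_device_info(sample_cm_data)
for cm_name, entry in result.items():
    print(cm_name, len(entry["available"]), "available,",
          len(entry["unhealthy"]), "unhealthy")
    for npu_name, fault_type, fault_code in entry["faults"]:
        print("  ", npu_name, fault_type, fault_code)
```

On a live cluster, the input string would be the value stored under the cluster-info-device-0 key of the ConfigMap's data, for example read from the output of kubectl get cm -n mindx-dl cluster-info-device-0 -o json.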