Obtaining Information About Currently Available Devices in the Cluster

  1. Query the ConfigMaps.
    kubectl get cm -A | grep cluster-info

    Example output:

    kube-public            cluster-info                                           1      19d
    mindx-dl               cluster-info-device-0                                  1      19h
    mindx-dl               cluster-info-node-cm                                   1      19h
    mindx-dl               cluster-info-switch-0                                  1      19h
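  2. Query the details of each ConfigMap to obtain the available device information. The following uses a node named localhost.localdomain as an example.
    1. Query the details of the device-related ConfigMap to obtain the chips available on the node.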
      kubectl describe cm -n mindx-dl cluster-info-device-0

      Example output:

      Name:         cluster-info-device-0
      Namespace:    mindx-dl
      Labels:       mx-consumer-volcano=true
      Annotations:  <none>
      Data
      ====
      cluster-info-device-0:
      ----
      {"mindx-dl-deviceinfo-localhost.localdomain":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-0\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-1\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-2\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}}]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":"Ascend910-0,Ascend910-1,Ascend910-2"},"UpdateTime":1759214666,"CmName":"mindx-dl-deviceinfo-localhost.localdomain","SuperPodID":-2,"ServerIndex":-2},"mindx-dl-deviceinfo-node173":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1759202968,"CmName":"mindx-dl-deviceinfo-node173","SuperPodID":-2,"ServerIndex":-2}}
      Events:  <none>

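      In this output, the huawei.com/Ascend910 field of the mindx-dl-deviceinfo-localhost.localdomain entry lists the chips available on the node: Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7. Ascend910-0, Ascend910-1, and Ascend910-2 appear under huawei.com/Ascend910-Unhealthy and are therefore not available.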
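The same reading can be scripted. The following is a minimal sketch, not part of the product: it assumes the JSON payload shown under Data has already been obtained as a string (for example from the data field of `kubectl get cm -n mindx-dl cluster-info-device-0 -o json`). The function names `available_chips` and `chip_faults` are hypothetical, and the sample payload is a shortened version modelled on the output above.

```python
import json
from typing import Dict, List

RESOURCE = "huawei.com/Ascend910"  # resource key seen in the sample output


def _device_list(payload_json: str, node_name: str) -> Dict[str, str]:
    """Return the DeviceList of one node, or {} if the node is not in this payload."""
    payload = json.loads(payload_json)
    entry = payload.get("mindx-dl-deviceinfo-" + node_name, {})
    return entry.get("DeviceList", {})


def available_chips(payload_json: str, node_name: str) -> List[str]:
    """Chips listed under the plain resource key are the available ones."""
    raw = _device_list(payload_json, node_name).get(RESOURCE, "")
    return [chip for chip in raw.split(",") if chip]


def chip_faults(payload_json: str, node_name: str) -> Dict[str, str]:
    """Map npu_name to fault_level; the -Fault value is itself a JSON string."""
    raw = _device_list(payload_json, node_name).get(RESOURCE + "-Fault", "")
    faults = json.loads(raw) if raw else []
    return {fault["npu_name"]: fault["fault_level"] for fault in faults}


# Shortened sample modelled on the output above (assumption: same field layout).
sample = json.dumps({
    "mindx-dl-deviceinfo-localhost.localdomain": {
        "DeviceList": {
            "huawei.com/Ascend910": "Ascend910-3,Ascend910-4",
            "huawei.com/Ascend910-Fault": json.dumps(
                [{"npu_name": "Ascend910-0", "fault_level": "SeparateNPU"}]
            ),
            "huawei.com/Ascend910-Unhealthy": "Ascend910-0",
        }
    }
})

print(available_chips(sample, "localhost.localdomain"))
print(chip_faults(sample, "localhost.localdomain"))
```

Note that the huawei.com/Ascend910-Fault value is a JSON document encoded as a string, so it needs a second `json.loads` call.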

    2. Query the details of the node-related ConfigMap to obtain the node status.
      kubectl describe cm -n mindx-dl cluster-info-node-cm

      Example output:

      Name:         cluster-info-node-cm
      Namespace:    mindx-dl
      Labels:       mx-consumer-volcano=true
      Annotations:  <none>
       
      Data
      ====
      cluster-info-node-cm:
      ----
      {"mindx-dl-nodeinfo-localhost.localdomain":{"FaultDevList":[{"DeviceType":"PSU","DeviceId":4,"FaultCode":["0300000D"],"FaultLevel":"NotHandleFault"}],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-localhost.localdomain"}}
       
      BinaryData
      ====
       
      Events:  <none>

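      In this output, NodeStatus is Healthy, which means the node is currently healthy. The PSU fault reported in FaultDevList has FaultLevel NotHandleFault, and NodeStatus remains Healthy.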

    3. Query the details of the switch-related ConfigMap to obtain the node status.
      kubectl describe cm -n mindx-dl cluster-info-switch-0

      Example output:

      Name:         cluster-info-switch-0
      Namespace:    mindx-dl
      Labels:       mx-consumer-volcano=true
      Annotations:  <none>
       
      Data
      ====
      cluster-info-switch-0:
      ----
      {"mindx-dl-switchinfo-localhost.localdomain":{"FaultCode":[],"FaultLevel":"","UpdateTime":1763544679,"NodeStatus":"Healthy","FaultTimeAndLevelMap":{},"CmName":"mindx-dl-switchinfo-localhost.localdomain"}}
       
      BinaryData
      ====
       
      Events:  <none>

      In this output, NodeStatus is Healthy, which means the node is currently healthy.
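Both the node-related and the switch-related payloads expose NodeStatus in the same position, so one helper can read either. The sketch below is illustrative only: `node_status` is a hypothetical name, and the two sample strings are modelled on the outputs of substeps 2 and 3.

```python
import json
from typing import Optional


def node_status(payload_json: str, key_prefix: str, node_name: str) -> Optional[str]:
    """Return NodeStatus for node_name, or None if the node is not in this payload."""
    payload = json.loads(payload_json)
    entry = payload.get(key_prefix + node_name)
    return None if entry is None else entry.get("NodeStatus")


# Samples modelled on the outputs of substeps 2 and 3.
node_sample = (
    '{"mindx-dl-nodeinfo-localhost.localdomain":{"FaultDevList":'
    '[{"DeviceType":"PSU","DeviceId":4,"FaultCode":["0300000D"],'
    '"FaultLevel":"NotHandleFault"}],"NodeStatus":"Healthy",'
    '"CmName":"mindx-dl-nodeinfo-localhost.localdomain"}}'
)
switch_sample = (
    '{"mindx-dl-switchinfo-localhost.localdomain":{"FaultCode":[],'
    '"FaultLevel":"","UpdateTime":1763544679,"NodeStatus":"Healthy",'
    '"FaultTimeAndLevelMap":{},'
    '"CmName":"mindx-dl-switchinfo-localhost.localdomain"}}'
)

print(node_status(node_sample, "mindx-dl-nodeinfo-", "localhost.localdomain"))
print(node_status(switch_sample, "mindx-dl-switchinfo-", "localhost.localdomain"))
```

Returning None for a missing node keeps "node not found in this ConfigMap" distinct from any real status value.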

    Taken together, these query results show that the available chips on this node are Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7.

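    If NodeStatus is UnHealthy in the output of substep 2 or substep 3, none of the devices on the node are available. Combined with the result of substep 1, the node then has no available chips.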
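The rule for combining the three results can be written down in a few lines. This is a sketch under the assumption that the chip list and the two status strings have already been read from the ConfigMaps; `usable_chips` is a hypothetical name. It implements only the rule stated here (an UnHealthy status empties the list), and any other status value is passed through unchanged because the text does not define it.

```python
from typing import List


def usable_chips(device_chips: List[str], node_status: str, switch_status: str) -> List[str]:
    """Apply the documented rule: UnHealthy in either status means no usable chips."""
    if "UnHealthy" in (node_status, switch_status):
        return []
    return list(device_chips)


chips = ["Ascend910-3", "Ascend910-4", "Ascend910-5", "Ascend910-6", "Ascend910-7"]
print(usable_chips(chips, "Healthy", "Healthy"))
print(usable_chips(chips, "Healthy", "UnHealthy"))
```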

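    When the cluster has more than 1000 nodes, the cluster-info-device- and cluster-info-switch- ConfigMaps (the latter holding the mindx-dl-switchinfo- entries) are sharded, and each shard contains the device information of at most 1000 nodes. In this case, run the queries in substeps 1 and 3 against every cluster-info-device- ConfigMap (and, for substep 3, every cluster-info-switch- ConfigMap) until the entry for the target node is found. Only then can the available chips of that node be confirmed.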
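Searching the shards by hand becomes tedious in a large cluster, so the lookup can be scripted. The following is a minimal sketch, not a product tool: `fetch_shards` and `find_entry` are hypothetical names, and the code assumes that each ConfigMap stores its payload under a data key equal to its own name, as in the outputs above. `fetch_shards` needs a working kubectl and is not called by the demo, which uses in-memory data instead.

```python
import json
import subprocess
from typing import Any, Dict, Optional, Tuple


def fetch_shards(prefix: str, namespace: str = "mindx-dl") -> Dict[str, str]:
    """Collect the payload of every ConfigMap whose name starts with prefix."""
    out = subprocess.run(
        ["kubectl", "get", "cm", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    shards: Dict[str, str] = {}
    for item in json.loads(out)["items"]:
        name = item["metadata"]["name"]
        if name.startswith(prefix):
            # Assumption: the data key equals the ConfigMap name.
            shards[name] = (item.get("data") or {}).get(name, "{}")
    return shards


def find_entry(shards: Dict[str, str], key: str) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (ConfigMap name, entry) for the first shard that contains key."""
    for cm_name in sorted(shards):
        payload = json.loads(shards[cm_name])
        if key in payload:
            return cm_name, payload[key]
    return None


# In-memory demo with two made-up shards; no cluster access is needed.
demo_shards = {
    "cluster-info-device-0": json.dumps(
        {"mindx-dl-deviceinfo-localhost.localdomain": {"DeviceList": {}}}
    ),
    "cluster-info-device-1": json.dumps(
        {"mindx-dl-deviceinfo-node173": {"DeviceList": {}}}
    ),
}
print(find_entry(demo_shards, "mindx-dl-deviceinfo-node173"))
```

Against a live cluster, `find_entry(fetch_shards("cluster-info-device-"), "mindx-dl-deviceinfo-" + node_name)` would perform the same search over all device shards.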