Obtaining Information About Available Devices in a Cluster

  1. Query ConfigMaps.
    kubectl get cm -A | grep cluster-info

    Command output:

    kube-public            cluster-info                                           1      19d
    mindx-dl               cluster-info-device-0                                  1      19h
    mindx-dl               cluster-info-node-cm                                   1      19h
    mindx-dl               cluster-info-switch-0                                  1      19h
  2. Query ConfigMap details to obtain information on available devices. The following uses the node named localhost.localdomain as an example.
    1. Query ConfigMap details related to the device to obtain information on available processors.
      kubectl describe cm -n mindx-dl cluster-info-device-0

      Command output:

      Name:         cluster-info-device-0
      Namespace:    mindx-dl
      Labels:       mx-consumer-volcano=true
      Annotations:  <none>
      Data
      ====
      cluster-info-device-0:
      ----
      {"mindx-dl-deviceinfo-localhost.localdomain":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-0\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-1\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-2\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}}]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":"Ascend910-0,Ascend910-1,Ascend910-2"},"UpdateTime":1759214666,"CmName":"mindx-dl-deviceinfo-localhost.localdomain","SuperPodID":-2,"ServerIndex":-2},"mindx-dl-deviceinfo-node173":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1759202968,"CmName":"mindx-dl-deviceinfo-node173","SuperPodID":-2,"ServerIndex":-2}}
      Events:  <none>

      As shown in the preceding output, the available processors of the node are Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7.

    2. Query ConfigMap details related to the node to obtain the node status.
      kubectl describe cm -n mindx-dl cluster-info-node-cm

      Command output:

      Name:         cluster-info-node-cm
      Namespace:    mindx-dl
      Labels:       mx-consumer-volcano=true
      Annotations:  <none>
       
      Data
      ====
      cluster-info-node-cm:
      ----
      {"mindx-dl-nodeinfo-localhost.localdomain":{"FaultDevList":[{"DeviceType":"PSU","DeviceId":4,"FaultCode":["0300000D"],"FaultLevel":"NotHandleFault"}],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-localhost.localdomain"}}
       
      BinaryData
      ====
       
      Events:  <none>

      As shown in the preceding command output, NodeStatus is Healthy, indicating that the node is healthy.

    3. Query ConfigMap details related to the switch to obtain the node status.
      kubectl describe cm -n mindx-dl cluster-info-switch-0

      Command output:

      Name:         cluster-info-switch-0
      Namespace:    mindx-dl
      Labels:       mx-consumer-volcano=true
      Annotations:  <none>
       
      Data
      ====
      cluster-info-switch-0:
      ----
      {"mindx-dl-switchinfo-localhost.localdomain":{"FaultCode":[],"FaultLevel":"","UpdateTime":1763544679,"NodeStatus":"Healthy","FaultTimeAndLevelMap":{},"CmName":"mindx-dl-switchinfo-localhost.localdomain"}}
       
      BinaryData
      ====
       
      Events:  <none>

      As shown in the preceding command output, NodeStatus is Healthy, indicating that the node is healthy.

    Because NodeStatus is Healthy in both step 2 and step 3, the available processors of the node are those listed in step 1: Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7.

    If NodeStatus is UnHealthy in the command output of step 2 or step 3, all devices on the node are unavailable. In that case, the node has no available processors, regardless of the device list returned in step 1.
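
    The decision rule described above can be sketched in Python. This is only an illustration, not part of MindX DL: the function names are hypothetical, the sample payloads are trimmed from the command output shown earlier, and the sketch assumes that a node with no entry in the device payload has no available processors.

```python
import json

# Resource name taken from the DeviceList in the sample output.
RESOURCE = "huawei.com/Ascend910"
NODE = "localhost.localdomain"

# Sample payloads, trimmed from the command output in steps 1 to 3.
DEVICE_JSON = json.dumps({
    "mindx-dl-deviceinfo-" + NODE: {
        "DeviceList": {
            RESOURCE: "Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7",
            RESOURCE + "-Unhealthy": "Ascend910-0,Ascend910-1,Ascend910-2",
        }
    }
})
NODE_JSON = json.dumps({"mindx-dl-nodeinfo-" + NODE: {"NodeStatus": "Healthy"}})
SWITCH_JSON = json.dumps({"mindx-dl-switchinfo-" + NODE: {"NodeStatus": "Healthy"}})


def node_status(cm_json, prefix, node):
    """Return NodeStatus from a node or switch payload, or None if the node is absent."""
    entry = json.loads(cm_json).get(prefix + node)
    return entry.get("NodeStatus") if entry else None


def available_processors(device_json, node_json, switch_json, node):
    """Hypothetical helper: list the processors that can be used on a node."""
    statuses = (
        node_status(node_json, "mindx-dl-nodeinfo-", node),
        node_status(switch_json, "mindx-dl-switchinfo-", node),
    )
    # An UnHealthy node or switch makes every device on the node unavailable.
    if "UnHealthy" in statuses:
        return []
    entry = json.loads(device_json).get("mindx-dl-deviceinfo-" + node)
    if entry is None:
        # Assumption: no device entry means no available processors.
        return []
    listed = entry["DeviceList"].get(RESOURCE, "")
    # The device list is a comma-separated string; drop empty items.
    return [name for name in listed.split(",") if name]


if __name__ == "__main__":
    print(available_processors(DEVICE_JSON, NODE_JSON, SWITCH_JSON, NODE))
```

    In practice, the three JSON strings would be taken from the Data section of cluster-info-device-0, cluster-info-node-cm, and cluster-info-switch-0 respectively.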

    When a cluster contains more than 1000 nodes, the ConfigMaps whose names start with cluster-info-device- and cluster-info-switch- are partitioned. Each of these ConfigMaps contains the information of a maximum of 1000 nodes. In this scenario, perform steps 1 and 3 on every cluster-info-device- and cluster-info-switch- ConfigMap to find the information about the target node and determine its available processors.
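
    Searching the partitions by hand is tedious, so the lookup can be scripted. The following is a minimal sketch, assuming that kubectl is on the PATH and that each partition stores one JSON document per data key, as in the output shown earlier. The helper names load_configmaps and find_node_entry are hypothetical.

```python
import json
import subprocess


def load_configmaps(namespace="mindx-dl"):
    """Return the parsed output of `kubectl get cm -n <namespace> -o json`."""
    out = subprocess.run(
        ["kubectl", "get", "cm", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def find_node_entry(cm_list, cm_prefix, entry_key):
    """Search every ConfigMap whose name starts with cm_prefix for entry_key.

    Returns (configmap_name, entry), or (None, None) if no partition contains it.
    """
    for item in cm_list.get("items", []):
        name = item["metadata"]["name"]
        if not name.startswith(cm_prefix):
            continue
        # Each data value is one JSON document keyed by per-node entry names.
        for payload in (item.get("data") or {}).values():
            entry = json.loads(payload).get(entry_key)
            if entry is not None:
                return name, entry
    return None, None


if __name__ == "__main__":
    cms = load_configmaps()
    name, entry = find_node_entry(
        cms, "cluster-info-device-", "mindx-dl-deviceinfo-localhost.localdomain"
    )
    print(name, entry["DeviceList"] if entry else None)
```

    The same function can be pointed at the switch partitions by passing cluster-info-switch- as the prefix and mindx-dl-switchinfo-localhost.localdomain as the key.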