Obtaining Information About Available Devices in a Cluster
- Query ConfigMaps.
kubectl get cm -A | grep cluster-info
Command output:
kube-public   cluster-info            1   19d
mindx-dl      cluster-info-device-0   1   19h
mindx-dl      cluster-info-node-cm    1   19h
mindx-dl      cluster-info-switch-0   1   19h
- Query ConfigMap details to obtain information on available devices. The following uses the node named localhost.localdomain as an example.
- Query ConfigMap details related to the device to obtain information on available processors.
kubectl describe cm -n mindx-dl cluster-info-device-0
Command output:
Name:         cluster-info-device-0
Namespace:    mindx-dl
Labels:       mx-consumer-volcano=true
Annotations:  <none>
Data
====
cluster-info-device-0:
----
{"mindx-dl-deviceinfo-localhost.localdomain":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-0\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-1\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-2\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}}]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":"Ascend910-0,Ascend910-1,Ascend910-2"},"UpdateTime":1759214666,"CmName":"mindx-dl-deviceinfo-localhost.localdomain","SuperPodID":-2,"ServerIndex":-2},"mindx-dl-deviceinfo-node173":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1759202968,"CmName":"mindx-dl-deviceinfo-node173","SuperPodID":-2,"ServerIndex":-2}}
Events:  <none>
As shown in the preceding output, the available processors of the node are Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7.
- Query ConfigMap details related to the node to obtain the node status.
kubectl describe cm -n mindx-dl cluster-info-node-cm
Command output:
Name:         cluster-info-node-cm
Namespace:    mindx-dl
Labels:       mx-consumer-volcano=true
Annotations:  <none>
Data
====
cluster-info-node-cm:
----
{"mindx-dl-nodeinfo-localhost.localdomain":{"FaultDevList":[{"DeviceType":"PSU","DeviceId":4,"FaultCode":["0300000D"],"FaultLevel":"NotHandleFault"}],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-localhost.localdomain"}}
BinaryData
====
Events:  <none>
As shown in the preceding command output, NodeStatus is Healthy, indicating that the node is healthy.
- Query ConfigMap details related to the switch to obtain the node status.
kubectl describe cm -n mindx-dl cluster-info-switch-0
Command output:
Name:         cluster-info-switch-0
Namespace:    mindx-dl
Labels:       mx-consumer-volcano=true
Annotations:  <none>
Data
====
cluster-info-switch-0:
----
{"mindx-dl-switchinfo-localhost.localdomain":{"FaultCode":[],"FaultLevel":"","UpdateTime":1763544679,"NodeStatus":"Healthy","FaultTimeAndLevelMap":{},"CmName":"mindx-dl-switchinfo-localhost.localdomain"}}
BinaryData
====
Events:  <none>
As shown in the preceding command output, NodeStatus is Healthy, indicating that the node is healthy.
As shown in the preceding query results, the available processors of the node are Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7.
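The device list can also be read programmatically instead of by eye. The following is a minimal Python sketch, assuming the JSON payload has already been extracted from the ConfigMap (for example from the data field returned by kubectl get cm -n mindx-dl cluster-info-device-0 -o json). The helper names split_names and device_summary are hypothetical, and the sample data is a trimmed copy of the output above with the fault details omitted.

```python
from __future__ import annotations

import json

# Resource name as it appears in the sample output; other processor models
# may use a different name (assumption).
RESOURCE = "huawei.com/Ascend910"


def split_names(value: str) -> list[str]:
    """Split a comma-separated device string; an empty string yields []."""
    return [name for name in value.split(",") if name]


def device_summary(payload: str, node: str) -> dict[str, list[str]]:
    """Summarise one node's entry from a cluster-info-device-* payload."""
    entry = json.loads(payload)[f"mindx-dl-deviceinfo-{node}"]
    devices = entry["DeviceList"]
    return {
        "available": split_names(devices.get(RESOURCE, "")),
        "unhealthy": split_names(devices.get(f"{RESOURCE}-Unhealthy", "")),
    }


# Trimmed copy of the sample payload shown above.
sample = json.dumps({
    "mindx-dl-deviceinfo-localhost.localdomain": {
        "DeviceList": {
            "huawei.com/Ascend910": "Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7",
            "huawei.com/Ascend910-Unhealthy": "Ascend910-0,Ascend910-1,Ascend910-2",
        },
    },
    "mindx-dl-deviceinfo-node173": {
        "DeviceList": {
            "huawei.com/Ascend910": "Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7",
            "huawei.com/Ascend910-Unhealthy": "",
        },
    },
})

summary = device_summary(sample, "localhost.localdomain")
print(summary["available"])
print(summary["unhealthy"])
```

Filtering out empty names matters because fields such as huawei.com/Ascend910-Unhealthy are empty strings when no device is affected.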
If NodeStatus is UnHealthy in the command output of step 2 (node ConfigMap) or step 3 (switch ConfigMap), all devices on the node are unavailable. In that case, the node has no available processors, regardless of the processors listed in the query result of step 1.
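The rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the product: the function names node_status and effective_processors are hypothetical, the payloads are trimmed copies of the node and switch outputs shown earlier, and the status string is compared exactly as spelled in this document (UnHealthy).

```python
from __future__ import annotations

import json


def node_status(payload: str, key: str) -> str:
    """Return NodeStatus for one entry of a node or switch payload."""
    return json.loads(payload)[key]["NodeStatus"]


def effective_processors(listed: list[str], node_state: str, switch_state: str) -> list[str]:
    """Apply the rule: an UnHealthy node or switch status leaves no usable processors."""
    if "UnHealthy" in (node_state, switch_state):
        return []
    return list(listed)


# Trimmed copies of the cluster-info-node-cm and cluster-info-switch-0 payloads.
node_payload = json.dumps({
    "mindx-dl-nodeinfo-localhost.localdomain": {
        "FaultDevList": [
            {"DeviceType": "PSU", "DeviceId": 4,
             "FaultCode": ["0300000D"], "FaultLevel": "NotHandleFault"},
        ],
        "NodeStatus": "Healthy",
    },
})
switch_payload = json.dumps({
    "mindx-dl-switchinfo-localhost.localdomain": {
        "FaultCode": [],
        "FaultLevel": "",
        "NodeStatus": "Healthy",
    },
})

# Processors reported by the device ConfigMap for the example node.
listed = ["Ascend910-3", "Ascend910-4", "Ascend910-5", "Ascend910-6", "Ascend910-7"]

node_state = node_status(node_payload, "mindx-dl-nodeinfo-localhost.localdomain")
switch_state = node_status(switch_payload, "mindx-dl-switchinfo-localhost.localdomain")
usable = effective_processors(listed, node_state, switch_state)
print(usable)
```

Note that the sample node reports a PSU fault at level NotHandleFault yet still has NodeStatus Healthy, so the sketch keys only on NodeStatus and does not inspect FaultDevList.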
When a cluster contains more than 1000 nodes, the ConfigMaps corresponding to cluster-info-device- and mindx-dl-switchinfo- are partitioned. Each cluster-info-device- or mindx-dl-switchinfo- ConfigMap contains the information of a maximum of 1000 nodes. In this scenario, perform steps 1 and 3 on every partitioned ConfigMap to find the detailed information about the target node and determine its available processors.
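Searching several partitions by hand is tedious, so the lookup can be scripted. The following Python sketch is one possible approach under stated assumptions: load_partitions and find_entry are hypothetical helper names, the data key of each ConfigMap is assumed to match the ConfigMap name (as in the describe output above), and the second partition name cluster-info-device-1 in the offline example is illustrative only.

```python
from __future__ import annotations

import json
import subprocess


def load_partitions(prefix: str, namespace: str = "mindx-dl") -> dict[str, str]:
    """Fetch every ConfigMap whose name starts with prefix; needs kubectl access."""
    result = subprocess.run(
        ["kubectl", "get", "cm", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    partitions = {}
    for item in json.loads(result.stdout)["items"]:
        name = item["metadata"]["name"]
        if name.startswith(prefix):
            # Assumption: the data key matches the ConfigMap name.
            partitions[name] = item.get("data", {}).get(name, "{}")
    return partitions


def find_entry(partitions: dict[str, str], key: str):
    """Return (configmap_name, entry) for the partition holding key, else None."""
    for name, payload in sorted(partitions.items()):
        entries = json.loads(payload)
        if key in entries:
            return name, entries[key]
    return None


# Offline example with two partitions; the split shown here is hypothetical.
partitions = {
    "cluster-info-device-0": json.dumps({
        "mindx-dl-deviceinfo-localhost.localdomain": {
            "DeviceList": {"huawei.com/Ascend910": "Ascend910-3,Ascend910-4"},
        },
    }),
    "cluster-info-device-1": json.dumps({
        "mindx-dl-deviceinfo-node173": {
            "DeviceList": {"huawei.com/Ascend910": "Ascend910-2,Ascend910-3"},
        },
    }),
}

found = find_entry(partitions, "mindx-dl-deviceinfo-node173")
print(found[0] if found else "node not found in any partition")
```

On a live cluster, partitions = load_partitions("cluster-info-device-") would replace the hand-built dictionary; the same search applies to the switch ConfigMaps with the matching prefix and key.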