Obtaining Information About Currently Available Devices in the Cluster
- Query the ConfigMaps.
kubectl get cm -A | grep cluster-info
Example output:
kube-public   cluster-info            1   19d
mindx-dl      cluster-info-device-0   1   19h
mindx-dl      cluster-info-node-cm    1   19h
mindx-dl      cluster-info-switch-0   1   19h
- Query the details of each ConfigMap to obtain the available device information. The following uses a node named localhost.localdomain as an example.
- Query the details of the device-related ConfigMap to obtain the chips available on the node.
kubectl describe cm -n mindx-dl cluster-info-device-0
Example output:
Name:         cluster-info-device-0
Namespace:    mindx-dl
Labels:       mx-consumer-volcano=true
Annotations:  <none>

Data
====
cluster-info-device-0:
----
{"mindx-dl-deviceinfo-localhost.localdomain":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-0\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-1\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}},{\"fault_type\":\"PublicFault\",\"npu_name\":\"Ascend910-2\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"220001001\",\"fault_time_and_level_map\":{\"220001001\":{\"fault_time\":1736926605,\"fault_level\":\"SeparateNPU\"}}}]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":"Ascend910-0,Ascend910-1,Ascend910-2"},"UpdateTime":1759214666,"CmName":"mindx-dl-deviceinfo-localhost.localdomain","SuperPodID":-2,"ServerIndex":-2},"mindx-dl-deviceinfo-node173":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1759202968,"CmName":"mindx-dl-deviceinfo-node173","SuperPodID":-2,"ServerIndex":-2}}

Events:  <none>

The output shows that the chips available on this node are Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7.
- Query the details of the node-related ConfigMap to obtain the node status.
kubectl describe cm -n mindx-dl cluster-info-node-cm
Example output:
Name:         cluster-info-node-cm
Namespace:    mindx-dl
Labels:       mx-consumer-volcano=true
Annotations:  <none>

Data
====
cluster-info-node-cm:
----
{"mindx-dl-nodeinfo-localhost.localdomain":{"FaultDevList":[{"DeviceType":"PSU","DeviceId":4,"FaultCode":["0300000D"],"FaultLevel":"NotHandleFault"}],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-localhost.localdomain"}}

BinaryData
====

Events:  <none>

The output shows that the node's NodeStatus is Healthy, which means the node is currently healthy.
- Query the details of the switch-related ConfigMap to obtain the node status.
kubectl describe cm -n mindx-dl cluster-info-switch-0
Example output:
Name:         cluster-info-switch-0
Namespace:    mindx-dl
Labels:       mx-consumer-volcano=true
Annotations:  <none>

Data
====
cluster-info-switch-0:
----
{"mindx-dl-switchinfo-localhost.localdomain":{"FaultCode":[],"FaultLevel":"","UpdateTime":1763544679,"NodeStatus":"Healthy","FaultTimeAndLevelMap":{},"CmName":"mindx-dl-switchinfo-localhost.localdomain"}}

BinaryData
====

Events:  <none>

The output shows that the node's NodeStatus is Healthy, which means the node is currently healthy.
Taken together, the query results show that the chips available on this node are Ascend910-3, Ascend910-4, Ascend910-5, Ascend910-6, and Ascend910-7.
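The device ConfigMap stores all nodes in a single long JSON string, so picking out one node's chips by eye is error-prone. The following is a minimal Python sketch of that lookup, assuming the JSON string has already been fetched (for example, from the `data` field returned by `kubectl get cm -n mindx-dl cluster-info-device-0 -o json`). The helper name `available_chips` and the trimmed sample payload are illustrative and are not part of MindX DL; the key layout follows the example output of the device query.

```python
import json
from typing import List


def available_chips(payload: str, node_name: str,
                    resource: str = "huawei.com/Ascend910") -> List[str]:
    """Return the chips listed under `resource` for `node_name`.

    `payload` is the JSON string stored in a cluster-info-device-* ConfigMap.
    An empty list means the node is absent from this payload or has no
    chips listed under `resource`.
    """
    entries = json.loads(payload)
    entry = entries.get("mindx-dl-deviceinfo-" + node_name)
    if entry is None:
        return []
    chips = entry.get("DeviceList", {}).get(resource, "")
    # The value is a comma-separated string; drop empty items.
    return [chip for chip in chips.split(",") if chip]


# Trimmed sample based on the example output (fault details omitted).
sample_payload = json.dumps({
    "mindx-dl-deviceinfo-localhost.localdomain": {
        "DeviceList": {
            "huawei.com/Ascend910":
                "Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7",
            "huawei.com/Ascend910-Unhealthy":
                "Ascend910-0,Ascend910-1,Ascend910-2",
        }
    }
})

print(available_chips(sample_payload, "localhost.localdomain"))
```

Returning an empty list for a missing node keeps the caller simple, but it does not distinguish "node not in this ConfigMap" from "node has no available chips"; a caller that needs the distinction would have to check for the entry key itself.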
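If NodeStatus is UnHealthy in the output of step 2 (the node ConfigMap) or step 3 (the switch ConfigMap), none of the devices on the node are available. In that case, regardless of what step 1 (the device ConfigMap) reports, the node has no available chips.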
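The rule above can be expressed as a small check. This is a sketch under stated assumptions: the helper names `node_status` and `usable_chips` are hypothetical, the payloads are trimmed versions of the example output, and the UnHealthy switch status is made up to show the failing case. Only the exact value UnHealthy is treated as disqualifying, because that is the only value the rule names.

```python
import json
from typing import List


def node_status(payload: str, entry_key: str) -> str:
    """Read NodeStatus for one entry of a node or switch ConfigMap payload."""
    return json.loads(payload)[entry_key]["NodeStatus"]


def usable_chips(chips: List[str], statuses: List[str]) -> List[str]:
    """Any UnHealthy status makes every chip on the node unavailable."""
    if any(status == "UnHealthy" for status in statuses):
        return []
    return list(chips)


node_payload = json.dumps({
    "mindx-dl-nodeinfo-localhost.localdomain": {"NodeStatus": "Healthy"}
})
# Hypothetical failing case: the example output reports Healthy here.
switch_payload = json.dumps({
    "mindx-dl-switchinfo-localhost.localdomain": {"NodeStatus": "UnHealthy"}
})

statuses = [
    node_status(node_payload, "mindx-dl-nodeinfo-localhost.localdomain"),
    node_status(switch_payload, "mindx-dl-switchinfo-localhost.localdomain"),
]
print(usable_chips(["Ascend910-3", "Ascend910-4"], statuses))  # prints []
```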
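When the cluster has more than 1000 nodes, the ConfigMaps corresponding to cluster-info-device- and mindx-dl-switchinfo- are sharded. Each cluster-info-device- or mindx-dl-switchinfo- ConfigMap holds device information for at most 1000 nodes. In this case, run the queries from step 1 and step 3 against every cluster-info-device- ConfigMap until you find the details of the target node; only then can you confirm which chips are available on that node.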
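Searching the shards one by one can be sketched as follows. The function name `find_entry` and the two-shard layout are hypothetical: the sample places node173 in a second shard purely for illustration, and the payloads are assumed to have been fetched beforehand, one JSON string per ConfigMap.

```python
import json
from typing import Dict, Optional, Tuple


def find_entry(shards: Dict[str, str],
               entry_key: str) -> Optional[Tuple[str, dict]]:
    """Search every shard payload for `entry_key`.

    `shards` maps a ConfigMap name to the JSON string it stores.
    Returns (ConfigMap name, entry) for the first match, or None.
    """
    for cm_name in sorted(shards):
        entry = json.loads(shards[cm_name]).get(entry_key)
        if entry is not None:
            return cm_name, entry
    return None


# Hypothetical two-shard layout, trimmed to the available-chip field.
shards = {
    "cluster-info-device-0": json.dumps({
        "mindx-dl-deviceinfo-localhost.localdomain": {
            "DeviceList": {"huawei.com/Ascend910": "Ascend910-3,Ascend910-4"}
        }
    }),
    "cluster-info-device-1": json.dumps({
        "mindx-dl-deviceinfo-node173": {
            "DeviceList": {"huawei.com/Ascend910": "Ascend910-2,Ascend910-3"}
        }
    }),
}

hit = find_entry(shards, "mindx-dl-deviceinfo-node173")
print(hit[0] if hit else "not found")
```

The same search applies to the switch shards by passing their payloads and a mindx-dl-switchinfo- key instead.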