ClusterD
ClusterD collects node faults, processor faults, and interconnect device faults from inside the cluster and stores them in Kubernetes ConfigMaps, where external components can query and use them.
Node Faults
Query command: kubectl describe cm -n mindx-dl cluster-info-node-cm
The following is an example of the returned data:
{"mindx-dl-nodeinfo-kwok-node-0":{"FaultDevList":[],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-kwok-node-0"},"mindx-dl-nodeinfo-kwok-node-1001":{"FaultDevList":[],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-kwok-node-1001"}}
| Parameter | Description |
|---|---|
| mindx-dl-nodeinfo-<kwok-node-0> | The prefix is fixed to mindx-dl-nodeinfo; kwok-node-0 is the node name, which helps locate the faulty node. |
| NodeInfo | Node fault information. |
| FaultDevList | List of faulty devices on the node. |
| - DeviceType | Type of the faulty device. |
| - DeviceId | ID of the faulty device. |
| - FaultCode | Fault code, a hexadecimal string consisting of letters and digits. |
| - FaultLevel | Fault handling level. |
| NodeStatus | Node health status, determined by the device with the highest fault handling level on the node. |
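As a rough illustration of how a consumer might use this data, the sketch below parses the JSON payload and returns the nodes whose NodeStatus is not Healthy, together with their FaultDevList. The function name `unhealthy_nodes` is hypothetical, and the sketch assumes the payload has already been read out of the ConfigMap as a string; the sample is a shortened copy of the example above.

```python
import json

# Shortened copy of the example payload shown above (one node only).
SAMPLE = (
    '{"mindx-dl-nodeinfo-kwok-node-0":'
    '{"FaultDevList":[],"NodeStatus":"Healthy",'
    '"CmName":"mindx-dl-nodeinfo-kwok-node-0"}}'
)

PREFIX = "mindx-dl-nodeinfo-"  # fixed prefix described in the table


def unhealthy_nodes(payload):
    """Return {node name: FaultDevList} for nodes whose NodeStatus is not "Healthy".

    Hypothetical helper: it only relies on the fields documented in the table.
    """
    result = {}
    for key, info in json.loads(payload).items():
        if info.get("NodeStatus") != "Healthy":
            # Strip the fixed prefix to recover the plain node name.
            node = key[len(PREFIX):] if key.startswith(PREFIX) else key
            result[node] = info.get("FaultDevList", [])
    return result


if __name__ == "__main__":
    # Every node in the sample is healthy, so nothing is reported.
    print(unhealthy_nodes(SAMPLE))
```

Comparing against "Healthy" rather than matching specific fault states keeps the sketch independent of the other NodeStatus values, which are not listed here.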
Processor Faults
Query command: kubectl describe cm -n mindx-dl cluster-info-device-${m}
m is an integer starting from 0. For every 1000 nodes added to the cluster, another ConfigMap named cluster-info-device-${m} is created.
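To query every processor-fault ConfigMap, a client needs the full list of names. The sketch below derives them from the cluster's node count. It assumes that the number of ConfigMaps is the node count divided by 1000, rounded up; that rounding rule is an interpretation of the description above and should be checked against a real cluster. The function name `device_configmap_names` is hypothetical.

```python
def device_configmap_names(node_count, nodes_per_configmap=1000):
    """List the ConfigMap names expected for a cluster of node_count nodes.

    Assumption: one ConfigMap per started block of nodes_per_configmap nodes,
    numbered from 0 (ceiling division).
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    # Ceiling division without importing math.
    shards = -(-node_count // nodes_per_configmap)
    return [f"cluster-info-device-{m}" for m in range(shards)]


if __name__ == "__main__":
    for name in device_configmap_names(2500):
        print(f"kubectl describe cm -n mindx-dl {name}")
```

Under the stated assumption, a 2500-node cluster yields three names, cluster-info-device-0 through cluster-info-device-2.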
The following is an example of the returned data:
{"mindx-dl-deviceinfo-kwok-node-0":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1693899390,"CmName":"mindx-dl-deviceinfo-kwok-node-0","SuperPodID":0,"ServerIndex":0},"mindx-dl-deviceinfo-kwok-node-1001":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1693899390,"CmName":"mindx-dl-deviceinfo-kwok-node-1001","SuperPodID":0,"ServerIndex":0}}
| Parameter | Description |
|---|---|
| mindx-dl-deviceinfo-<kwok-node-0> | The prefix is fixed to mindx-dl-deviceinfo; kwok-node-0 is the node name, which helps locate the faulty node. |
| huawei.com/Ascend910 | Names of the available processors on the current node. Multiple processors are separated by commas (,). |
| huawei.com/Ascend910-NetworkUnhealthy | Names of the processors with an unhealthy network on the current node. Multiple processors are separated by commas (,). |
| huawei.com/Ascend910-Unhealthy | Names of the unhealthy processors on the current node. Multiple processors are separated by commas (,). |
| huawei.com/Ascend910-Fault | Array of objects, each containing fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, fault_code, and fault_time_and_level_map. |
| - fault_type | Fault type. |
| - npu_name | Name of the faulty processor. Left empty for node faults. |
| - large_model_fault_level | Fault handling type. Left empty for node faults. |
| - fault_level | |
| - fault_handling | |
| - fault_code | Fault codes, given as a single string with the codes separated by commas (,). |
| - fault_time_and_level_map | Fault code, fault occurrence time, and fault handling level. |
| SuperPodID | SuperPoD ID. |
| ServerIndex | Relative position of the current node in the SuperPoD. |
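Because the processor lists are stored as comma-separated strings, a consumer has to split them before use. The following sketch shows one way to do that for the three DeviceList keys described in the table. The names `split_names` and `processor_summary` are hypothetical, and the sample payload is a shortened version of the example above (two processors instead of eight).

```python
import json

# Shortened version of the example payload shown above.
SAMPLE = (
    '{"mindx-dl-deviceinfo-kwok-node-0":{"DeviceList":{'
    '"huawei.com/Ascend910":"Ascend910-0,Ascend910-1",'
    '"huawei.com/Ascend910-NetworkUnhealthy":"",'
    '"huawei.com/Ascend910-Unhealthy":""},'
    '"UpdateTime":1693899390,"CmName":"mindx-dl-deviceinfo-kwok-node-0",'
    '"SuperPodID":0,"ServerIndex":0}}'
)


def split_names(value):
    """Split a comma-separated processor list; an empty string gives []."""
    return [name.strip() for name in value.split(",") if name.strip()]


def processor_summary(payload, resource="huawei.com/Ascend910"):
    """Return per-node lists of available and unhealthy processors."""
    summary = {}
    for key, info in json.loads(payload).items():
        devices = info.get("DeviceList", {})
        summary[key] = {
            "available": split_names(devices.get(resource, "")),
            "network_unhealthy": split_names(
                devices.get(f"{resource}-NetworkUnhealthy", "")
            ),
            "unhealthy": split_names(devices.get(f"{resource}-Unhealthy", "")),
        }
    return summary


if __name__ == "__main__":
    for node, lists in processor_summary(SAMPLE).items():
        print(node, lists)
```

Filtering out empty items matters here: splitting an empty string on commas would otherwise produce a list containing one empty name.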
Interconnect Device Faults
Query command: kubectl describe cm -n mindx-dl cluster-info-switch-${m}
m is an integer starting from 0. For every 2000 nodes added to the cluster, another ConfigMap named cluster-info-switch-${m} is created.
{"FaultCode":["000001c1"],"FaultLevel":"NotHandle","UpdateTime":1722845555,"NodeStatus":"Healthy"}
| Parameter | Description |
|---|---|
| FaultCode | Fault code, a hexadecimal string consisting of letters and digits. |
| FaultLevel | Handling policy for the highest-level fault. |
| UpdateTime | Time when the ConfigMap was updated. |
| NodeStatus | Status of the current node. |
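The fields above can be decoded with a few lines of code. The sketch below rests on two assumptions that are not stated in this section: that each FaultCode entry is a quoted hexadecimal string (so that the payload is valid JSON), and that UpdateTime is a Unix timestamp in seconds. The function name `parse_switch_info` is hypothetical.

```python
import json
from datetime import datetime, timezone

# Assumption: fault codes are quoted strings, so the payload parses as JSON.
SAMPLE = (
    '{"FaultCode":["000001c1"],"FaultLevel":"NotHandle",'
    '"UpdateTime":1722845555,"NodeStatus":"Healthy"}'
)


def parse_switch_info(payload):
    """Decode one interconnect-device fault record into Python values."""
    info = json.loads(payload)
    return {
        # Hexadecimal strings converted to integers for easy comparison.
        "fault_codes": [int(code, 16) for code in info.get("FaultCode", [])],
        "fault_level": info.get("FaultLevel", ""),
        # Assumption: UpdateTime counts seconds since the Unix epoch (UTC).
        "updated_at": datetime.fromtimestamp(info["UpdateTime"], tz=timezone.utc),
        "node_status": info.get("NodeStatus", ""),
    }


if __name__ == "__main__":
    record = parse_switch_info(SAMPLE)
    print(record["fault_codes"], record["updated_at"].isoformat())
```

Keeping the timestamp timezone-aware avoids ambiguity when records from nodes in different time zones are compared.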