ClusterD

ClusterD collects node faults, processor faults, and interconnect device faults from inside the cluster and exposes them in Kubernetes ConfigMaps for external query and use.

Node Faults

Query command: kubectl describe cm -n mindx-dl cluster-info-node-cm

The following uses an Atlas A3 training product as an example to illustrate the query result. Parameters in the command output may vary by device type. For details about the key parameters, see Table 1.

{"mindx-dl-nodeinfo-kwok-node-0":{"FaultDevList":[],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-kwok-node-0"},"mindx-dl-nodeinfo-kwok-node-1001":{"FaultDevList":[],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-kwok-node-1001"}}

**Table 1** Node fault parameters
Parameter	Description
mindx-dl-nodeinfo-<kwok-node-0>	The prefix mindx-dl-nodeinfo is fixed, and kwok-node-0 is the node name, which helps locate the fault.
NodeInfo	Node fault information
FaultDevList	List of faulty devices on a node
- DeviceType	Faulty device type
- DeviceId	ID of the faulty device
- FaultCode	Fault code, a hexadecimal string consisting of letters and digits.
- FaultLevel	Fault handling level. NotHandleFault: no handling is required; PreSeparateFault: if a job is running on the node, the fault is not handled, but no new job is scheduled to the node; SeparateFault: the job is rescheduled.
NodeStatus	Node health status, determined by the highest fault handling level among the devices on the node. Healthy: the fault handling level on the node does not exceed NotHandleFault, so the node is healthy and can run training jobs normally; a node at PreSeparateFault is also deemed healthy while NPUs on it are in use, and becomes a faulty node after the job completes. UnHealthy: the fault handling level on the node is SeparateFault, so the node is faulty and affects the training job, which must be moved off the node immediately; a node at PreSeparateFault with no NPU in use is also a faulty node, and no other jobs can be scheduled to it.
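
The NodeStatus rules in Table 1 can be read as a small decision function. The following Python sketch is only an illustration of those rules, not ClusterD's implementation; the function name derive_node_status and the npus_in_use flag are hypothetical.

```python
# Fault handling levels from Table 1, ordered from least to most severe.
LEVEL_ORDER = ["NotHandleFault", "PreSeparateFault", "SeparateFault"]


def derive_node_status(fault_levels, npus_in_use):
    """Return "Healthy" or "UnHealthy" following the rules in Table 1.

    fault_levels: FaultLevel strings taken from FaultDevList (may be empty).
    npus_in_use:  True if a job is currently using NPUs on the node
                  (hypothetical input; ClusterD determines this itself).
    """
    # The node status follows the most severe level present on the node.
    highest = max(fault_levels, key=LEVEL_ORDER.index, default="NotHandleFault")
    if highest == "SeparateFault":
        return "UnHealthy"
    if highest == "PreSeparateFault":
        # Pre-separation only takes effect once no job is using the NPUs.
        return "Healthy" if npus_in_use else "UnHealthy"
    return "Healthy"
```

For example, derive_node_status(["PreSeparateFault"], npus_in_use=True) reports the node as healthy until the running job finishes.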

Processor Faults

Query command: kubectl describe cm -n mindx-dl cluster-info-device-${m}

m is an integer starting from 0. Each time 1000 nodes are added to the cluster, a ConfigMap file named cluster-info-device-${m} is added.
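
To query every device ConfigMap, a script needs the list of names to pass to kubectl. The Python sketch below builds that list under one assumption that the text does not confirm: that each started block of 1000 nodes corresponds to one ConfigMap, with index 0 always present. The function name device_configmap_names is hypothetical.

```python
import math


def device_configmap_names(node_count, nodes_per_configmap=1000):
    """List the expected cluster-info-device-${m} names for a cluster size.

    Assumption: one ConfigMap per started block of nodes_per_configmap
    nodes, and cluster-info-device-0 exists even for a small cluster.
    """
    count = max(1, math.ceil(node_count / nodes_per_configmap))
    return [f"cluster-info-device-{m}" for m in range(count)]


# Print one query command per expected ConfigMap.
for name in device_configmap_names(2500):
    print(f"kubectl describe cm -n mindx-dl {name}")
```

If the boundary behaves differently in a given release, listing the ConfigMaps in the mindx-dl namespace is the more reliable way to find the valid values of m.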

The following uses an Atlas A3 training product as an example to illustrate the query result. The displayed parameters may vary by device type. For details about the key parameters, see Table 2.

{"mindx-dl-deviceinfo-kwok-node-0":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1693899390,"CmName":"mindx-dl-deviceinfo-kwok-node-0","SuperPodID":0,"ServerIndex":0},"mindx-dl-deviceinfo-kwok-node-1001":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1693899390,"CmName":"mindx-dl-deviceinfo-kwok-node-1001","SuperPodID":0,"ServerIndex":0}}

**Table 2** Parameters of cluster-info-device-${m}
Parameter	Description
mindx-dl-deviceinfo-<kwok-node-0>	The prefix is fixed to mindx-dl-deviceinfo, and kwok-node-0 is the node name for locating the faulty node.
huawei.com/Ascend910	Names of the available processors on the current node. Multiple processors are separated by commas (,).
huawei.com/Ascend910-NetworkUnhealthy	Names of the processors with an unhealthy network on the current node. Multiple processors are separated by commas (,).
huawei.com/Ascend910-Unhealthy	Names of the unhealthy processors on the current node. Multiple processors are separated by commas (,).
huawei.com/Ascend910-Fault	Array object, including fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, fault_code, and fault_time_and_level_map.
- fault_type	Fault type. CardUnhealthy: processor fault; CardNetworkUnhealthy: parameter plane network fault (processor network fault); NodeUnhealthy: node fault; PublicFault: public fault.
- npu_name	Name of the faulty processor. This parameter is left empty if the node is faulty.
- large_model_fault_level	Fault handling type. This parameter is left empty for node faults. NotHandleFault: no handling is required; RestartRequest: re-executes inference requests in the inference scenario, or training requests in the training scenario; RestartBusiness: re-executes services; FreeRestartNPU: resets an idle processor when faults affect service execution; RestartNPU: directly resets the processor and re-executes services; SeparateNPU: isolates the processor; PreSeparateNPU: pre-isolates the processor and decides whether to reschedule based on the actual running status of the training job. NOTE: large_model_fault_level, fault_level, and fault_handling serve the same function; fault_handling is recommended. When an inference job subscribes to fault information, a RestartRequest fault on its inference card does not trigger rescheduling if the fault lasts 60 seconds or less; if the fault persists for more than 60 seconds, the affected processor is isolated and the job is rescheduled.
- fault_level	Same as large_model_fault_level.
- fault_handling	Same as large_model_fault_level. This is the recommended field.
- fault_code	Fault codes, as a string in which multiple codes are separated by commas (,).
- fault_time_and_level_map	Fault code, fault occurrence time, and fault handling level.
SuperPodID	SuperPoD ID
ServerIndex	Relative position of the current node in a SuperPoD. NOTE: If the driver reports 0xffffffff for SuperPodID or ServerIndex, the corresponding value is -1. The value is -2 when the current device does not support querying SuperPoD information, or when the SuperPoD information cannot be obtained due to a driver problem.
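
As an illustration of how the fields in Table 2 fit together, the following Python sketch parses a payload shaped like the sample output, splits the comma-separated processor lists, and translates the -1 and -2 values of SuperPodID. The function names and the summary layout are hypothetical, and the resource key huawei.com/Ascend910 is taken from the sample, so it may differ on other device types.

```python
import json

# Special SuperPodID / ServerIndex values described in Table 2.
SENTINELS = {
    -1: "driver reported 0xffffffff",
    -2: "SuperPoD query unsupported or failed",
}

PREFIX = "mindx-dl-deviceinfo-"
RESOURCE = "huawei.com/Ascend910"  # from the sample; varies by device type


def split_names(value):
    """Split a comma-separated processor list; an empty string gives []."""
    return [name for name in value.split(",") if name]


def summarize_devices(payload):
    """Map each node name to its processor lists and SuperPoD ID."""
    summary = {}
    for key, info in json.loads(payload).items():
        node = key[len(PREFIX):] if key.startswith(PREFIX) else key
        devices = info.get("DeviceList", {})
        pod_id = info.get("SuperPodID")
        summary[node] = {
            "available": split_names(devices.get(RESOURCE, "")),
            "unhealthy": split_names(devices.get(RESOURCE + "-Unhealthy", "")),
            "network_unhealthy": split_names(
                devices.get(RESOURCE + "-NetworkUnhealthy", "")
            ),
            # Replace -1 / -2 with a readable reason, keep real IDs as-is.
            "super_pod": SENTINELS.get(pod_id, pod_id),
        }
    return summary
```

Treating an empty string as an empty list keeps healthy nodes, whose unhealthy lists are "", from showing a spurious empty processor name.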

Interconnect Device Faults

Query command: kubectl describe cm -n mindx-dl cluster-info-switch-${m}

m is an integer starting from 0. Each time 2000 nodes are added to the cluster, a ConfigMap file named cluster-info-switch-${m} is added.

{"FaultCode":[000001c1],"FaultLevel":"NotHandle","UpdateTime":1722845555,"NodeStatus":"Healthy"}

**Table 3** Parameters of interconnect device faults
Parameter	Description
FaultCode	Fault code, a hexadecimal string consisting of letters and digits.
FaultLevel	Handling policy for the highest-level fault. NotHandle: no handling is required; SubHealth: handling is determined by the configured policy; Reset: isolates the node; Separate: isolates the node.
UpdateTime	Time when the ConfigMap was updated.
NodeStatus	Status of the current node. Healthy: the node is healthy; SubHealthy: the node is pre-isolated, so running jobs are not handled and no new jobs are scheduled to the node; UnHealthy: the node is unhealthy, so it is isolated and its jobs are rescheduled.
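
The following Python sketch shows one way a monitoring script might turn an interconnect device record into a readable line. It assumes that UpdateTime is a Unix timestamp in seconds, which the sample value suggests but the table does not state; the function name and the wording of the actions are hypothetical paraphrases of Table 3.

```python
from datetime import datetime, timezone

# Scheduling consequence of each NodeStatus value, paraphrased from Table 3.
ACTIONS = {
    "Healthy": "schedule normally",
    "SubHealthy": "leave running jobs alone, schedule no new jobs",
    "UnHealthy": "isolate node and reschedule jobs",
}


def describe_switch_record(record):
    """Format one record as '<UTC time> <NodeStatus>: <action>'."""
    # Assumption: UpdateTime counts seconds since the Unix epoch.
    updated = datetime.fromtimestamp(record["UpdateTime"], tz=timezone.utc)
    status = record["NodeStatus"]
    action = ACTIONS.get(status, "unknown status")
    return f"{updated.isoformat()} {status}: {action}"
```

The record passed in is expected to be a dictionary with the keys shown in the sample output, already parsed by the caller.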
