ClusterD

ClusterD collects node faults, processor faults, and interconnect device faults inside the cluster and publishes them in Kubernetes ConfigMaps for external query and use.

Node Faults

Query command: kubectl describe cm -n mindx-dl cluster-info-node-cm

The following uses an Atlas A3 training product as an example to illustrate the query result. Parameters in the command output may vary according to device types. For details about the key parameters, see Table 1.

{"mindx-dl-nodeinfo-kwok-node-0":{"FaultDevList":[],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-kwok-node-0"},"mindx-dl-deviceinfo-kwok-node-1001":{"FaultDevList":[],"NodeStatus":"Healthy","CmName":"mindx-dl-nodeinfo-kwok-node-1001"}}
Table 1 Node fault parameters

Parameter

Description

mindx-dl-nodeinfo-<kwok-node-0>

The prefix is fixed to mindx-dl-nodeinfo. kwok-node-0 is the node name, which helps locate the faulty node.

NodeInfo

Node fault information

FaultDevList

List of faulty devices on a node

- DeviceType

Faulty device type

- DeviceId

ID of the faulty device

- FaultCode

Fault code, a hexadecimal string consisting of letters and digits.

- FaultLevel

Fault handling level

  • NotHandleFault: No handling is required.
  • PreSeparateFault: If a job is running on the node, the fault is not handled, but no new jobs are scheduled to the node.
  • SeparateFault: The job is rescheduled.

NodeStatus

Node health status, which is determined by the device with the highest fault handling level on the node.

  • Healthy: The fault handling level on the node does not exceed NotHandleFault. The node is healthy and training can run on it normally. If the fault handling level is PreSeparateFault and NPUs on the node are in use, the node is still deemed healthy. After the job completes, the node becomes a faulty node.
  • UnHealthy: The fault handling level on the node is SeparateFault. The node is faulty and affects the training job, so migrate the job off the node immediately. If the fault handling level is PreSeparateFault and no NPU is in use, the node is also deemed faulty and no jobs can be scheduled to it.
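
The fields in Table 1 can be read programmatically once the ConfigMap data has been retrieved. The following is a minimal sketch, assuming the payload is the JSON string shown in the query result; the function names summarize_node_faults and unhealthy_nodes are hypothetical helpers, not part of ClusterD.

```python
import json

# One entry copied from the sample query result above.
NODE_INFO_JSON = (
    '{"mindx-dl-nodeinfo-kwok-node-0":'
    '{"FaultDevList":[],"NodeStatus":"Healthy",'
    '"CmName":"mindx-dl-nodeinfo-kwok-node-0"}}'
)

# Fixed key prefix described in Table 1.
PREFIX = "mindx-dl-nodeinfo-"


def summarize_node_faults(payload):
    """Map node name -> (NodeStatus, FaultDevList) for every entry."""
    summary = {}
    for key, info in json.loads(payload).items():
        # Strip the fixed prefix to recover the node name.
        node = key[len(PREFIX):] if key.startswith(PREFIX) else key
        summary[node] = (info.get("NodeStatus"), info.get("FaultDevList", []))
    return summary


def unhealthy_nodes(payload):
    """Return the sorted names of nodes whose NodeStatus is not Healthy."""
    summary = summarize_node_faults(payload)
    return sorted(node for node, (status, _) in summary.items()
                  if status != "Healthy")


if __name__ == "__main__":
    print(summarize_node_faults(NODE_INFO_JSON))
    print(unhealthy_nodes(NODE_INFO_JSON))
```

With the sample entry, no node is reported as unhealthy because its NodeStatus is Healthy and its FaultDevList is empty.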

Processor Faults

Query command: kubectl describe cm -n mindx-dl cluster-info-device-${m}

m is an integer starting from 0. For every 1000 nodes added to the cluster, a new ConfigMap named cluster-info-device-${m} is created.
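
To query every processor fault ConfigMap, the names can be enumerated from the cluster size. The following is a rough sketch, assuming that the ConfigMap with m = 0 always exists and that each ConfigMap covers up to 1000 nodes; the exact boundary behavior is an assumption, and expected_configmap_names is a hypothetical helper.

```python
import math


def expected_configmap_names(node_count, nodes_per_configmap=1000,
                             prefix="cluster-info-device-"):
    """List the ConfigMap names expected for a cluster of node_count nodes.

    Assumption: m = 0 always exists, and one more ConfigMap is added for
    each further block of nodes_per_configmap nodes.
    """
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    count = max(1, math.ceil(node_count / nodes_per_configmap))
    return [f"{prefix}{m}" for m in range(count)]


if __name__ == "__main__":
    for name in expected_configmap_names(2500):
        # Each name can be passed to: kubectl describe cm -n mindx-dl <name>
        print(name)
```

Under these assumptions, a cluster of 2500 nodes yields the names with m = 0, 1, and 2.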

The following uses an Atlas A3 training product as an example to illustrate the query result. The displayed parameters may vary according to device types. For details about the key parameters, see Table 2.

{"mindx-dl-deviceinfo-kwok-node-0":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1693899390,"CmName":"mindx-dl-deviceinfo-kwok-node-0","SuperPodID":0,"ServerIndex":0},"mindx-dl-deviceinfo-kwok-node-1001":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1693899390,"CmName":"mindx-dl-deviceinfo-kwok-node-1001","SuperPodID":0,"ServerIndex":0}}
Table 2 Parameters of cluster-info-device-${m}

Parameter

Description

mindx-dl-deviceinfo-<kwok-node-0>

The prefix is fixed to mindx-dl-deviceinfo. kwok-node-0 is the node name, which helps locate the faulty node.

huawei.com/Ascend910

Names of the available processors on the current node. Multiple processors are separated by commas (,).

huawei.com/Ascend910-NetworkUnhealthy

Names of the processors with an unhealthy network on the current node. Multiple processors are separated by commas (,).

huawei.com/Ascend910-Unhealthy

Names of the unhealthy processors on the current node. Multiple processors are separated by commas (,).

huawei.com/Ascend910-Fault

Array object, including fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, fault_code, and fault_time_and_level_map.

- fault_type

Fault type.

  • CardUnhealthy: processor fault
  • CardNetworkUnhealthy: parameter plane network fault (processor network fault)
  • NodeUnhealthy: node fault
  • PublicFault: public fault

- npu_name

Name of the faulty processor. This parameter is left empty if the node is faulty.

- large_model_fault_level

Fault handling type. This parameter is left empty for node faults.

  • NotHandleFault: requires no handling.
  • RestartRequest: re-executes inference requests in the inference scenario, or re-executes training requests in the training scenario.
  • RestartBusiness: re-executes services.
  • FreeRestartNPU: resets an idle processor when faults affect service execution.
  • RestartNPU: directly resets processors and re-executes services.
  • SeparateNPU: isolates processors.
  • PreSeparateNPU: pre-isolates processors and determines whether to perform rescheduling based on the actual running status of the training job.
NOTE:
  • large_model_fault_level, fault_level, and fault_handling serve the same function. Using fault_handling is recommended.
  • When an inference job subscribes to fault information, a RestartRequest fault on its inference card will not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled.

- fault_level

- fault_handling

- fault_code

Fault codes, given as a string in which multiple codes are separated by commas (,).

- fault_time_and_level_map

Fault code, fault occurrence time, and fault handling level.

SuperPodID

SuperPoD ID

ServerIndex

Relative position of the current node in a SuperPoD

NOTE:
  • When SuperPodID or ServerIndex reported by the driver is 0xffffffff, the corresponding value of SuperPodID or ServerIndex is -1.
  • The value of SuperPodID or ServerIndex is -2 in the following situations:
    • The current device does not support the query of SuperPoD information.
    • The SuperPoD information fails to be obtained due to a driver problem.
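
The comma-separated processor lists and the SuperPodID and ServerIndex special values from Table 2 can be decoded as follows. This is a minimal sketch, assuming the payload is the JSON string shown in the query result; the sample is abridged to one node with two processors, and split_names, super_pod_note, and parse_device_info are hypothetical helper names.

```python
import json

# Abridged from the sample query result above: one node, two processors.
DEVICE_INFO_JSON = """
{"mindx-dl-deviceinfo-kwok-node-0": {
    "DeviceList": {
        "huawei.com/Ascend910": "Ascend910-0,Ascend910-1",
        "huawei.com/Ascend910-NetworkUnhealthy": "",
        "huawei.com/Ascend910-Unhealthy": ""
    },
    "UpdateTime": 1693899390,
    "CmName": "mindx-dl-deviceinfo-kwok-node-0",
    "SuperPodID": 0,
    "ServerIndex": 0
}}
"""

# Resource name used in the sample; it may differ for other device types.
RESOURCE = "huawei.com/Ascend910"


def split_names(value):
    """Split a comma-separated processor list; an empty string yields []."""
    return [name for name in value.split(",") if name]


def super_pod_note(value):
    """Explain the special values -1 and -2 described in the NOTE above."""
    if value == -1:
        return "driver reported 0xffffffff"
    if value == -2:
        return "SuperPoD query unsupported or failed"
    return "valid"


def parse_device_info(payload):
    """Return per-node processor lists and SuperPoD notes."""
    parsed = {}
    for cm_name, info in json.loads(payload).items():
        devices = info.get("DeviceList", {})
        parsed[cm_name] = {
            "available": split_names(devices.get(RESOURCE, "")),
            "network_unhealthy": split_names(
                devices.get(RESOURCE + "-NetworkUnhealthy", "")),
            "unhealthy": split_names(devices.get(RESOURCE + "-Unhealthy", "")),
            "super_pod_id": super_pod_note(info.get("SuperPodID", -2)),
            "server_index": super_pod_note(info.get("ServerIndex", -2)),
        }
    return parsed


if __name__ == "__main__":
    print(parse_device_info(DEVICE_INFO_JSON))
```

Treating a missing SuperPodID or ServerIndex key as -2 is a choice made for this sketch only; the source does not state what is reported when the key is absent.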

Interconnect Device Faults

Query command: kubectl describe cm -n mindx-dl cluster-info-switch-${m}

m is an integer starting from 0. For every 2000 nodes added to the cluster, a new ConfigMap named cluster-info-switch-${m} is created.

The following uses an Atlas A3 training product as an example to illustrate the query result. Parameters in the command output may vary according to device types. For details about the key parameters, see Table 3.

{"FaultCode":[000001c1],"FaultLevel":"NotHandle","UpdateTime":1722845555,"NodeStatus":"Healthy"}
Table 3 Parameters of interconnect device faults

Parameter

Description

FaultCode

Fault code, a hexadecimal string consisting of letters and digits.

FaultLevel

Policy for handling the fault at the highest level.

  • NotHandle: requires no handling.
  • SubHealth: the handling is determined based on the configured policy.
  • Reset: isolates a node.
  • Separate: isolates a node.

UpdateTime

Time when the ConfigMap was updated.

NodeStatus

Status of the current node.

  • Healthy: The node is healthy.
  • SubHealthy: The node is pre-isolated. Jobs already running on the node are not handled, and no new jobs are scheduled to the node.
  • UnHealthy: The node is unhealthy. Isolate the node and reschedule jobs.
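
The parameters in Table 3 can be turned into a readable summary as sketched below. Two assumptions apply: UpdateTime is treated as a Unix timestamp in seconds, which the source does not state explicitly, and the fault code is written as a quoted string so that the record is a valid Python literal. describe_switch_fault is a hypothetical helper name.

```python
from datetime import datetime, timezone

# FaultLevel values and their meanings, taken from Table 3.
FAULT_LEVEL_MEANING = {
    "NotHandle": "requires no handling",
    "SubHealth": "handling is determined by the configured policy",
    "Reset": "isolates a node",
    "Separate": "isolates a node",
}

# Sample record from the query result above (fault code quoted, see lead-in).
SAMPLE_RECORD = {
    "FaultCode": ["000001c1"],
    "FaultLevel": "NotHandle",
    "UpdateTime": 1722845555,
    "NodeStatus": "Healthy",
}


def describe_switch_fault(record):
    """Build a one-line summary of an interconnect device fault record."""
    level = record.get("FaultLevel", "")
    meaning = FAULT_LEVEL_MEANING.get(level, "unknown level")
    # Assumption: UpdateTime is seconds since the Unix epoch.
    updated = datetime.fromtimestamp(record["UpdateTime"], tz=timezone.utc)
    return (f"{record.get('NodeStatus')}: {level} ({meaning}), "
            f"updated {updated.isoformat()}")


if __name__ == "__main__":
    print(describe_switch_fault(SAMPLE_RECORD))
```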