Cluster Resources

ConfigMap Description

After ClusterD is started, the following ConfigMaps are created:

  • cluster-info-node-cm (see Table 1).
  • cluster-info-device-${m} (see Table 2), where m is an integer starting from 0. One more ConfigMap named cluster-info-device-${m} is created for every 1000 nodes added to the cluster.
  • cluster-info-switch-${x} (see Table 3), where x is an integer starting from 0. One more ConfigMap named cluster-info-switch-${x} is created for every 2000 nodes added to the cluster.
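
The naming rule above can be sketched as follows. This is an illustration only, not ClusterD code: the helper name expected_configmaps is hypothetical, and the sketch assumes that index 0 exists as soon as ClusterD starts and that index k appears once the cluster holds more than k * 1000 (device) or k * 2000 (switch) nodes. The exact boundary behavior is not stated here.

```python
import math

DEVICE_SHARD_SIZE = 1000  # nodes per cluster-info-device-${m}, from the list above
SWITCH_SHARD_SIZE = 2000  # nodes per cluster-info-switch-${x}, from the list above


def expected_configmaps(node_count: int) -> list[str]:
    """Return the ConfigMap names expected for a cluster of node_count nodes.

    Assumption (not from the documentation): shard 0 always exists, and
    shard k appears once the cluster holds more than k * shard_size nodes.
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    device_shards = max(1, math.ceil(node_count / DEVICE_SHARD_SIZE))
    switch_shards = max(1, math.ceil(node_count / SWITCH_SHARD_SIZE))
    names = ["cluster-info-node-cm"]
    names += [f"cluster-info-device-{m}" for m in range(device_shards)]
    names += [f"cluster-info-switch-{x}" for x in range(switch_shards)]
    return names


# Under the stated assumption, 2500 nodes give three device shards and two switch shards.
print(expected_configmaps(2500))
```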
Table 1 cluster-info-node-cm

Parameter

Description

mindx-dl-nodeinfo-<kwok-node-0>

The prefix is fixed as mindx-dl-nodeinfo. kwok-node-0 is the node name, which is used to locate the faulty node.

NodeInfo

Node fault information

FaultDevList

List of faulty devices on a node

- DeviceType

Faulty device type

- DeviceId

ID of the faulty device

- FaultCode

Fault code, a hexadecimal string consisting of letters and digits.

- FaultLevel

Fault handling level.

  • NotHandleFault: requires no handling.
  • PreSeparateFault: if a job is running on the node, the fault is not handled, but no new job is scheduled to the node.
  • SeparateFault: jobs are rescheduled.

NodeStatus

Node health status, which is determined by the device with the highest fault handling level on the node.

  • Healthy: The fault handling level on the node does not exceed NotHandleFault. The node is healthy and training can run on it normally. If the fault handling level on the node is PreSeparateFault and NPUs on the node are in use, the node is still deemed healthy. After the job is complete, the node becomes a faulty node.
  • UnHealthy: The fault handling level on the node is SeparateFault. The node is faulty and affects training, so jobs should be migrated off the node immediately. If the fault handling level on the node is PreSeparateFault and no NPU is in use, the node is also a faulty node, and no other jobs can be scheduled to it.
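
The NodeStatus rules above can be expressed as a small decision function. This is a minimal sketch, not the ClusterD implementation: the function name node_status and the npus_in_use flag are hypothetical, and only the three FaultLevel values listed in Table 1 are assumed to occur.

```python
def node_status(fault_levels: list[str], npus_in_use: bool) -> str:
    """Derive NodeStatus from the FaultLevel values found in FaultDevList.

    fault_levels: FaultLevel of every faulty device on the node.
    npus_in_use:  hypothetical flag, True while a job occupies NPUs on the node.
    """
    if "SeparateFault" in fault_levels:
        # Highest handling level: the node is faulty regardless of usage.
        return "UnHealthy"
    if "PreSeparateFault" in fault_levels:
        # Deemed healthy only while a running job still uses NPUs on the node.
        return "Healthy" if npus_in_use else "UnHealthy"
    # Only NotHandleFault entries, or no faults at all.
    return "Healthy"


print(node_status(["NotHandleFault", "PreSeparateFault"], npus_in_use=True))
print(node_status(["NotHandleFault", "PreSeparateFault"], npus_in_use=False))
```

The two calls show the same fault list producing different results depending on whether NPUs are in use, which matches the transition described for PreSeparateFault once the job completes.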
Table 2 cluster-info-device-${m}

Parameter

Description

mindx-dl-deviceinfo-<kwok-node-0>

The prefix is fixed to mindx-dl-deviceinfo, and kwok-node-0 is the node name for locating the faulty node.

huawei.com/Ascend910

Name of the available processor on the current node. If there are multiple processors, use commas (,) to separate them.

huawei.com/Ascend910-NetworkUnhealthy

Name of the processor with unhealthy network on the current node. If there are multiple processors, use commas (,) to separate them.

huawei.com/Ascend910-Unhealthy

Name of the unhealthy processor on the current node. If there are multiple processors, use commas (,) to separate them.

huawei.com/Ascend910-Fault

Array object, including fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, fault_code, and fault_time_and_level_map.

- fault_type

Fault type.

  • CardUnhealthy: processor fault
  • CardNetworkUnhealthy: parameter plane network fault (processor network fault)
  • NodeUnhealthy: node fault
  • PublicFault: public fault

- npu_name

Name of the faulty processor. This parameter is left empty if the node is faulty.

- large_model_fault_level

Fault handling type. This parameter is left empty for node faults.

  • NotHandleFault: requires no handling.
  • RestartRequest: re-executes inference requests in the inference scenario, or re-executes training requests in the training scenario.
  • RestartBusiness: re-executes services.
  • FreeRestartNPU: resets an idle processor when faults affect service execution.
  • RestartNPU: directly resets processors and re-executes services.
  • SeparateNPU: isolates processors.
  • PreSeparateNPU: pre-isolates processors and determines whether to perform rescheduling based on the actual running status of the training job.
NOTE:
  • large_model_fault_level, fault_level, and fault_handling serve the same function. fault_handling is recommended.
  • When an inference job subscribes to fault information, a RestartRequest fault on its inference card will not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled.

- fault_level

- fault_handling

- fault_code

Fault codes, given as a single string in which the codes are separated by commas (,).

- fault_time_and_level_map

Fault code, fault occurrence time, and fault handling level.

SuperPodID

SuperPoD ID

ServerIndex

Relative position of the current node in a SuperPoD

NOTE:
  • When SuperPodID or ServerIndex reported by the driver is 0xffffffff, the corresponding value of SuperPodID or ServerIndex is -1.
  • The value of SuperPodID or ServerIndex is -2 in the following situations:
    • The current device does not support the query of SuperPoD information.
    • The SuperPoD information fails to be obtained due to a driver problem.
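
The sentinel values in the note above can be handled as in the following sketch. It is illustrative only: the function names are hypothetical, and the sketch assumes that the producer sees None when the query is unsupported or fails.

```python
from typing import Optional

DRIVER_INVALID = 0xFFFFFFFF  # raw value reported by the driver
VALUE_INVALID = -1           # published when the driver reports 0xffffffff
VALUE_UNAVAILABLE = -2       # published when the query is unsupported or fails


def normalize_superpod_field(raw: Optional[int]) -> int:
    """Map a raw driver value to the SuperPodID or ServerIndex value in the ConfigMap.

    Assumption: raw is None when the device does not support the query or
    the driver call failed.
    """
    if raw is None:
        return VALUE_UNAVAILABLE
    if raw == DRIVER_INVALID:
        return VALUE_INVALID
    return raw


def is_usable(value: int) -> bool:
    """Consumer-side check: -1 and -2 carry no position information."""
    return value not in (VALUE_INVALID, VALUE_UNAVAILABLE)


print(normalize_superpod_field(0xFFFFFFFF), normalize_superpod_field(None), normalize_superpod_field(3))
```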
Table 3 cluster-info-switch-${x}

Parameter

Description

FaultCode

List of interconnect device fault codes on the current node.

Array object, including the EventType, AssembledFaultCode, PeerPortDevice, PeerPortId, SwitchChipId, SwitchPortId, Severity, Assertion, and AlarmRaisedTime fields.

-EventType

Alarm ID

-AssembledFaultCode

Fault code

-PeerPortDevice

Type of the interconnected device

  • 0: CPU
  • 1: NPU
  • 2: SW
  • 0xFFFF: N/A

-PeerPortId

ID of the interconnected device

-SwitchChipId

ID of the faulty UnifiedBus chip; numbered from 0.

-SwitchPortId

ID of the faulty UnifiedBus port; numbered from 0.

-Severity

Fault severity

  • 0: warning
  • 1: minor
  • 2: major
  • 3: critical

-Assertion

Event type

  • 0: fault recovered
  • 1: fault occurred
  • 2: notification event

-AlarmRaisedTime

Time when a fault or event occurs

FaultLevel

Fault handling level of the current node.

The highest fault level in FaultCode is used. The options are NotHandle, SubHealthFault, Separate, and RestartRequest.

UpdateTime

Update time when a fault is reported.

NodeStatus

Health status of the current node.

Determined by the value of FaultLevel:

  • NotHandle: Healthy
  • SubHealthFault: SubHealthy
  • Separate: UnHealthy
  • RestartRequest: UnHealthy

FaultTimeAndLevelMap

List of fault occurrence time and fault handling levels.

Array object, including fault code, faulty UnifiedBus chip ID, faulty UnifiedBus port ID, fault_time, and fault_level. The key consists of the fault code, faulty UnifiedBus chip ID, and faulty UnifiedBus port ID, which are connected by underscores (_).

-fault_time

Time when a fault occurs

-fault_level

Fault handling level
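
The enumerations and the key format in Table 3 can be decoded as shown below. This is a sketch under assumptions: the numeric fields are assumed to arrive as integers, the fault code used in the FaultTimeAndLevelMap key is assumed to be AssembledFaultCode, and the sample values (including the fault code 0x1234) are made up for illustration.

```python
PEER_PORT_DEVICE = {0: "CPU", 1: "NPU", 2: "SW", 0xFFFF: "N/A"}
SEVERITY = {0: "warning", 1: "minor", 2: "major", 3: "critical"}
ASSERTION = {0: "fault recovered", 1: "fault occurred", 2: "notification event"}
NODE_STATUS_BY_LEVEL = {
    "NotHandle": "Healthy",
    "SubHealthFault": "SubHealthy",
    "Separate": "UnHealthy",
    "RestartRequest": "UnHealthy",
}


def fault_map_key(fault_code: str, chip_id: int, port_id: int) -> str:
    """Build a FaultTimeAndLevelMap key: code, chip ID, and port ID joined by '_'."""
    return f"{fault_code}_{chip_id}_{port_id}"


def describe_switch_fault(entry: dict) -> dict:
    """Translate the numeric fields of one FaultCode entry into readable text."""
    return {
        "peer": PEER_PORT_DEVICE.get(entry["PeerPortDevice"], "unknown"),
        "severity": SEVERITY.get(entry["Severity"], "unknown"),
        "assertion": ASSERTION.get(entry["Assertion"], "unknown"),
        "key": fault_map_key(
            entry["AssembledFaultCode"], entry["SwitchChipId"], entry["SwitchPortId"]
        ),
    }


# Hypothetical entry; every value here is invented for the example.
sample = {
    "AssembledFaultCode": "0x1234",
    "PeerPortDevice": 1,
    "SwitchChipId": 0,
    "SwitchPortId": 3,
    "Severity": 2,
    "Assertion": 1,
}
print(describe_switch_fault(sample))
print(NODE_STATUS_BY_LEVEL["SubHealthFault"])
```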

statistic-fault-info

This ConfigMap is located in the cluster-system namespace created by the user and is labeled mc-statistic-fault=true. It displays the public fault information of a cluster.

Table 4 Data description

Parameter

Description

PublicFaults

Public fault details. If the number of faults is too large, this field is not updated. For details about the following fields, see Table 1.

-<node name>

Faulty node.

-resource

Fault sender.

The default values are CCAE, fd-online, pingmesh, and Netmind.

-devIds

Physical ID of the faulty processor.

-faultId

Fault instance ID.

-type

Fault type.

  • NPU: processor fault
  • Node: node fault
  • Network: network fault
  • Storage: storage fault

-faultCode

Fault code.

-level

Fault level.

  • NotHandleFault: no handling is required.
  • SubHealthFault: sub-health.
  • SeparateNPU: the fault cannot be rectified, and the corresponding processors need to be isolated.
  • PreSeparateNPU: services are not affected for the time being, but no jobs will be scheduled to the processor.

-faultTime

Time when a fault occurs.

FaultNum

Number of faults.

-publicFaultNum

Total number of public faults on all nodes.

Description

Message displayed when the number of public faults is too large.

NOTE:
  • Up to 1 MB of public fault data is displayed, which corresponds to approximately 4,500 records.
  • If the number of records exceeds 4,500, some data is not displayed, and the Description field is added to the ConfigMap as a prompt. The internal cache is not affected and continues to work properly.
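
A consumer of this ConfigMap may want to count the listed faults by type and detect truncation. The sketch below assumes a decoded layout that this document does not spell out: PublicFaults as a mapping from node name to a list of fault records, FaultNum as a mapping that contains publicFaultNum, and Description present only when data is omitted. The function name and the sample values are hypothetical.

```python
from collections import Counter


def summarize_public_faults(data: dict) -> dict:
    """Count the listed public faults by type and report whether data was omitted."""
    faults_by_node = data.get("PublicFaults") or {}
    by_type = Counter(
        fault["type"] for faults in faults_by_node.values() for fault in faults
    )
    listed = sum(by_type.values())
    # Fall back to the listed count if FaultNum is absent (assumed layout).
    total = data.get("FaultNum", {}).get("publicFaultNum", listed)
    return {
        "by_type": dict(by_type),
        "listed": listed,
        "total": total,
        "truncated": "Description" in data,
    }


# Invented sample data that follows the assumed layout.
sample = {
    "PublicFaults": {
        "node-a": [
            {"resource": "CCAE", "type": "NPU", "level": "SeparateNPU"},
            {"resource": "pingmesh", "type": "Network", "level": "SubHealthFault"},
        ],
        "node-b": [{"resource": "CCAE", "type": "NPU", "level": "PreSeparateNPU"}],
    },
    "FaultNum": {"publicFaultNum": 3},
}
print(summarize_public_faults(sample))
```

Checking for the Description field, rather than comparing counts, follows the note above: the field is the documented signal that some records are not displayed.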

cluster-system super-pod-<super-pod-id>

This ConfigMap is located in the cluster-system namespace created by the user and is labeled app=pingmesh.

Table 5 cluster-system super-pod-<super-pod-id>

Parameter

Description

app

Label key required by NodeD to identify the ConfigMap. The value is pingmesh.

superPodDevice

Key of the SuperPoD information.

SuperPodID

SuperPoD ID.

NodeDeviceMap

Information about all nodes in a SuperPoD.

NodeName

Node name.

DeviceMap

Information about all NPUs on a node, displayed in the format of "physicID:superDeviceID".
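
The "physicID:superDeviceID" format of DeviceMap can be parsed as in the following sketch. The serialized layout of NodeDeviceMap is not given in this document, so the sketch assumes it has already been decoded into a mapping from node name to a list of "physicID:superDeviceID" strings. The function names and sample IDs are hypothetical.

```python
def parse_device_entry(entry: str) -> tuple[str, str]:
    """Split one 'physicID:superDeviceID' string into its two parts."""
    physic_id, sep, super_device_id = (p.strip() for p in entry.partition(":"))
    if not sep or not physic_id or not super_device_id:
        raise ValueError(f"expected 'physicID:superDeviceID', got {entry!r}")
    return physic_id, super_device_id


def index_by_super_device(
    node_device_map: dict[str, list[str]],
) -> dict[str, tuple[str, str]]:
    """Build a lookup from superDeviceID to (node name, physicID)."""
    index: dict[str, tuple[str, str]] = {}
    for node_name, entries in node_device_map.items():
        for entry in entries:
            physic_id, super_device_id = parse_device_entry(entry)
            index[super_device_id] = (node_name, physic_id)
    return index


# Invented node names and IDs, following the assumed decoded layout.
sample = {"node-a": ["0:100", "1:101"], "node-b": ["0:200"]}
print(index_by_super_device(sample))
```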

fault-job-info

This ConfigMap is in the cluster-system namespace created by the user. It is used to display information about the faulty jobs that require forcible release of communication resources in a cluster. It takes effect only when process-level rescheduling is performed on the Atlas 900 A3 SuperPoD.

Table 6 fault-job-info

Parameter

Description

Value

SdIds

SDID of the faulty card.

String sequence

NodeNames

Name of the node that requires forcible release of resources.

String sequence

FaultTimes

Time when a fault occurred.

64-bit integer

JobId

Job UID.

String
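
The value types in Table 6 can be checked on the consumer side as sketched below. This assumes that one fault-job-info record is available as a JSON object keyed by the parameter names in Table 6. The class name, the JSON encoding, and the sample values are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


@dataclass(frozen=True)
class FaultJobInfo:
    job_id: str                   # JobId: job UID
    sd_ids: tuple[str, ...]       # SdIds: SDIDs of the faulty cards
    node_names: tuple[str, ...]   # NodeNames: nodes whose resources must be released
    fault_times: int              # FaultTimes: 64-bit integer

    @classmethod
    def from_json(cls, text: str) -> "FaultJobInfo":
        """Parse one record and verify that FaultTimes is a 64-bit integer."""
        raw = json.loads(text)
        fault_times = raw["FaultTimes"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(fault_times, bool) or not isinstance(fault_times, int):
            raise TypeError("FaultTimes must be an integer")
        if not INT64_MIN <= fault_times <= INT64_MAX:
            raise ValueError("FaultTimes does not fit in 64 bits")
        return cls(
            job_id=str(raw["JobId"]),
            sd_ids=tuple(str(s) for s in raw["SdIds"]),
            node_names=tuple(str(n) for n in raw["NodeNames"]),
            fault_times=fault_times,
        )


# Invented sample record.
sample = '{"JobId": "job-uid-1", "SdIds": ["100", "101"], "NodeNames": ["node-a"], "FaultTimes": 1700000000}'
print(FaultJobInfo.from_json(sample))
```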