Cluster Resources

ConfigMap Description

After ClusterD is started, the following ConfigMaps are created:

cluster-info-node-cm (see Table 1).
cluster-info-device-${m} (see Table 2). m is an integer starting from 0. One more cluster-info-device-${m} ConfigMap is created for every 1000 nodes added to the cluster.
cluster-info-switch-${x} (see Table 3). x is an integer starting from 0. One more cluster-info-switch-${x} ConfigMap is created for every 2000 nodes added to the cluster.
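
The naming scheme above can be sketched as follows. This is an illustration only: it assumes that each device ConfigMap covers up to 1000 nodes, that each switch ConfigMap covers up to 2000 nodes, and that at least one ConfigMap of each kind exists once ClusterD is running. The function name `expected_configmap_names` is hypothetical and is not part of ClusterD.

```python
import math


def expected_configmap_names(node_count: int) -> list[str]:
    """Return the ConfigMap names expected for a cluster of node_count nodes."""
    # Assumption: one device ConfigMap per 1000 nodes and one switch
    # ConfigMap per 2000 nodes, rounded up, with a minimum of one each.
    device_shards = max(1, math.ceil(node_count / 1000))
    switch_shards = max(1, math.ceil(node_count / 2000))

    names = ["cluster-info-node-cm"]
    names += [f"cluster-info-device-{m}" for m in range(device_shards)]
    names += [f"cluster-info-switch-{x}" for x in range(switch_shards)]
    return names


if __name__ == "__main__":
    for name in expected_configmap_names(2500):
        print(name)
```

Under these assumptions, a cluster of 2500 nodes would have three device ConfigMaps (indexes 0 to 2) and two switch ConfigMaps (indexes 0 and 1).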

**Table 1** cluster-info-node-cm
Parameter	Description
mindx-dl-nodeinfo-<kwok-node-0>	The prefix is fixed to mindx-dl-nodeinfo. kwok-node-0 is the node name, which is used to locate the faulty node.
NodeInfo	Node fault information
FaultDevList	List of faulty devices on a node
- DeviceType	Faulty device type
- DeviceId	ID of the faulty device
- FaultCode	Fault code, a hexadecimal string consisting of letters and digits.
- FaultLevel	Fault handling level. NotHandleFault: no handling is required. PreSeparateFault: if a job is running on the node, the fault is not handled, but no new jobs are scheduled to the node. SeparateFault: jobs are rescheduled.
NodeStatus	Node health status, which is determined by the device with the highest fault handling level on the node. Healthy: the fault handling level on the node does not exceed NotHandleFault, so the node is healthy and training can run normally. A node whose fault handling level is PreSeparateFault is also deemed healthy while its NPUs are in use; it becomes a faulty node after the job is complete. UnHealthy: the fault handling level on the node is SeparateFault, so the node is faulty and training is affected; jobs should be migrated off the node immediately. A node whose fault handling level is PreSeparateFault with no NPU in use is also a faulty node, and no jobs can be scheduled to it.
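
The NodeStatus rules in Table 1 can be summarized with a minimal sketch. The function name `node_status` and the `npus_in_use` flag are hypothetical. The severity ordering NotHandleFault < PreSeparateFault < SeparateFault is inferred from the descriptions above and is not stated as an explicit ranking in the table.

```python
# Fault handling levels from Table 1, ordered from least to most severe
# (the ordering is an assumption inferred from the table text).
FAULT_LEVEL_RANK = {
    "NotHandleFault": 0,
    "PreSeparateFault": 1,
    "SeparateFault": 2,
}


def node_status(fault_levels: list[str], npus_in_use: bool) -> str:
    """Derive NodeStatus from the FaultLevel values of the devices on a node."""
    # A node with no reported faults is treated like NotHandleFault.
    highest = max((FAULT_LEVEL_RANK[level] for level in fault_levels), default=0)

    if highest == FAULT_LEVEL_RANK["SeparateFault"]:
        return "UnHealthy"
    if highest == FAULT_LEVEL_RANK["PreSeparateFault"]:
        # Healthy only while NPUs on the node are still in use.
        return "Healthy" if npus_in_use else "UnHealthy"
    return "Healthy"


if __name__ == "__main__":
    print(node_status(["NotHandleFault", "PreSeparateFault"], npus_in_use=True))
    print(node_status(["PreSeparateFault"], npus_in_use=False))
```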

**Table 2** cluster-info-device-${m}
Parameter	Description
mindx-dl-deviceinfo-<kwok-node-0>	The prefix is fixed to mindx-dl-deviceinfo. kwok-node-0 is the node name, which is used to locate the faulty node.
huawei.com/Ascend910	Names of the available processors on the current node. Multiple processors are separated by commas (,).
huawei.com/Ascend910-NetworkUnhealthy	Names of the processors with an unhealthy network on the current node. Multiple processors are separated by commas (,).
huawei.com/Ascend910-Unhealthy	Names of the unhealthy processors on the current node. Multiple processors are separated by commas (,).
huawei.com/Ascend910-Fault	Array object, including fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, fault_code, and fault_time_and_level_map.
- fault_type	Fault type. CardUnhealthy: processor fault. CardNetworkUnhealthy: parameter plane network fault (processor network fault). NodeUnhealthy: node fault. PublicFault: public fault.
- npu_name	Name of the faulty processor. This parameter is left empty for node faults.
- large_model_fault_level	Fault handling type. This parameter is left empty for node faults. NotHandleFault: no handling is required. RestartRequest: the inference request is re-executed in inference scenarios, or the training request is re-executed in training scenarios. RestartBusiness: the service is re-executed. FreeRestartNPU: the fault affects service execution, and the processor is reset when it is idle. RestartNPU: the processor is reset directly and the service is re-executed. SeparateNPU: the processor is isolated. PreSeparateNPU: the processor is pre-isolated, and whether to reschedule is determined by the actual running status of the training job. NOTE: large_model_fault_level, fault_level, and fault_handling have the same function; fault_handling is recommended. When an inference job subscribes to fault information, a RestartRequest fault on its inference card does not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled.
- fault_level	Same as large_model_fault_level.
- fault_handling	Same as large_model_fault_level.
- fault_code	Fault code, a string of characters separated by commas (,).
- fault_time_and_level_map	Fault code, fault occurrence time, and fault handling level.
SuperPodID	SuperPoD ID
ServerIndex	Relative position of the current node in a SuperPoD. NOTE: If the driver reports 0xffffffff for SuperPodID or ServerIndex, the corresponding value is -1. The value of SuperPodID or ServerIndex is -2 if the current device does not support querying SuperPoD information, or if the SuperPoD information cannot be obtained because of a driver problem.
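
Two of the rules in Table 2 lend themselves to a short sketch: the 60-second threshold for RestartRequest faults on inference cards, and the special values -1 and -2 of SuperPodID and ServerIndex. The function names and the returned strings below are hypothetical labels chosen for illustration; they are not values produced by ClusterD.

```python
# Threshold taken from the NOTE on large_model_fault_level in Table 2.
RESTART_REQUEST_THRESHOLD_SECONDS = 60


def inference_restart_request_action(fault_duration_seconds: float) -> str:
    """Outcome of a RestartRequest fault on an inference card.

    Applies when the inference job has subscribed to fault information.
    """
    if fault_duration_seconds <= RESTART_REQUEST_THRESHOLD_SECONDS:
        return "no_rescheduling"
    # Longer than 60 seconds: the processor is isolated and the job rescheduled.
    return "isolate_and_reschedule"


def describe_superpod_value(value: int) -> str:
    """Explain a SuperPodID or ServerIndex value according to Table 2."""
    if value == -1:
        return "driver reported 0xffffffff"
    if value == -2:
        return "SuperPoD query not supported, or query failed due to a driver problem"
    return "valid value"


if __name__ == "__main__":
    print(inference_restart_request_action(45))
    print(describe_superpod_value(-2))
```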

**Table 3** cluster-info-switch-${x}
Parameter	Description
FaultCode	List of interconnect device fault codes on the current node. Array object, including the EventType, AssembledFaultCode, PeerPortDevice, PeerPortId, SwitchChipId, SwitchPortId, Severity, Assertion, and AlarmRaisedTime fields.
-EventType	Alarm ID
-AssembledFaultCode	Fault code
-PeerPortDevice	Type of the interconnected device. 0: CPU; 1: NPU; 2: SW; 0xFFFF: N/A.
-PeerPortId	ID of the interconnected device
-SwitchChipId	ID of the faulty UnifiedBus chip; numbered from 0.
-SwitchPortId	ID of the faulty UnifiedBus port; numbered from 0.
-Severity	Fault severity. 0: warning; 1: minor; 2: major; 3: critical.
-Assertion	Event type. 0: fault recovered; 1: fault occurred; 2: notification event.
-AlarmRaisedTime	Time when a fault or event occurs
FaultLevel	Fault handling level of the current node. The highest fault level in FaultCode is used. The options are NotHandle, SubHealthFault, Separate, and RestartRequest.
UpdateTime	Update time when a fault is reported.
NodeStatus	Health status of the current node, derived from the value of FaultLevel. NotHandle: Healthy; SubHealthFault: SubHealthy; Separate: UnHealthy; RestartRequest: UnHealthy.
FaultTimeAndLevelMap	List of fault occurrence time and fault handling levels. Array object, including fault code, faulty UnifiedBus chip ID, faulty UnifiedBus port ID, fault_time, and fault_level. The key consists of the fault code, faulty UnifiedBus chip ID, and faulty UnifiedBus port ID, which are connected by underscores (_).
-fault_time	Time when a fault occurs
-fault_level	Fault handling level
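
The enumerations and the key format in Table 3 can be written down as lookup tables, as in the sketch below. The dictionary and function names are hypothetical, and the fault code in the usage example is a made-up placeholder rather than a real code.

```python
# FaultLevel -> NodeStatus, as listed for NodeStatus in Table 3.
FAULT_LEVEL_TO_NODE_STATUS = {
    "NotHandle": "Healthy",
    "SubHealthFault": "SubHealthy",
    "Separate": "UnHealthy",
    "RestartRequest": "UnHealthy",
}

# Numeric codes from the PeerPortDevice, Severity, and Assertion rows.
PEER_PORT_DEVICE = {0: "CPU", 1: "NPU", 2: "SW", 0xFFFF: "N/A"}
SWITCH_SEVERITY = {0: "warning", 1: "minor", 2: "major", 3: "critical"}
ASSERTION = {0: "fault recovered", 1: "fault occurred", 2: "notification event"}


def fault_map_key(fault_code: str, switch_chip_id: int, switch_port_id: int) -> str:
    """Build a FaultTimeAndLevelMap key: code, chip ID, and port ID joined by "_"."""
    return f"{fault_code}_{switch_chip_id}_{switch_port_id}"


if __name__ == "__main__":
    # "0x1234" is a placeholder fault code used only for illustration.
    print(fault_map_key("0x1234", 0, 3))
    print(FAULT_LEVEL_TO_NODE_STATUS["SubHealthFault"])
    print(PEER_PORT_DEVICE[0xFFFF], SWITCH_SEVERITY[2], ASSERTION[1])
```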

statistic-fault-info

This ConfigMap is located in the cluster-system namespace created by the user, and the label is mc-statistic-fault=true. It is used to display the public fault information in a cluster.

**Table 4** Data description
Parameter	Description
PublicFaults	Public fault details. If the number of faults is too large, this field is not updated. For details about the following fields, see Table 1.
-<node name>	Faulty node.
-resource	Fault sender. The default value can be CCAE, fd-online, pingmesh, or Netmind.
-devIds	Physical ID of the faulty processor.
-faultId	Fault instance ID.
-type	Fault type. NPU: processor fault; Node: node fault; Network: network fault; Storage: storage fault.
-faultCode	Fault code.
-level	Fault level. NotHandleFault: no handling is required. SubHealthFault: sub-health. SeparateNPU: the fault cannot be rectified, and the corresponding processors must be isolated. PreSeparateNPU: services are not affected for now, but no jobs will be scheduled to the processor.
-faultTime	Time when a fault occurs.
FaultNum	Number of faults.
-publicFaultNum	Total number of public faults on all nodes.
Description	Message displayed when the number of public faults is too large.
NOTE: At most 1 MB of public fault data is displayed, which is approximately 4,500 records. If the number of records exceeds 4,500, some data is not displayed, and the Description field is added to the ConfigMap as a prompt. The internal cache is not affected and continues to work properly.
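
As an illustration of how the fields in Table 4 might be consumed, the sketch below counts public faults by level and checks whether the data has been truncated. The data shape is an assumption: it supposes that, once decoded, PublicFaults maps each node name to a list of fault records carrying the fields listed in Table 4. The function names and all sample values are hypothetical.

```python
from collections import Counter


def count_public_faults_by_level(public_faults: dict) -> Counter:
    """Count fault records per level across all nodes.

    Assumed shape: {node_name: [{"level": ..., "type": ..., ...}, ...]}.
    """
    counts = Counter()
    for faults in public_faults.values():
        for fault in faults:
            counts[fault["level"]] += 1
    return counts


def is_truncated(configmap_data: dict) -> bool:
    """Description is only added when there are too many public faults."""
    return bool(configmap_data.get("Description"))


if __name__ == "__main__":
    # Sample values are placeholders, not real ClusterD output.
    sample = {
        "kwok-node-0": [
            {"resource": "CCAE", "type": "NPU", "level": "SeparateNPU"},
            {"resource": "pingmesh", "type": "Network", "level": "SubHealthFault"},
        ],
        "kwok-node-1": [
            {"resource": "CCAE", "type": "NPU", "level": "SeparateNPU"},
        ],
    }
    print(count_public_faults_by_level(sample))
    print(is_truncated({"PublicFaults": sample}))
```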

cluster-system super-pod-<super-pod-id>

This ConfigMap is located in the cluster-system namespace created by the user, and the label is app=pingmesh.

**Table 5** cluster-system super-pod-<super-pod-id>
Parameter	Description
app	Label key that NodeD requires to identify the ConfigMap. The value is pingmesh.
superPodDevice	Key of the SuperPoD information.
SuperPodID	SuperPoD ID.
NodeDeviceMap	Information about all nodes in a SuperPoD.
NodeName	Node name.
DeviceMap	Information about all NPUs on a node, displayed in the format of "physicID:superDeviceID".
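
A DeviceMap entry in the "physicID:superDeviceID" format can be split as sketched below. This assumes that each entry is available as a single string with one colon separator, which the table does not state explicitly. The function name and the sample values are hypothetical.

```python
def parse_device_entry(entry: str) -> tuple[str, str]:
    """Split one "physicID:superDeviceID" entry into its two parts."""
    physic_id, separator, super_device_id = entry.partition(":")
    if not separator or not physic_id or not super_device_id:
        raise ValueError(f"not in physicID:superDeviceID format: {entry!r}")
    return physic_id, super_device_id


if __name__ == "__main__":
    # "0:100" is a placeholder value used only for illustration.
    print(parse_device_entry("0:100"))
```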

fault-job-info

This ConfigMap is in the cluster-system namespace created by the user. It is used to display information about the faulty jobs that require forcible release of communication resources in a cluster. It takes effect only when process-level rescheduling is performed on the Atlas 900 A3 SuperPoD.

**Table 6** fault-job-info
Parameter	Description	Value
SdIds	SDID of the faulty card.	String sequence
NodeNames	Name of the node that requires forcible release of resources.	String sequence
FaultTimes	Time when a fault occurred.	64-bit integer
JobId	Job UID.	String
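
The value types in Table 6 can be captured in a small record type, as in the sketch below. The class name `FaultJobInfo` and the validation logic are hypothetical; only the field names and their types come from the table. The sample values are placeholders.

```python
from dataclasses import dataclass

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


@dataclass
class FaultJobInfo:
    """One fault-job-info record, with field names as in Table 6."""

    SdIds: list[str]      # SDIDs of the faulty cards
    NodeNames: list[str]  # nodes that require forcible release of resources
    FaultTimes: int       # time when the fault occurred (64-bit integer)
    JobId: str            # job UID

    def __post_init__(self) -> None:
        for field_name in ("SdIds", "NodeNames"):
            values = getattr(self, field_name)
            if not all(isinstance(value, str) for value in values):
                raise TypeError(f"{field_name} must contain only strings")
        if not isinstance(self.FaultTimes, int):
            raise TypeError("FaultTimes must be an integer")
        if not INT64_MIN <= self.FaultTimes <= INT64_MAX:
            raise ValueError("FaultTimes does not fit in 64 bits")
        if not isinstance(self.JobId, str):
            raise TypeError("JobId must be a string")


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    info = FaultJobInfo(
        SdIds=["1001", "1002"],
        NodeNames=["kwok-node-0"],
        FaultTimes=1700000000,
        JobId="example-job-uid",
    )
    print(info)
```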
