Cluster Resources
ConfigMap Description
After ClusterD is started, the following ConfigMaps are created:
- cluster-info-node-cm (see Table 1).
- cluster-info-device-${m} (see Table 2). m is an integer starting from 0. Each time 1000 nodes are added to the cluster, a ConfigMap file named cluster-info-device-${m} is added.
- cluster-info-switch-${x} (see Table 3). x is an integer starting from 0. Each time 2000 nodes are added to the cluster, a ConfigMap file named cluster-info-switch-${x} is added.
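
The naming rule above can be sketched as a small helper that lists the ConfigMap names to expect for a given cluster size. This is only an illustration, not ClusterD's implementation: the function name `expected_configmaps` is hypothetical, and it assumes that index 0 always exists and that a new shard appears once the node count exceeds a multiple of 1000 (device) or 2000 (switch). The exact boundary behavior is not stated in this document.

```python
import math

# Shard sizes taken from the list above.
DEVICE_NODES_PER_CM = 1000
SWITCH_NODES_PER_CM = 2000


def expected_configmaps(node_count: int) -> list[str]:
    """Return the ConfigMap names expected for a cluster of node_count nodes.

    ASSUMPTION: at least one device and one switch ConfigMap (index 0) exist,
    and shard k covers nodes [k * size, (k + 1) * size).
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    device_shards = max(1, math.ceil(node_count / DEVICE_NODES_PER_CM))
    switch_shards = max(1, math.ceil(node_count / SWITCH_NODES_PER_CM))
    names = ["cluster-info-node-cm"]
    names += [f"cluster-info-device-{m}" for m in range(device_shards)]
    names += [f"cluster-info-switch-{x}" for x in range(switch_shards)]
    return names
```

Under these assumptions, a cluster of 2500 nodes would have three device ConfigMaps (indexes 0 to 2) and two switch ConfigMaps (indexes 0 and 1).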
Table 1 cluster-info-node-cm

| Parameter | Description |
|---|---|
| mindx-dl-nodeinfo-<kwok-node-0> | The prefix is fixed to mindx-dl-nodeinfo, and kwok-node-0 is the node name, which is used to locate the faulty node. |
| NodeInfo | Node fault information. |
| FaultDevList | List of faulty devices on a node. |
| - DeviceType | Faulty device type. |
| - DeviceId | ID of the faulty device. |
| - FaultCode | Fault code, a hexadecimal string consisting of letters and digits. |
| - FaultLevel | Fault handling level. |
| NodeStatus | Node health status, which is determined by the device with the highest fault handling level on the node. |
Table 2 cluster-info-device-${m}

| Parameter | Description |
|---|---|
| mindx-dl-deviceinfo-<kwok-node-0> | The prefix is fixed to mindx-dl-deviceinfo, and kwok-node-0 is the node name for locating the faulty node. |
| huawei.com/Ascend910 | Name of the available processor on the current node. If there are multiple processors, use commas (,) to separate them. |
| huawei.com/Ascend910-NetworkUnhealthy | Name of the processor with unhealthy network on the current node. If there are multiple processors, use commas (,) to separate them. |
| huawei.com/Ascend910-Unhealthy | Name of the unhealthy processor on the current node. If there are multiple processors, use commas (,) to separate them. |
| huawei.com/Ascend910-Fault | Array object, including fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, fault_code, and fault_time_and_level_map. |
| - fault_type | Fault type. |
| - npu_name | Name of the faulty processor. This parameter is left empty if the node is faulty. |
| - large_model_fault_level | Fault handling type. This parameter is left empty for node faults. |
| - fault_level | |
| - fault_handling | |
| - fault_code | Fault code, a string of characters separated by commas (,). |
| - fault_time_and_level_map | Fault code, fault occurrence time, and fault handling level. |
| SuperPodID | SuperPoD ID. |
| ServerIndex | Relative position of the current node in a SuperPoD. |
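
The three processor fields in the device information are comma-separated name lists, so a consumer typically splits them before use. The following is a minimal sketch of that step. The function name `parse_processor_fields` is hypothetical, and the input is assumed to be a plain dictionary of string values that has already been extracted from the ConfigMap. How the values are actually encoded inside the ConfigMap is not specified here.

```python
# Keys taken from the device information table in this document.
PROCESSOR_KEYS = (
    "huawei.com/Ascend910",
    "huawei.com/Ascend910-NetworkUnhealthy",
    "huawei.com/Ascend910-Unhealthy",
)


def parse_processor_fields(device_info: dict[str, str]) -> dict[str, list[str]]:
    """Split each comma-separated processor field into a list of names.

    Missing or empty fields yield an empty list. Whitespace around names
    is stripped (ASSUMPTION: names themselves contain no commas).
    """
    result: dict[str, list[str]] = {}
    for key in PROCESSOR_KEYS:
        raw = device_info.get(key, "")
        result[key] = [name.strip() for name in raw.split(",") if name.strip()]
    return result
```

For example, a hypothetical value of `"Ascend910-0,Ascend910-1"` for `huawei.com/Ascend910` would be split into two processor names.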
Table 3 cluster-info-switch-${x}

| Parameter | Description |
|---|---|
| FaultCode | List of interconnect device fault codes on the current node. Array object, including the EventType, AssembledFaultCode, PeerPortDevice, PeerPortId, SwitchChipId, SwitchPortId, Severity, Assertion, and AlarmRaisedTime fields. |
| - EventType | Alarm ID. |
| - AssembledFaultCode | Fault code. |
| - PeerPortDevice | Type of the interconnected device. |
| - PeerPortId | ID of the interconnected device. |
| - SwitchChipId | ID of the faulty UnifiedBus chip, numbered from 0. |
| - SwitchPortId | ID of the faulty UnifiedBus port, numbered from 0. |
| - Severity | Fault severity. |
| - Assertion | Event type. |
| - AlarmRaisedTime | Time when a fault or event occurs. |
| FaultLevel | Fault handling level of the current node. The highest fault level in FaultCode is used. The options are NotHandle, SubHealthFault, Separate, and RestartRequest. |
| UpdateTime | Time when the reported fault information was last updated. |
| NodeStatus | Health status of the current node, derived from FaultLevel: NotHandle corresponds to Healthy, SubHealthFault to SubHealthy, Separate to UnHealthy, and RestartRequest to UnHealthy. |
| FaultTimeAndLevelMap | Map of fault occurrence times and fault handling levels. Each key consists of the fault code, faulty UnifiedBus chip ID, and faulty UnifiedBus port ID, joined by underscores (_). Each value contains fault_time and fault_level. |
| - fault_time | Time when a fault occurs. |
| - fault_level | Fault handling level. |
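
The relationship between FaultLevel, NodeStatus, and the FaultTimeAndLevelMap key can be sketched as follows. The level-to-status mapping and the underscore-joined key come from the table. The severity ordering in `LEVEL_ORDER` is an assumption based on the order in which the options are listed, and the helper names are hypothetical.

```python
# Mapping stated in the NodeStatus description.
NODE_STATUS_BY_FAULT_LEVEL = {
    "NotHandle": "Healthy",
    "SubHealthFault": "SubHealthy",
    "Separate": "UnHealthy",
    "RestartRequest": "UnHealthy",
}

# ASSUMPTION: severity increases in the order the options are listed.
LEVEL_ORDER = ["NotHandle", "SubHealthFault", "Separate", "RestartRequest"]


def node_fault_level(fault_levels: list[str]) -> str:
    """Return the highest fault level among the given levels.

    An empty list is treated as NotHandle (ASSUMPTION).
    """
    return max(fault_levels, key=LEVEL_ORDER.index, default="NotHandle")


def node_status(fault_levels: list[str]) -> str:
    """Derive NodeStatus from the highest fault level."""
    return NODE_STATUS_BY_FAULT_LEVEL[node_fault_level(fault_levels)]


def fault_map_key(fault_code: str, chip_id: int, port_id: int) -> str:
    """Build a FaultTimeAndLevelMap key: code, chip ID, and port ID joined by '_'."""
    return f"{fault_code}_{chip_id}_{port_id}"
```

With this ordering, a node that reports both SubHealthFault and Separate faults resolves to the Separate level and therefore to the UnHealthy status.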
statistic-fault-info
This ConfigMap is located in the cluster-system namespace created by the user, and the label is mc-statistic-fault=true. It is used to display the public fault information in a cluster.
| Parameter | Description |
|---|---|
| PublicFaults | Public fault details. If the number of faults is too large, this field is not updated. For details about the following fields, see Table 1. |
| - <node name> | Faulty node. |
| - resource | Fault sender. The default value can be CCAE, fd-online, pingmesh, or Netmind. |
| - devIds | Physical ID of the faulty processor. |
| - faultId | Fault instance ID. |
| - type | Fault type. |
| - faultCode | Fault code. |
| - level | Fault level. |
| - faultTime | Time when a fault occurs. |
| FaultNum | Number of faults. |
| - publicFaultNum | Total number of public faults on all nodes. |
| Description | Message displayed when the number of public faults is too large. |
cluster-system super-pod-<super-pod-id>
This ConfigMap is located in the cluster-system namespace created by the user, and the label is app=pingmesh.
| Parameter | Description |
|---|---|
| app | Label key required by NodeD to identify the ConfigMap. The value is pingmesh. |
| superPodDevice | Key of the SuperPoD information. |
| SuperPodID | SuperPoD ID. |
| NodeDeviceMap | Information about all nodes in a SuperPoD. |
| NodeName | Node name. |
| DeviceMap | Information about all NPUs on a node, displayed in the format of "physicID:superDeviceID". |
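
A DeviceMap entry in the "physicID:superDeviceID" format can be split into its two parts as sketched below. The function name `parse_device_entry` is hypothetical, and both IDs are kept as strings because this document does not state their numeric type.

```python
def parse_device_entry(entry: str) -> tuple[str, str]:
    """Split a 'physicID:superDeviceID' entry into (physic_id, super_device_id).

    Raises ValueError if the separator or either part is missing.
    """
    physic_id, sep, super_device_id = entry.partition(":")
    physic_id = physic_id.strip()
    super_device_id = super_device_id.strip()
    if not sep or not physic_id or not super_device_id:
        raise ValueError(f"malformed DeviceMap entry: {entry!r}")
    return physic_id, super_device_id
```

The sample values used to exercise this helper, such as `"0:4194304"`, are made up for illustration and are not taken from a real SuperPoD.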
fault-job-info
This ConfigMap is in the cluster-system namespace created by the user. It is used to display information about the faulty jobs that require forcible release of communication resources in a cluster. It takes effect only when process-level rescheduling is performed on the Atlas 900 A3 SuperPoD.
| Parameter | Description | Value |
|---|---|---|
| SdIds | SDID of the faulty card. | String sequence |
| NodeNames | Name of the node that requires forcible release of resources. | String sequence |
| FaultTimes | Time when a fault occurred. | 64-bit integer |
| JobId | Job UID. | String |
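
The field types in the table above map naturally onto a small typed record. The sketch below assumes that an entry is available as a JSON object with exactly these four keys. The class name `FaultJobInfo`, the JSON encoding, and the sample payload used to exercise it are assumptions for illustration, not the documented storage format.

```python
import json
from dataclasses import dataclass


@dataclass
class FaultJobInfo:
    sd_ids: list[str]      # SdIds: SDIDs of the faulty cards
    node_names: list[str]  # NodeNames: nodes requiring forcible resource release
    fault_times: int       # FaultTimes: fault time as a 64-bit integer
    job_id: str            # JobId: job UID

    @classmethod
    def from_json(cls, raw: str) -> "FaultJobInfo":
        """Parse one entry from a JSON object (ASSUMED encoding)."""
        data = json.loads(raw)
        return cls(
            sd_ids=[str(s) for s in data["SdIds"]],
            node_names=[str(n) for n in data["NodeNames"]],
            fault_times=int(data["FaultTimes"]),
            job_id=str(data["JobId"]),
        )
```

Python integers are arbitrary precision, so a 64-bit FaultTimes value is held without loss. The unit of that value (for example seconds or milliseconds) is not stated in this document.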