ConfigMap
Description
Receives public fault information through a ConfigMap so that the faults can be fed into the resumable training process.
- If a parameter value in the ConfigMap falls outside its defined value range, ClusterD discards the fault information.
- When faults are injected through the ConfigMap or gRPC interfaces, the maximum number of faults on all nodes is 50,000. If this threshold is exceeded, ClusterD discards any newly injected fault information.
- The ConfigMap must carry the label mc-consumer-publicfault=true, and its data key must be PublicFault.
- When a ConfigMap is used to send public faults, the data volume cannot exceed 1 MB. Otherwise, the ConfigMap update fails.
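
The constraints above can be sketched as a small helper that wraps a fault message in a ConfigMap manifest. This is a minimal sketch, not ClusterD tooling: the function name, the ConfigMap name, and the namespace are hypothetical, and encoding the message as a JSON string under the PublicFault key is an assumption that the text above does not confirm. The size check covers only the payload and reads 1 MB as 1,048,576 bytes.

```python
import json

# Assumption: the 1 MB limit is taken as 1,048,576 bytes and is
# checked against the serialized payload only.
MAX_PAYLOAD_BYTES = 1024 * 1024


def build_public_fault_configmap(name, namespace, message):
    """Return a ConfigMap manifest (as a dict) that carries one fault message."""
    # Assumption: the message is stored as a JSON string.
    payload = json.dumps(message, separators=(",", ":"))
    if len(payload.encode("utf-8")) > MAX_PAYLOAD_BYTES:
        raise ValueError("public fault payload exceeds 1 MB")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,            # hypothetical name
            "namespace": namespace,  # hypothetical namespace
            "labels": {"mc-consumer-publicfault": "true"},
        },
        "data": {"PublicFault": payload},
    }
```

Because kubectl accepts JSON manifests, the returned dict can be written out with json.dumps and applied as a file. Rejecting an oversized payload before the update is attempted keeps the failure on the sender side.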
Parameters
For details, see the following tables.

| Parameter | Meaning | Value | Type | Mandatory or Not |
|---|---|---|---|---|
| id | Unique ID of a message | A string of 8 to 128 characters, including uppercase letters, lowercase letters, digits, hyphens (-), underscores (_), and periods (.). The value must be unique. | String | Yes |
| timestamp | Time when the message is sent | The value is a 13-digit number in ms and must be later than 2025-01-01T00:00:00Z. | Int64 | Yes |
| version | Message version | The value is 1.0. | String | Yes |
| resource | Fault sender | The value can be CCAE, fd-online, pingmesh, Netmind, or dpcStorage. | String | Yes |
| faults | Fault details | The value is a slice, whose length is greater than 0 and less than or equal to 100. | []object, fault | Yes |
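
The rules for the top-level parameters can be checked on the sender side before a message is submitted. The following is a minimal sketch under stated assumptions: check_message is a hypothetical helper, not part of ClusterD, and it assumes the message has already been parsed into a Python dict. It does not check that id is unique across messages.

```python
import re
from datetime import datetime, timezone

# 8 to 128 characters: letters, digits, hyphens, underscores, periods.
ID_PATTERN = re.compile(r"[A-Za-z0-9._-]{8,128}")
# 2025-01-01T00:00:00Z expressed in milliseconds since the Unix epoch.
MIN_TIME_MS = int(datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
RESOURCES = {"CCAE", "fd-online", "pingmesh", "Netmind", "dpcStorage"}


def check_message(msg):
    """Return the names of top-level parameters that break a rule."""
    errors = []
    msg_id = msg.get("id")
    if not (isinstance(msg_id, str) and ID_PATTERN.fullmatch(msg_id)):
        errors.append("id")
    ts = msg.get("timestamp")
    if not (isinstance(ts, int) and len(str(ts)) == 13 and ts > MIN_TIME_MS):
        errors.append("timestamp")
    if msg.get("version") != "1.0":
        errors.append("version")
    if msg.get("resource") not in RESOURCES:
        errors.append("resource")
    faults = msg.get("faults")
    if not (isinstance(faults, list) and 0 < len(faults) <= 100):
        errors.append("faults")
    return errors
```

An empty list means that every top-level parameter passed. Returning all violations at once, rather than stopping at the first, makes it easier to correct a message in a single pass.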

Each element of faults is a fault object with the following parameters.

| Parameter | Meaning | Value | Type | Mandatory or Not |
|---|---|---|---|---|
| faultId | Fault instance ID | A string of 8 to 128 characters, including uppercase letters, lowercase letters, digits, hyphens (-), underscores (_), and periods (.). The value must be unique. NOTE: Even for the same fault instance, the value of faultId must be unique. | String | Yes |
| faultType | Fault type | The value can be NPU, Node, Network, or Storage. | String | Yes |
| faultCode | Fault code | The value can be customized and must be unique. The value consists of nine characters. For details, see Fault Code Description. | String | Yes |
| faultTime | Time when a fault occurs | The value is a 13-digit number in ms and must be later than 2025-01-01T00:00:00Z. | Int64 | Yes |
| assertion | Fault status | The value can be occur, recover, or once. | String | Yes |
| faultLocation | Fault location | Fault source, with a length less than or equal to 10 characters. The length of the key in the map is less than or equal to 16 characters, and the length of the value is less than or equal to 128 characters. For example: key: npuIp, value: ip | map[string]string | No |
| influence | Fault impact scope | The value is a slice, whose length is greater than 0 and less than or equal to 1000. | []object, faultInfo | Yes |
| description | Fault description | The value is a string of 0 to 512 characters, containing non-whitespace characters and spaces. | String | No |
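
A fault object that follows these rules can be assembled programmatically. The sketch below is illustrative only: make_fault and its now_ms parameter are hypothetical, generating faultId from a random UUID is one possible way to obtain a fresh ID on each call, and the length limit of 10 stated for faultLocation is not checked because its scope is not clear from the table.

```python
import time
import uuid

FAULT_TYPES = {"NPU", "Node", "Network", "Storage"}
ASSERTIONS = {"occur", "recover", "once"}


def make_fault(fault_type, fault_code, assertion, influence,
               location=None, description=None, now_ms=None):
    """Build one element of faults, rejecting values the table rules out."""
    if fault_type not in FAULT_TYPES:
        raise ValueError(f"unsupported faultType: {fault_type!r}")
    if assertion not in ASSERTIONS:
        raise ValueError(f"unsupported assertion: {assertion!r}")
    if len(fault_code) != 9:
        raise ValueError("faultCode must consist of nine characters")
    if not 0 < len(influence) <= 1000:
        raise ValueError("influence must hold 1 to 1000 entries")
    fault = {
        # uuid4().hex yields 32 hexadecimal characters, new on every call.
        "faultId": uuid.uuid4().hex,
        "faultType": fault_type,
        "faultCode": fault_code,
        # Current time in milliseconds unless the caller supplies one.
        "faultTime": int(time.time() * 1000) if now_ms is None else now_ms,
        "assertion": assertion,
        "influence": influence,
    }
    if location:
        for key, value in location.items():
            if len(key) > 16 or len(value) > 128:
                raise ValueError(f"faultLocation entry too long: {key!r}")
        fault["faultLocation"] = dict(location)
    if description is not None:
        if len(description) > 512:
            raise ValueError("description exceeds 512 characters")
        fault["description"] = description
    return fault
```

For example, make_fault("NPU", "123456789", "occur", [{"nodeName": "node-1", "deviceIds": [0]}]) returns a dict ready to be placed in faults. The fault code 123456789 is a placeholder, not a real code.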

Each element of influence is a faultInfo object with the following fields.

| Field | Meaning | Value | Type | Mandatory or Not |
|---|---|---|---|---|
| nodeName | Node name. You can run the kubectl get nodes -owide command to query the node name. | The value is a string of 1 to 253 characters, including lowercase letters, digits, hyphens (-), and periods (.). It must start and end with a letter or digit. If this field exists, nodeSN is not used. NOTE: If the node name does not exist in a Kubernetes cluster, ClusterD does not display a message indicating that the node name is incorrect, nor does it write the fault information into cluster-info-device-cm. | String | Alternative (nodeName or nodeSN) |
| nodeSN | Node SN | The value is the node annotation written by NodeD. The key is product-serial-number. NOTE: If this field is used instead of nodeName, install NodeD in advance. | String | Alternative (nodeName or nodeSN) |
| deviceIds | Physical processor ID | The length range is (0, 32], and element range is [0, 32). The values must be unique. | []int32 | Yes |
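
The faultInfo rules, including the interval notation for deviceIds, can be expressed as a short check. This is a sketch under assumptions: check_fault_info is a hypothetical helper that works on an already parsed dict, and it cannot verify that the node actually exists in the cluster or that nodeSN matches an annotation written by NodeD.

```python
import re

# 1 to 253 characters: lowercase letters, digits, hyphens, periods,
# starting and ending with a letter or digit.
NODE_NAME = re.compile(r"[a-z0-9]([a-z0-9.-]{0,251}[a-z0-9])?")


def check_fault_info(info):
    """Return the names of faultInfo fields that break a rule."""
    errors = []
    name, serial = info.get("nodeName"), info.get("nodeSN")
    if name is not None:
        # When nodeName is present, nodeSN is not used and is not checked.
        if not (isinstance(name, str) and NODE_NAME.fullmatch(name)):
            errors.append("nodeName")
    elif not serial:
        errors.append("nodeName or nodeSN")
    ids = info.get("deviceIds")
    ids_ok = (
        isinstance(ids, list)
        and 0 < len(ids) <= 32                                    # length in (0, 32]
        and all(isinstance(i, int) and 0 <= i < 32 for i in ids)  # elements in [0, 32)
        and len(set(ids)) == len(ids)                             # no duplicates
    )
    if not ids_ok:
        errors.append("deviceIds")
    return errors
```

Because ClusterD does not report an unknown node name, a sender that wants that guarantee would have to compare nodeName against the output of kubectl get nodes itself.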