ConfigMap

Description

ClusterD receives public fault information through a ConfigMap and passes it to the resumable training process.

  • If a parameter value in the ConfigMap falls outside its defined value range, ClusterD discards the fault information.
  • When faults are injected through ConfigMap or gRPC interfaces, the maximum number of faults on all nodes is 50,000. If this threshold is exceeded, ClusterD discards any newly injected fault information.
  • The label of ConfigMap must be mc-consumer-publicfault=true, and the data key must be PublicFault.
  • When a ConfigMap is used to send public faults, the data volume cannot exceed 1 MB. Otherwise, the ConfigMap update fails.
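
The label, data key, and size constraints above can be sketched in Python as follows. This is a minimal illustration rather than ClusterD tooling: the function name and the ConfigMap name and namespace in the usage note are hypothetical, and the sketch assumes that the fault message is stored as a JSON string under the PublicFault key and that 1 MB means 1,048,576 bytes.

```python
import json

# Assumption: the 1 MB limit is treated as 1 MiB of UTF-8 encoded payload.
MAX_CONFIGMAP_BYTES = 1024 * 1024


def build_public_fault_configmap(message: dict, name: str, namespace: str) -> dict:
    """Wrap a public fault message in a ConfigMap manifest (returned as a dict)."""
    # Assumption: the message is serialized as compact JSON.
    payload = json.dumps(message, separators=(",", ":"))
    size = len(payload.encode("utf-8"))
    if size > MAX_CONFIGMAP_BYTES:
        raise ValueError(f"payload is {size} bytes, which exceeds the 1 MB limit")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label required for public fault ConfigMaps.
            "labels": {"mc-consumer-publicfault": "true"},
        },
        # The data key must be PublicFault.
        "data": {"PublicFault": payload},
    }
```

Printing the result of build_public_fault_configmap(message, "public-fault-demo", "default") with json.dumps gives a JSON manifest that can be saved to a file and applied with kubectl apply -f. The size check covers only the serialized message, so it is an approximation of the limit on the whole ConfigMap.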

Parameters

For details, see the following tables.

Table 1 Fault description

| Parameter | Meaning | Value | Type | Mandatory or Not |
| --- | --- | --- | --- | --- |
| id | Unique ID of a message | A string of 8 to 128 characters, including uppercase letters, lowercase letters, digits, hyphens (-), underscores (_), and periods (.). The value must be unique. | String | Yes |
| timestamp | Timestamp for message sending | The value is a 13-digit number in ms and must be later than 2025-01-01T00:00:00Z. | Int64 | Yes |
| version | Message version | The value is 1.0. | String | Yes |
| resource | Fault sender | The value can be CCAE, fd-online, pingmesh, Netmind, or dpcStorage. | String | Yes |
| faults | Fault details | The value is a slice, whose length is greater than 0 and less than or equal to 100. | []object, fault | Yes |
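
The value ranges in Table 1 can be checked before a message is sent, because ClusterD discards messages whose values fall outside the defined ranges. The following Python sketch is one possible pre-check, not the validation that ClusterD itself performs; the function names are hypothetical.

```python
import re
from datetime import datetime, timezone

# 8 to 128 characters: letters, digits, hyphens, underscores, periods.
ID_PATTERN = re.compile(r"[A-Za-z0-9._-]{8,128}")
# 2025-01-01T00:00:00Z expressed in milliseconds.
MIN_TIME_MS = int(datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
SENDERS = {"CCAE", "fd-online", "pingmesh", "Netmind", "dpcStorage"}


def is_valid_time_ms(value) -> bool:
    """True for a 13-digit millisecond timestamp later than 2025-01-01T00:00:00Z."""
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and len(str(value)) == 13
        and value > MIN_TIME_MS
    )


def check_message(msg: dict) -> list:
    """Return the names of the Table 1 parameters that violate their value range."""
    errors = []
    msg_id = msg.get("id")
    if not (isinstance(msg_id, str) and ID_PATTERN.fullmatch(msg_id)):
        errors.append("id")
    if not is_valid_time_ms(msg.get("timestamp")):
        errors.append("timestamp")
    if msg.get("version") != "1.0":
        errors.append("version")
    resource = msg.get("resource")
    if not (isinstance(resource, str) and resource in SENDERS):
        errors.append("resource")
    faults = msg.get("faults")
    if not (isinstance(faults, list) and 0 < len(faults) <= 100):
        errors.append("faults")
    return errors
```

An empty list means that all five parameters are within range. The uniqueness of id across messages is not checked here, because that requires state outside a single message.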

Table 2 Fault field description

| Parameter | Meaning | Value | Type | Mandatory or Not |
| --- | --- | --- | --- | --- |
| faultId | Fault instance ID | A string of 8 to 128 characters, including uppercase letters, lowercase letters, digits, hyphens (-), underscores (_), and periods (.). The value must be unique. NOTE: Even for the same fault instance, the value of faultId must be unique. | String | Yes |
| faultType | Fault type | The value can be NPU (processor fault), Node (node fault), Network (network fault), or Storage (storage fault). NOTE: This field is displayed as PublicFault in cluster-info-cm. | String | Yes |
| faultCode | Fault code | The value can be customized and must be unique. The value consists of nine characters. For details, see Fault Code Description. NOTE: (1) The fault code for resumable training must exist in publicFaultCode of the fault configuration file. (2) For new fault codes, configure the fault level in the fault configuration file. For details, see (Optional) Configuring the Public Fault Level and Sender. (3) You are advised to define fault codes based on the fault code description table for subsequent maintenance. (4) If the same fault code is generated on an NPU twice, the fault_code field in cluster-info-cm records the same fault code twice. | String | Yes |
| faultTime | Time when a fault occurs | The value is a 13-digit number in ms and must be later than 2025-01-01T00:00:00Z. NOTE: (1) This field always carries the time at which the fault was generated, whether the message reports an occurrence or a recovery. (2) This field is displayed in the unit of seconds in cluster-info-cm. | Int64 | Yes |
| assertion | Fault status | The value can be occur (fault occurrence), recover (fault recovery), or once (one-off event). NOTE: (1) To clear a public fault, write the recover event of the fault to ConfigMap. The clearance cannot be implemented by directly deleting ConfigMap. (2) For a one-off event, the fault is automatically cleared several seconds later. | String | Yes |
| faultLocation | Fault location | Fault source, with a length less than or equal to 10 characters. The length of the key in the map is less than or equal to 16 characters, and the length of the value is less than or equal to 128 characters. For example: key: npuIp, value: ip | map[string]string | No |
| influence | Fault impact scope | The value is a slice, whose length is greater than 0 and less than or equal to 1000. | []object, faultInfo | Yes |
| description | Fault description | The value is a string of 0 to 512 characters, containing non-whitespace characters and spaces. | String | No |
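
The constraints in Table 2 can be checked in the same way for each entry of faults. The Python sketch below is illustrative only; the function names and the use of a UUID for faultId are assumptions, chosen because faultId must be unique even for the same fault instance. The length-10 limit on the fault source is not checked.

```python
import re
import uuid

FAULT_ID = re.compile(r"[A-Za-z0-9._-]{8,128}")
FAULT_TYPES = {"NPU", "Node", "Network", "Storage"}
ASSERTIONS = {"occur", "recover", "once"}
MIN_TIME_MS = 1735689600000  # 2025-01-01T00:00:00Z in milliseconds
# 0 to 512 characters, each either non-whitespace or a plain space.
DESCRIPTION = re.compile(r"[\S ]{0,512}")


def new_fault_id() -> str:
    """Hypothetical helper: a fresh faultId for every event that is written."""
    return f"fault-{uuid.uuid4()}"


def _one_of(value, allowed) -> bool:
    return isinstance(value, str) and value in allowed


def check_fault(fault: dict) -> list:
    """Return the names of the Table 2 parameters that violate their value range."""
    errors = []
    fid = fault.get("faultId")
    if not (isinstance(fid, str) and FAULT_ID.fullmatch(fid)):
        errors.append("faultId")
    if not _one_of(fault.get("faultType"), FAULT_TYPES):
        errors.append("faultType")
    code = fault.get("faultCode")
    if not (isinstance(code, str) and len(code) == 9):
        errors.append("faultCode")
    t = fault.get("faultTime")
    if not (isinstance(t, int) and not isinstance(t, bool)
            and len(str(t)) == 13 and t > MIN_TIME_MS):
        errors.append("faultTime")
    if not _one_of(fault.get("assertion"), ASSERTIONS):
        errors.append("assertion")
    loc = fault.get("faultLocation")  # optional
    if loc is not None and not (
        isinstance(loc, dict)
        and all(isinstance(k, str) and isinstance(v, str)
                and len(k) <= 16 and len(v) <= 128 for k, v in loc.items())
    ):
        errors.append("faultLocation")
    influence = fault.get("influence")
    if not (isinstance(influence, list) and 0 < len(influence) <= 1000):
        errors.append("influence")
    desc = fault.get("description")  # optional
    if desc is not None and not (isinstance(desc, str) and DESCRIPTION.fullmatch(desc)):
        errors.append("description")
    return errors
```

When a fault is later cleared, the recover event would be built as a new entry with assertion set to recover and a new value from new_fault_id(), rather than by deleting the ConfigMap.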

Table 3 faultInfo description

| Field | Meaning | Value | Type | Mandatory or Not |
| --- | --- | --- | --- | --- |
| nodeName | Node name. You can run the kubectl get nodes -owide command to query the node name. | The value is a string of 1 to 253 characters, including lowercase letters, digits, hyphens (-), and periods (.). It must start and end with a letter or digit. If this field exists, nodeSN is not used. NOTE: If the node name does not exist in a Kubernetes cluster, ClusterD does not display a message indicating that the node name is incorrect, nor does it write the fault information into cluster-info-device-cm. | String | Alternative (either nodeName or nodeSN) |
| nodeSN | Node SN | The value is the node annotation written by NodeD. The key is product-serial-number. NOTE: If this field is used instead of nodeName, install NodeD in advance. | String | Alternative (either nodeName or nodeSN) |
| deviceIds | Physical processor ID | The length range is (0, 32], and element range is [0, 32). The values must be unique. NOTE: (1) If the faulty processor cannot be accurately located, enter the physical IDs of all processors on the node. (2) If a physical ID of a processor that does not exist on a node is passed, ClusterD also displays its physical ID in cluster-info-device-cm. | []int32 | Yes |
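
The faultInfo rules in Table 3 can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the caller knows how many processors a node has when it falls back to listing all physical IDs. It cannot detect a node name or physical ID that does not exist in the cluster, which matters because ClusterD does not report an incorrect node name.

```python
import re

# 1 to 253 characters: lowercase letters, digits, hyphens, periods;
# must start and end with a letter or digit.
NODE_NAME = re.compile(r"[a-z0-9](?:[a-z0-9.-]{0,251}[a-z0-9])?")


def check_fault_info(info: dict) -> list:
    """Return the names of the Table 3 fields that violate their value range."""
    errors = []
    name, sn = info.get("nodeName"), info.get("nodeSN")
    if name is not None:
        # If nodeName exists, nodeSN is not used, so only nodeName is checked.
        if not (isinstance(name, str) and NODE_NAME.fullmatch(name)):
            errors.append("nodeName")
    elif not (isinstance(sn, str) and sn):
        errors.append("nodeName or nodeSN")
    ids = info.get("deviceIds")
    ids_ok = (
        isinstance(ids, list)
        and 0 < len(ids) <= 32  # length range (0, 32]
        and all(isinstance(i, int) and not isinstance(i, bool) and 0 <= i < 32
                for i in ids)  # element range [0, 32)
        and len(set(ids)) == len(ids)  # values must be unique
    )
    if not ids_ok:
        errors.append("deviceIds")
    return errors


def all_device_ids(device_count: int) -> list:
    """Fallback when the faulty processor cannot be located: every physical ID."""
    return list(range(device_count))
```

For example, check_fault_info({"nodeName": "worker-1", "deviceIds": all_device_ids(8)}) returns an empty list for a hypothetical node with eight processors.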