Public Faults

Public faults are faults reported by fault senders other than MindCluster components. They include NPU faults, node faults, network faults, and storage faults.

To receive public faults, Ascend Device Plugin must be installed on a node and device-info-cm must be generated.

Reporting Mechanism

Upon fault detection, the public fault sender transmits the fault details to ClusterD through ConfigMap or gRPC. ClusterD summarizes the received information, writes it to cluster-info-device-cm, and reports it to Ascend-volcano-plugin.

  • ConfigMap: The fault sender writes the fault information into a ConfigMap, and ClusterD reads it from there. To inject public faults through the ConfigMap interface, see ConfigMap.
  • gRPC: The fault sender sends the fault information directly to ClusterD through gRPC. To inject public faults through the gRPC interface, see gRPC.
Figure 1 Reporting public faults
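
The reporting flow above can be illustrated with a small in-memory model. This is only a sketch of the sequence (sender, then ClusterD, then a per-node summary), not the MindCluster API: the class names PublicFault and ClusterDModel, the field names, and the sender and node names are all hypothetical, and the real ConfigMap and gRPC payload formats are defined in the ConfigMap and gRPC interface references.

```python
from collections import defaultdict
from dataclasses import dataclass

# Every name here (PublicFault, ClusterDModel, field names, sender names)
# is hypothetical and only models the flow described in the text.


@dataclass(frozen=True)
class PublicFault:
    sender: str      # non-MindCluster component that detected the fault
    node: str        # node the fault applies to
    fault_type: str  # e.g. "NPU", "node", "network", or "storage"
    channel: str     # how it was reported: "configmap" or "grpc"


class ClusterDModel:
    """In-memory stand-in for ClusterD's collect-and-summarize step."""

    CHANNELS = {"configmap", "grpc"}

    def __init__(self):
        self._faults = []

    def receive(self, fault):
        # Public faults arrive through one of the two documented channels.
        if fault.channel not in self.CHANNELS:
            raise ValueError(f"unsupported channel: {fault.channel}")
        self._faults.append(fault)

    def summarize(self):
        # Group fault types by node; this stands in for the summary that
        # ClusterD writes to cluster-info-device-cm.
        summary = defaultdict(list)
        for fault in self._faults:
            summary[fault.node].append(fault.fault_type)
        return dict(summary)


clusterd = ClusterDModel()
clusterd.receive(PublicFault("storage-monitor", "node-1", "storage", "configmap"))
clusterd.receive(PublicFault("net-probe", "node-1", "network", "grpc"))
clusterd.receive(PublicFault("net-probe", "node-2", "network", "grpc"))
summary = clusterd.summarize()
print(summary)
```

In this model both channels feed the same list, which mirrors the point that ClusterD summarizes public faults regardless of whether they arrived through ConfigMap or gRPC.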

Required Components

For public fault detection to work properly, install the following components.

  • Mandatory components: Volcano, Ascend Operator, Ascend Device Plugin, and ClusterD
  • Optional component: NodeD
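
As a quick pre-deployment check, the component list above can be encoded as data and compared against what is installed. The sketch below assumes you already have the installed component names from your own inventory; the function name missing_components is hypothetical, and how the list is collected is outside the scope of this example.

```python
# Component names are taken from the list above.
MANDATORY = {"Volcano", "Ascend Operator", "Ascend Device Plugin", "ClusterD"}
OPTIONAL = {"NodeD"}


def missing_components(installed):
    """Return the mandatory components absent from `installed`, sorted by name."""
    return sorted(MANDATORY - set(installed))


# Hypothetical inventory: two mandatory components are not installed yet.
installed = ["Volcano", "ClusterD", "NodeD"]
missing = missing_components(installed)
print(missing)
```

An empty result means all mandatory components are present. NodeD is optional, so its absence is not reported.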

Supported Fault Handling Types

Public faults support job-level rescheduling, pod-level rescheduling, and process-level rescheduling.

(Optional) Configuring the Fault Detection Level and Sender

Resumable training provides a default fault level and a default set of supported fault senders for public faults. To modify the fault level or the fault senders, see Public Faults. However, do not change these settings unless otherwise specified.