Public Faults
Public faults are faults reported by fault senders outside MindCluster (non-MindCluster components). They include NPU faults, node faults, network faults, and storage faults.
To receive public faults, Ascend Device Plugin must be installed on a node and device-info-cm must be generated.
Reporting Mechanism
Upon fault detection, the public fault sender transmits the fault details to ClusterD through ConfigMap or gRPC. ClusterD summarizes the received information, writes it to cluster-info-device-cm, and reports it to Ascend-volcano-plugin.
- ConfigMap: The fault sender writes fault information into a ConfigMap, and ClusterD reads the fault information from it. To inject public faults through the ConfigMap interface, see ConfigMap.
- gRPC: The fault sender sends fault information directly to ClusterD through gRPC. To inject public faults through the gRPC interface, see gRPC.
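
The ConfigMap path can be pictured with a minimal sketch that packs fault records into a ConfigMap manifest. This is illustrative only: the data key `fault-data`, the record fields (`node`, `type`, `level`), and the function name are hypothetical placeholders, not the schema ClusterD expects. The actual field names and any required labels are defined in the ConfigMap interface reference.

```python
import json


def build_public_fault_configmap(name, namespace, faults, timestamp_ms):
    """Return a Kubernetes ConfigMap manifest (as a dict) that carries fault records.

    NOTE: the "fault-data" key and the layout of the payload are hypothetical;
    consult the ConfigMap interface reference for the real schema.
    """
    payload = {
        "timestamp": timestamp_ms,  # when the sender detected the faults
        "faults": faults,           # list of fault records (hypothetical fields)
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # ConfigMap values must be strings, so the payload is serialized to JSON.
        "data": {"fault-data": json.dumps(payload)},
    }


manifest = build_public_fault_configmap(
    name="demo-public-fault",
    namespace="default",
    faults=[{"node": "node-1", "type": "storage", "level": "hypothetical-level"}],
    timestamp_ms=1700000000000,
)
print(json.dumps(manifest, indent=2))
```

The resulting manifest could then be applied with any Kubernetes client. Serializing the payload into a single string value keeps the sketch independent of how many fault records are reported at once.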

Required Components
For public fault detection to work properly, install the following components.
- Mandatory components: Volcano, Ascend Operator, Ascend Device Plugin, and ClusterD
- Optional component: NodeD
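
The component requirements above can be expressed as a small pre-flight check. The sketch below only compares names against the lists given here. How the set of installed components is obtained (for example, by querying the cluster) is left out, and the function name is a hypothetical helper.

```python
# Component names taken from the lists above.
MANDATORY_COMPONENTS = ("Volcano", "Ascend Operator", "Ascend Device Plugin", "ClusterD")
OPTIONAL_COMPONENTS = ("NodeD",)


def missing_mandatory(installed):
    """Return the mandatory components absent from `installed`, in list order."""
    installed_set = set(installed)
    return [c for c in MANDATORY_COMPONENTS if c not in installed_set]


# Example: NodeD is optional, so only the two absent mandatory components are reported.
missing = missing_mandatory(["Volcano", "ClusterD", "NodeD"])
if missing:
    print("Install before enabling public fault detection:", ", ".join(missing))
```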
Supported Fault Handling Types
Public faults support job-level rescheduling, pod-level rescheduling, and process-level rescheduling.
(Optional) Configuring the Fault Detection Level and Sender
Resumable training provides a default fault level and a default set of supported fault senders for public faults. To modify the fault level or the fault senders, see Public Faults. Do not change these settings unless otherwise specified.