Processor Faults

Processor faults refer to base software faults and processor hardware faults of NPUs. With resumable training, processor faults are detected and reported by the device manager Ascend Device Plugin.

NPU Reporting Mechanism

When an NPU is faulty, its fault management framework obtains the fault information and uploads it to the fault management framework of the NPU driver. Upon receiving the fault information, the driver's framework reports the fault to Ascend Device Plugin through the DCMI, as shown in Figure 1.

Ascend Device Plugin obtains the processor health status through the DCMI. Two modes are currently available:

Fault subscription mode: Upon startup, Ascend Device Plugin calls the DCMI fault subscription interface to start monitoring. When a fault occurs, the driver reports the fault event to Ascend Device Plugin through this interface. After the fault is rectified, the driver reports the rectification event through the same interface.
Fault polling mode: Ascend Device Plugin queries the processor fault status through the fault query interface at a fixed interval. This mode is used when the device driver does not support subscription.
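
The choice between the two modes can be illustrated with a minimal sketch. It does not call the real DCMI: FakeDriver, HealthMonitor, and all of their methods are hypothetical stand-ins, and the fault code 1001 is a placeholder. The sketch only shows the pattern described above, that is, subscribing when the driver supports it and falling back to polling otherwise.

```python
from typing import Callable, Dict, Optional

# Callback signature assumed for this sketch: (device_id, fault_code, occurred).
FaultCallback = Callable[[int, int, bool], None]


class FakeDriver:
    """Hypothetical stand-in for the NPU driver; not the real DCMI."""

    def __init__(self, supports_subscription: bool) -> None:
        self.supports_subscription = supports_subscription
        self._callback: Optional[FaultCallback] = None
        self._active: Dict[int, int] = {}  # device_id -> active fault code

    def subscribe(self, callback: FaultCallback) -> bool:
        """Register a callback; return False if subscription is unsupported."""
        if not self.supports_subscription:
            return False
        self._callback = callback
        return True

    def raise_fault(self, device_id: int, code: int) -> None:
        self._active[device_id] = code
        if self._callback is not None:
            self._callback(device_id, code, True)  # fault event

    def clear_fault(self, device_id: int) -> None:
        code = self._active.pop(device_id)
        if self._callback is not None:
            self._callback(device_id, code, False)  # rectification event

    def query_faults(self) -> Dict[int, int]:
        """Fault query used by polling mode."""
        return dict(self._active)


class HealthMonitor:
    """Tracks unhealthy processors using subscription, else polling."""

    def __init__(self, driver: FakeDriver) -> None:
        self.driver = driver
        self.unhealthy: Dict[int, int] = {}
        subscribed = driver.subscribe(self._on_event)
        self.mode = "subscription" if subscribed else "polling"

    def _on_event(self, device_id: int, code: int, occurred: bool) -> None:
        if occurred:
            self.unhealthy[device_id] = code
        else:
            self.unhealthy.pop(device_id, None)

    def poll_once(self) -> None:
        """One polling tick; a real plugin would run this at a fixed interval."""
        self.unhealthy = self.driver.query_faults()
```

In subscription mode the monitor's view changes as soon as the driver emits an event. In polling mode a fault becomes visible only at the next poll_once() call, so the polling interval bounds the detection delay.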

Figure 1 Processor fault reporting

Ascend Device Plugin Reporting Mechanism

After obtaining the processor fault information, Ascend Device Plugin reports the information to Kubernetes through a ConfigMap. Figure 2 shows the Ascend Device Plugin fault reporting mechanism.

Figure 2 Fault reporting to Kubernetes

The reporting path varies according to the fault handling mode.

Rescheduling mode: After obtaining a processor fault, Ascend Device Plugin writes the fault information to device-info-cm of the node. For details about the fields, see Table 1. ClusterD reads device-info-cm of each node to detect the processor fault and report it to the scheduler.
Graceful fault tolerance mode: After Ascend Device Plugin detects a recoverable processor fault, it writes the fault information to reset-info-cm of the current job. The service container can obtain the processor fault by mounting reset-info-cm as a file and reading the file.

If the fault fails to be rectified in graceful fault tolerance mode, the fault is reported in rescheduling mode.
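
The routing rules above can be summarized in a short sketch. The function select_configmap and its parameters are hypothetical names used only for illustration. The sketch also assumes that an unrecoverable fault in graceful fault tolerance mode takes the rescheduling path, which this section does not state explicitly.

```python
DEVICE_INFO_CM = "device-info-cm"  # per-node ConfigMap read by ClusterD
RESET_INFO_CM = "reset-info-cm"    # per-job ConfigMap mounted into the service container


def select_configmap(mode: str, recoverable: bool, recovery_failed: bool = False) -> str:
    """Return the ConfigMap a processor fault is written to.

    mode: "rescheduling" or "graceful" (hypothetical labels for the two modes).
    recoverable: whether the fault is a recoverable processor fault.
    recovery_failed: whether graceful fault tolerance has already failed.
    """
    if mode not in ("rescheduling", "graceful"):
        raise ValueError(f"unknown fault handling mode: {mode}")
    if mode == "graceful" and recoverable and not recovery_failed:
        return RESET_INFO_CM
    # Rescheduling mode, a failed graceful recovery, or (by assumption)
    # an unrecoverable fault in graceful mode.
    return DEVICE_INFO_CM
```

For example, select_configmap("graceful", recoverable=True) returns "reset-info-cm", and the same call with recovery_failed=True returns "device-info-cm".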

Required Components

For processor fault detection to work properly, install Volcano, Ascend Operator, Ascend Device Plugin, and ClusterD.

(Optional) Configuring the Fault Detection Level

Resumable training provides default values for the fault frequency, fault duration, fault level, and fault handling policy of processor faults. To modify the fault handling policy, see Processor Faults. Keep the default settings unless otherwise specified.

Supported Fault Handling Types

The supported fault handling types are job-level rescheduling, pod-level rescheduling, process-level rescheduling, process-level online recovery, and graceful fault tolerance.

Process-level online recovery is available exclusively for on-chip memory uncorrectable errors.
