Parameter Plane Network Faults

NPU parameter plane network faults include processor network faults and interconnect device faults.

A faulty parameter plane network can interrupt training jobs or degrade their performance. When an interconnect device fault occurs, the MindCluster cluster scheduling components reschedule the job based on the fault level.

Job rescheduling is not directly triggered by parameter plane network faults. Instead, it is triggered only when a training job is interrupted unexpectedly due to parameter plane faults.
To recover from parameter plane network faults, enable unconditional retry upon service plane faults by configuring the fault-retry-times, restartPolicy, and policies parameters in the job YAML file. For details about the parameters, see YAML Parameters.
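
As a quick sanity check before submitting a job, the following minimal sketch reports which of the three parameters named above are absent from a parsed job manifest. It assumes the YAML file has already been loaded into nested Python dicts and lists (for example with a YAML parser of your choice). It searches the whole structure rather than assuming where each parameter lives, and the helper names `keys_present` and `missing_keys` are hypothetical. It checks presence only, not whether the values are valid; see YAML Parameters for the authoritative placement and values.

```python
from typing import Any, Set

# Parameter names taken from the text above.
REQUIRED_KEYS = {"fault-retry-times", "restartPolicy", "policies"}


def keys_present(node: Any) -> Set[str]:
    """Collect every required key that appears anywhere in a parsed manifest."""
    found: Set[str] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key in REQUIRED_KEYS:
                found.add(key)
            found |= keys_present(value)
    elif isinstance(node, list):
        for item in node:
            found |= keys_present(item)
    return found


def missing_keys(manifest: Any) -> Set[str]:
    """Return the required parameters that the manifest does not set at all."""
    return REQUIRED_KEYS - keys_present(manifest)
```

An empty result from `missing_keys` means all three parameters appear somewhere in the manifest.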

Ascend Device Plugin is responsible for detecting parameter plane network faults. Figure 1 shows the fault detection mechanism.

Figure 1 Fault detection

Key Steps

Processor Network Faults

Every 2.5 seconds, each NPU checks whether it can communicate with the gateway address and reports the result through the fault management framework.
The RoCE driver monitors the link status of the NPU network port in real time and reports linkdown or linkup events through the fault management framework.
Ascend Device Plugin obtains information from the fault management framework through the DCMI: it polls for the gateway detection result, and subscribes to and reports network port linkdown or linkup events in real time. Ascend Device Plugin tracks how long gateway detection exceptions and linkdown events last. If the duration is less than or equal to the RoCE network timeout interval (20 seconds by default), the fault is marked as an NPU network fault (not handled by default, which may cause a parameter plane network fault). If the duration exceeds this interval, the fault is handled according to the preset fault level.
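
The duration check in the last step can be illustrated with a minimal sketch. This is not Ascend Device Plugin code: the function names and the returned labels are hypothetical, and the abnormal duration is approximated by counting consecutive failed gateway probes at the 2.5-second interval, which is an assumption made for illustration.

```python
from typing import Sequence

ROCE_TIMEOUT_S = 20.0    # default RoCE network timeout interval, from the text
PROBE_INTERVAL_S = 2.5   # gateway check interval, from the text


def abnormal_duration(probe_ok: Sequence[bool],
                      interval_s: float = PROBE_INTERVAL_S) -> float:
    """Approximate how long the most recent run of failed probes has lasted."""
    failed = 0
    for ok in reversed(probe_ok):
        if ok:
            break
        failed += 1
    return failed * interval_s


def classify(duration_s: float, timeout_s: float = ROCE_TIMEOUT_S) -> str:
    """Map an abnormal duration to the handling described in the text."""
    if duration_s <= 0:
        return "healthy"
    if duration_s <= timeout_s:
        # Marked as an NPU network fault; not handled by default.
        return "npu_network_fault"
    # Longer than the timeout: the preset fault level decides the handling.
    return "preset_fault_level"
```

With these assumptions, eight consecutive failed probes (20 seconds) still count as an NPU network fault, while a ninth pushes the duration past the timeout and the preset fault level applies.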

UnifiedBus Interconnect Device Faults

The UnifiedBus interconnect device writes the fault that occurs on the device into a local queue.
The UnifiedBus query interface reads the queue, caches the faults, and summarizes them for processing.
Ascend Device Plugin calls the interface to obtain faults related to the UnifiedBus interconnect device through subscription or polling, and writes the faults to device-info-cm for reporting.

Fault Reporting Mechanism

If a network fault occurs on the processor, the NPU fault management framework obtains the fault information and reports the information to the NPU driver. After receiving the fault information, the NPU driver reports the information to Ascend Device Plugin through the DCMI. Ascend Device Plugin obtains the processor health status through the DCMI. Currently, the following two modes are provided for obtaining information:
- Fault subscription mode: When Ascend Device Plugin is started, the fault subscription DCMI is called for monitoring registration. When a fault occurs or is rectified, the driver reports the fault occurrence or rectification event to Ascend Device Plugin through this interface.
- Fault polling mode: The fault query interface is used to query the processor fault status at a fixed interval. This mode is used if the device driver does not support subscription.
When the UnifiedBus interconnect device is faulty, MindCluster obtains the fault information through the UnifiedBus query interface. Currently, two fault query modes are provided:
- Fault subscription mode: When Ascend Device Plugin is started, the fault handling callback is registered with the UnifiedBus query interface. After a fault occurs, the callback reports the fault to Ascend Device Plugin. When the fault is rectified, the recovery event is reported through this interface.
- Fault polling mode: Ascend Device Plugin calls the full fault query interface every 5 minutes.
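
The two modes above can be sketched as follows. This is a simplified illustration, not the real DCMI or UnifiedBus query interface: `FakeFaultSource`, `FaultCollector`, and their methods are hypothetical stand-ins. The sketch also applies the fallback rule described for processor faults (polling when subscription is unsupported) to a generic source, which is an assumption for the UnifiedBus case.

```python
from typing import Callable, List, Optional, Set

UB_POLL_INTERVAL_S = 5 * 60  # UnifiedBus full-query polling period, from the text


class FakeFaultSource:
    """Hypothetical stand-in for a driver or query interface (not a real API)."""

    def __init__(self, supports_subscription: bool) -> None:
        self.supports_subscription = supports_subscription
        self._callback: Optional[Callable[[str, str], None]] = None
        self._active: Set[str] = set()

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        if not self.supports_subscription:
            raise NotImplementedError("subscription is not supported")
        self._callback = callback

    def query_all(self) -> List[str]:
        """Full fault query, as used in polling mode."""
        return sorted(self._active)

    def set_fault(self, code: str, active: bool) -> None:
        """Simulate a fault occurring (True) or being rectified (False)."""
        if active:
            self._active.add(code)
        else:
            self._active.discard(code)
        if self._callback is not None:
            self._callback(code, "occur" if active else "recover")


class FaultCollector:
    """Prefers subscription; falls back to polling if it is unavailable."""

    def __init__(self, source: FakeFaultSource) -> None:
        self.source = source
        self.active: Set[str] = set()
        try:
            source.subscribe(self._on_event)
            self.mode = "subscription"
        except NotImplementedError:
            self.mode = "polling"

    def _on_event(self, code: str, kind: str) -> None:
        if kind == "occur":
            self.active.add(code)
        else:
            self.active.discard(code)

    def poll_once(self) -> None:
        """Refresh the local view from a full query; call at a fixed interval."""
        self.active = set(self.source.query_all())
```

In subscription mode the collector's view changes as soon as the source fires an event. In polling mode it changes only on the next `poll_once` call, so a fault can go unnoticed for up to one polling period.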

Ascend Device Plugin Reporting Mechanism

After detecting a parameter plane network fault, Ascend Device Plugin writes the fault information to device-info-cm and reports the fault information to Kubernetes in the ConfigMap format. For details about the fields in device-info-cm, see Table 1.

Figure 2 shows the fault reporting mechanism of Ascend Device Plugin.

Figure 2 Fault reporting

Watchdog Fault Detection

If the parameter plane network link is abnormal (that is, the parameter plane network is faulty), healthy NPUs in a job may fail to communicate with the faulty NPU. Consequently, collective communication on all NPUs blocks until it times out. The job's collective communication exits only after the wait timeout exception occurs, which takes 30 minutes by default.

If watchdog is enabled (together with unconditional retry upon service plane faults), the faulty NPU can be isolated when the parameter plane network link becomes abnormal, and the job can be rescheduled to a healthy NPU. In this way, the job can exit within 6 minutes.

Watchdog can be used only in the PyTorch and MindSpore frameworks.

Required Components

To ensure the normal use of parameter plane network fault detection, install Volcano, Ascend Operator, Ascend Device Plugin, and ClusterD.

Supported Fault Handling Types

Job-level rescheduling, pod-level rescheduling, and process-level rescheduling are supported.

(Optional) Configuring the Fault Detection Level

Resumable training provides the default fault level and fault handling policy for parameter plane faults. If you need to modify the fault handling policy, see Parameter Plane Network Faults. However, do not change it unless otherwise specified.
