Fault Description

Resumable training identifies fault states in the cluster and training services using fault detection mechanisms and resolves issues based on the detection results. Currently, this feature supports fault detection for Ascend hardware faults, training service faults, and other faults.

Among MindCluster cluster scheduling components, Ascend Device Plugin detects NPU faults and NPU parameter plane network faults; NodeD detects server faults, DPC shared storage faults, and UnifiedBus network faults; ClusterD detects public faults; and Volcano detects container exceptions on the service plane. The following figure shows the overall fault detection architecture.

Ascend Device Plugin, deployed on the compute server, obtains the NPU fault and parameter plane network fault information through the driver, and reports the fault information to the management server.
NodeD, deployed on the compute server, obtains the information about server faults, DPC shared storage faults, and UnifiedBus network faults through the driver, and reports the fault information to the management server.
Kubernetes on the compute server monitors the training container status. If a container becomes abnormal, the exception is reported to Kubernetes, and Volcano, deployed on the management server, retrieves the fault information through Kubernetes.
ClusterD, deployed on the management server, obtains public faults through the public fault interface, summarizes the received information, and writes the information to cluster-info-device-cm.
(Optional) ClusterD, deployed on the management server, summarizes the fault information reported by Ascend Device Plugin and NodeD within a cluster.
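
The division of responsibilities described above can be summarized as a small lookup table. The following Python sketch is illustrative only: the `Detector` type, the `DETECTORS` tuple, and the `detectors_for` helper are hypothetical names and are not part of MindCluster. The component names, deployment locations, and fault classes are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    """One fault-detecting component (hypothetical helper type)."""
    name: str
    deployed_on: str
    detects: tuple


# Which component detects which class of fault, as described in this section.
DETECTORS = (
    Detector("Ascend Device Plugin", "compute server",
             ("NPU fault", "parameter plane network fault")),
    Detector("NodeD", "compute server",
             ("server fault", "DPC shared storage fault",
              "UnifiedBus network fault")),
    Detector("ClusterD", "management server",
             ("public fault",)),
    Detector("Volcano", "management server",
             ("service plane container exception",)),
)


def detectors_for(fault_class: str) -> list:
    """Return the names of the components that detect the given fault class."""
    return [d.name for d in DETECTORS if fault_class in d.detects]


if __name__ == "__main__":
    for fault in ("NPU fault", "DPC shared storage fault", "public fault"):
        print(fault, "->", detectors_for(fault))
```

The fault-class strings are labels chosen for this sketch; the actual fault codes are listed in the reference documents cited in Table 1.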

Supported Fault Types

Currently, more than 200 faults can be detected. For details about the fault types, see Table 1. For descriptions of typical faults, see Typical Faults.

**Table 1** Fault types
Fault Type	Fault Description
Node faults	Include node health status, node hardware faults, and DPC shared storage faults. For details about fault codes, see Node Fault Code Reference Documents. NOTE: If a node breaks down or restarts due to a hardware fault, NodeD can neither detect the fault type nor report the fault.
Processor faults	Include processor faults reported by the DCMI and processor network faults detected by hccn_tool (device network detection tool). For details about fault codes, see Processor Fault Code Reference Documents.
Parameter plane network faults	Include processor network faults and UnifiedBus interconnect device faults. Processor network faults: the dedicated network for parameter exchange between processors is faulty, for example, an NPU network port is faulty. UnifiedBus interconnect device faults (Atlas A3 training product).
Service plane faults	The training job exits abnormally, and the pod status becomes Failed. NOTE: You can run the kubectl describe pod {Pod name} -n {NAMESPACE} | grep Status: command to check whether the pod status is Failed. Command output: Status: Failed
Public faults	Faults reported by other fault senders (non-MindCluster components), including NPU faults, node faults, network faults, and storage faults.
Pingmesh UnifiedBus network faults	NPU network faults detected on the HCCS network within or across SuperPoDs.
Performance degradation	MindCluster diagnoses performance degradation (slow nodes) in a cluster based on the profiling capability provided by MindStudio. This function supports dynamic instrumentation (dotting) and data persistence: instrumentation can be enabled or disabled in real time without restarting the job, so diagnosis does not interrupt training.
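
The service plane check in Table 1 (running kubectl describe pod and filtering for the Status: line) can also be scripted. The following is a minimal Python sketch, assuming kubectl is installed and configured for the cluster. The `pod_status` and `describe_pod` helper names and the sample output are hypothetical and only illustrate the idea.

```python
import subprocess


def pod_status(describe_output: str):
    """Return the value of the top-level 'Status:' line, or None if absent."""
    for line in describe_output.splitlines():
        # The pod-level status is printed without indentation.
        if line.startswith("Status:"):
            return line.split(":", 1)[1].strip()
    return None


def describe_pod(pod_name: str, namespace: str) -> str:
    """Run 'kubectl describe pod <pod_name> -n <namespace>' and return stdout."""
    result = subprocess.run(
        ["kubectl", "describe", "pod", pod_name, "-n", namespace],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical excerpt of 'kubectl describe pod' output for a failed pod.
    sample = (
        "Name:         demo-pod\n"
        "Namespace:    default\n"
        "Status:       Failed\n"
    )
    if pod_status(sample) == "Failed":
        print("Service plane fault: pod status is Failed")
```

On a live cluster, `pod_status(describe_pod(name, namespace))` would replace the sample text. Parsing is kept separate from the kubectl call so that it can be tested without a cluster.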

ConfigMap Overview

Ascend Device Plugin on each compute node creates a ConfigMap file to record the NPU and UnifiedBus interconnect device information of the node. The ConfigMap file is named mindx-dl-deviceinfo-<nodename>, referred to as device-info-cm. Fault information is reported through this ConfigMap. For details about the fields in the ConfigMap file, see Table 1.
When a fault occurs on a node, NodeD on each compute node creates a ConfigMap file that records the device fault information of the node. The ConfigMap file is named mindx-dl-nodeinfo-<nodename>, referred to as node-info-cm. Fault information is reported through this ConfigMap. For details about the fields in the ConfigMap file, see Table 1.
ClusterD creates a ConfigMap file for recording device information in a cluster. The ConfigMap file name is cluster-info-<device/switch>-<[0-5]> or cluster-info-node-cm (cluster-info-cm for short). Node and processor fault information is reported through cluster-info-cm.
When creating a job, you must configure a ConfigMap file in the job's YAML file. The ConfigMap file is named reset-config-<job-name>, referred to as reset-info-cm. The ConfigMap file is mounted to the /user/restore/reset/config directory of the container. Ascend Device Plugin automatically mounts the ConfigMap to the /user/restore/reset/<job-namespace>.<job-name> directory of the node.
Alternatively, instead of using the ConfigMap, you can mount the /user/restore/reset/<job-namespace>.<job-name> directory on the node to the /user/restore/reset/config directory of the container. For details about the fields in the ConfigMap file, see Table 2.
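
The per-node and per-job naming rules above are simple string patterns. The following Python sketch shows how the names and paths could be derived. The function and constant names are hypothetical helpers introduced for illustration, and the example node, namespace, and job names are made up. Only the name patterns and directory paths come from this section.

```python
# Directory inside the container where reset-info-cm is mounted.
RESET_DIR_IN_CONTAINER = "/user/restore/reset/config"


def device_info_cm(node_name: str) -> str:
    """ConfigMap created by Ascend Device Plugin for a node (device-info-cm)."""
    return f"mindx-dl-deviceinfo-{node_name}"


def node_info_cm(node_name: str) -> str:
    """ConfigMap created by NodeD for a node (node-info-cm)."""
    return f"mindx-dl-nodeinfo-{node_name}"


def reset_info_cm(job_name: str) -> str:
    """ConfigMap configured in the job YAML file (reset-info-cm)."""
    return f"reset-config-{job_name}"


def reset_dir_on_node(job_namespace: str, job_name: str) -> str:
    """Node directory to which the reset ConfigMap is mounted."""
    return f"/user/restore/reset/{job_namespace}.{job_name}"


if __name__ == "__main__":
    # Example values are hypothetical.
    print(device_info_cm("worker-1"))
    print(node_info_cm("worker-1"))
    print(reset_info_cm("train-job"))
    print(reset_dir_on_node("default", "train-job"), "->", RESET_DIR_IN_CONTAINER)
```

The cluster-info-cm names are omitted from the sketch because they are created by ClusterD at the cluster level and do not depend on a node or job name.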
