Node Faults

Node faults are detected by NodeD. These faults include node health status, node hardware faults, and DPC shared storage faults. The details are as follows:

  • Node health status

    After diagnosing the status of a node, NodeD collects the fault information on that node. While the node is faulty, the node status reporting mechanism continuously sends the node status to Volcano. Currently, only the hardware fault information on the node is collected.

  • Node hardware faults

    NodeD sends a fault query request to iBMC through the IPMI driver. Then iBMC responds to NodeD with the current hardware alarm. NodeD collects the hardware alarm and reports the node hardware status to Volcano.

  • DPC shared storage faults

    For nodes that use Scale-Out Storage DPC, you can use the noded-dpc-{version}.yaml file in the NodeD installation package to start the NodeD service and to enable detection and reporting of DPC process exceptions and insufficient memory exceptions.

    When a node is faulty, NodeD reports the node health status and node hardware faults. If no fault occurs, the node is considered healthy by default.

Figure 1 Node fault reporting
  • When a node fault occurs, NodeD updates the node-info-cm content of the node in as little as 5 seconds (by default). For details about the fields, see mindx-dl-nodeinfo-<nodename>.
  • NodeD queries fault information from iBMC every 60 seconds (by default). If more than 30 minutes have passed between the last fault report and the current iBMC query, the fault information is reported to node-info-cm within 1 second.
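
The timing rules above can be sketched as a small decision function. This is an illustrative sketch only, not NodeD's actual implementation: the class and method names are hypothetical, and it assumes that a report is also due on the very first query and whenever the set of fault codes changes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

QUERY_INTERVAL_S = 60            # default iBMC query interval stated in the text
FORCE_REPORT_AFTER_S = 30 * 60   # report again if the last report is older than this


@dataclass
class FaultReporter:
    """Hypothetical model of when fault information is written to node-info-cm."""
    last_report_time: Optional[float] = None
    last_reported_faults: FrozenSet[str] = frozenset()

    def should_report(self, query_time: float, faults: FrozenSet[str]) -> bool:
        # Assumption: nothing has been reported yet, so report once.
        if self.last_report_time is None:
            return True
        # Assumption: a change in the fault set counts as "a node fault occurs".
        if faults != self.last_reported_faults:
            return True
        # From the text: more than 30 minutes between last report and this query.
        return query_time - self.last_report_time > FORCE_REPORT_AFTER_S

    def record_report(self, report_time: float, faults: FrozenSet[str]) -> None:
        # Remember what was reported and when, for the next decision.
        self.last_report_time = report_time
        self.last_reported_faults = faults


if __name__ == "__main__":
    reporter = FaultReporter()
    now = 0.0
    no_faults: FrozenSet[str] = frozenset()
    for _ in range(3):  # simulate three polling cycles without sleeping
        if reporter.should_report(now, no_faults):
            reporter.record_report(now, no_faults)
            print(f"t={now:.0f}s: report")
        else:
            print(f"t={now:.0f}s: skip")
        now += QUERY_INTERVAL_S
```

Times are plain seconds so the logic can be exercised without a real clock. A real agent would take them from a monotonic clock and would also handle failed iBMC queries.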

Required Components

For node fault detection to work properly, install Volcano, Ascend Operator, NodeD, and ClusterD.

Restrictions

  • The node hardware fault reporting capability of NodeD is supported only on the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 900 A3 SuperPoD.
  • The node hardware fault reporting capability of NodeD requires iBMC V2 3.15.0.1 or later (or V2 3.10.02.55) and a product that has the IPMI driver installed. If an earlier iBMC or IPMI version fails to obtain node fault information, only the node health status is reported.
  • To enable SuperPoD fault detection, use iBMC of V3 5.8.3.35 or later.
  • To enable DPC fault detection, use Scale-Out Storage DPC 24.2.0 or later.
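
The version restrictions above can be checked with a simple numeric comparison. The following is a minimal sketch, not part of NodeD: the function names are hypothetical, and it assumes that version strings are dot-separated integers compared component by component (so "02" equals 2). It takes only the numeric part of the version; the V2 or V3 generation must be checked separately.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Split a dotted version such as '3.10.02.55' into integers (3, 10, 2, 55)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(version: str, minimum: str) -> bool:
    """Return True if version is the same as or later than minimum."""
    return parse_version(version) >= parse_version(minimum)


def v2_supports_hardware_fault_reporting(version: str) -> bool:
    """Check an iBMC V2 version against the restriction in the text.

    Supported: 3.15.0.1 or later, plus the single earlier version 3.10.02.55.
    """
    if parse_version(version) == parse_version("3.10.02.55"):
        return True
    return meets_minimum(version, "3.15.0.1")


if __name__ == "__main__":
    for candidate in ("3.10.02.55", "3.14.0.0", "3.15.0.1"):
        print(candidate, v2_supports_hardware_fault_reporting(candidate))
```

The same `meets_minimum` helper could be applied to the other thresholds in the list, for example `meets_minimum(v, "5.8.3.35")` for iBMC V3 on SuperPoD or `meets_minimum(v, "24.2.0")` for Scale-Out Storage DPC.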

Supported Fault Handling Types

The supported fault handling types are job-level rescheduling, pod-level rescheduling, and process-level rescheduling.

(Optional) Configuring the Fault Detection Level

Resumable training provides default fault levels and fault handling policies for the different fault codes of node hardware faults. To modify a fault handling policy, see Node Hardware Faults. However, do not change the defaults unless you have a specific requirement.