Custom Node Faults

NodeDConfiguration.json is the system-level configuration file of NodeD. Do not modify it unless necessary. To change the level of a fault code, edit the mindx-dl-node-fault-config file, which is created from NodeDConfiguration.json. For details, see (Optional) Configuring Node Hardware Fault Levels.

Table 1 Fault description

| Fault Level | Fault Handling Policy | Description |
| --- | --- | --- |
| NotHandleFault | No action is required. | There is no impact on jobs. |
| PreSeparateFault | If a job is running on the node, the fault is left unhandled for now, and no new jobs are scheduled to the node. | Jobs may be affected. |
| SeparateFault | Jobs are rescheduled. | Jobs are affected. |

Note:

Fault level priority: NotHandleFault < PreSeparateFault < SeparateFault
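
The priority order above can be sketched as an ordered enumeration. This is only an illustrative model, not NodeD code: the numeric values, the `HANDLING_POLICY` table, and the `more_severe` helper are assumptions introduced for this example, and the policy strings paraphrase Table 1.

```python
from enum import IntEnum


class FaultLevel(IntEnum):
    """Fault levels ordered by priority (numeric values are illustrative)."""
    NotHandleFault = 0
    PreSeparateFault = 1
    SeparateFault = 2


# Handling policy per level, paraphrased from Table 1 (hypothetical lookup table).
HANDLING_POLICY = {
    FaultLevel.NotHandleFault: "no action required",
    FaultLevel.PreSeparateFault: "leave running jobs alone; schedule no new jobs",
    FaultLevel.SeparateFault: "reschedule jobs",
}


def more_severe(a: FaultLevel, b: FaultLevel) -> FaultLevel:
    """Return whichever of two fault levels has the higher priority."""
    return max(a, b)


if __name__ == "__main__":
    level = more_severe(FaultLevel.NotHandleFault, FaultLevel.PreSeparateFault)
    print(level.name, "->", HANDLING_POLICY[level])
```

Using an integer-backed enumeration makes the `<` comparisons in the note directly expressible in code.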

Table 2 Node status description

| Node Status | Highest Fault Level | Fault Handling Policy | Description |
| --- | --- | --- | --- |
| Healthy | NotHandleFault | No action is required. | The node is healthy and can run training jobs normally. |
| PreSeparate | PreSeparateFault | If a job is running on the node, the fault is left unhandled for now, and no new jobs are scheduled to the node. | The node is subhealthy and may not affect running jobs for the time being. Once a job is affected and exits, no further jobs are scheduled to the node. |
| UnHealthy | SeparateFault | Jobs are rescheduled. | The node is faulty and training jobs will be affected. Jobs should be migrated off the node immediately. |

Note:

  • The health status of a node is determined by the highest-level hardware fault present on the node.
  • Healthy, PreSeparate, and UnHealthy are node statuses defined by MindCluster and are used for subsequent job scheduling and processing.
  • To resume training after a job on a PreSeparate node exits abnormally, enable the unconditional retry function.
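
The rule that a node's status follows its highest fault level can be sketched as below. This is a minimal illustration under stated assumptions, not the NodeD implementation: the function names `node_status` and `accepts_new_jobs` are hypothetical, and treating a node with no reported faults as Healthy is an assumption not stated in the tables.

```python
# Priority order from the fault-level note, lowest to highest.
PRIORITY = ["NotHandleFault", "PreSeparateFault", "SeparateFault"]

# Mapping from highest fault level to node status, taken from Table 2.
NODE_STATUS = {
    "NotHandleFault": "Healthy",
    "PreSeparateFault": "PreSeparate",
    "SeparateFault": "UnHealthy",
}


def node_status(fault_levels):
    """Return the node status for the fault levels currently present on a node."""
    if not fault_levels:
        # Assumption: a node with no reported faults is treated as Healthy.
        return "Healthy"
    highest = max(fault_levels, key=PRIORITY.index)
    return NODE_STATUS[highest]


def accepts_new_jobs(status):
    """Per Table 2, only Healthy nodes receive newly scheduled jobs."""
    return status == "Healthy"


if __name__ == "__main__":
    status = node_status(["NotHandleFault", "PreSeparateFault"])
    print(status, accepts_new_jobs(status))
```

A single SeparateFault outweighs any number of lower-level faults, so the node is reported as UnHealthy whenever one is present.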