Configuration File Description

During resumable training, node hardware faults are handled according to their fault levels. NodeD obtains the fault code of the current fault and handles the fault based on the level configured for that code in NodeDConfiguration.json. Table 1 describes the fault levels and their handling policies.

NodeDConfiguration.json is a system-level configuration file of NodeD. Do not modify it unless necessary. To change the level of a fault code, use the mindx-dl-node-fault-config file, which is created from NodeDConfiguration.json. For details, see (Optional) Configuring Node Hardware Fault Levels.

Table 1 Fault description

Fault Level | Fault Handling Policy | Description
NotHandleFault | Handling not required | There is no impact on jobs.
PreSeparateFault | If a job is running on the node, the job is left running and the fault is not handled; no new jobs are scheduled to the node. | Jobs may be affected.
SeparateFault | Job rescheduling | Jobs are affected.

Note:

Fault level priority: NotHandleFault < PreSeparateFault < SeparateFault
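
The priority order above can be sketched in Python as follows. This is only an illustration, not NodeD's implementation: the numeric values and the helper name highest_fault_level are assumptions chosen to mirror the documented ordering.

```python
from enum import IntEnum


class FaultLevel(IntEnum):
    """Fault levels ordered by priority; the numeric values are assumed."""
    NotHandleFault = 0
    PreSeparateFault = 1
    SeparateFault = 2


def highest_fault_level(levels):
    """Return the highest-priority level among the given fault levels.

    An empty input is treated as "no fault", i.e. NotHandleFault.
    """
    return max(levels, default=FaultLevel.NotHandleFault)


if __name__ == "__main__":
    current = [FaultLevel.NotHandleFault, FaultLevel.PreSeparateFault]
    print(highest_fault_level(current).name)
```

With one NotHandleFault and one PreSeparateFault present, the PreSeparateFault level wins because it has the higher priority.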

Table 2 Node status description

Node Status | Highest Fault Level | Fault Handling Policy | Description
Healthy | NotHandleFault | Handling not required | The node is healthy and can run training jobs properly.
PreSeparate | PreSeparateFault | If a job is running on the node, the job is left running and the fault is not handled; no new jobs are scheduled to the node. | The node is subhealthy and might not affect jobs. If a job is affected and exits, it is not scheduled to this node again.
UnHealthy | SeparateFault | Job rescheduling | The node is faulty and training jobs will be affected. Jobs should be migrated off the node immediately.

Note:

  • The health status of a node is determined by the highest-level hardware fault present on the node.
  • Healthy, PreSeparate, and UnHealthy are node statuses defined by MindCluster and are used for subsequent job scheduling and processing.
  • For more details about node status and hardware fault information, see Query the Reported Fault Information.
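
As a rough sketch of how Table 2 and the notes above fit together, the following Python maps the fault levels present on a node to a node status and a scheduling decision. The function names, the dictionary keys accept_new_jobs and reschedule_running_jobs, and the treatment of a fault-free node as Healthy are all hypothetical and are not taken from MindCluster.

```python
# Fault levels in ascending priority, as stated in the note under Table 1.
PRIORITY = ["NotHandleFault", "PreSeparateFault", "SeparateFault"]

# Table 2 as a lookup: highest fault level -> node status.
NODE_STATUS = {
    "NotHandleFault": "Healthy",
    "PreSeparateFault": "PreSeparate",
    "SeparateFault": "UnHealthy",
}


def node_status(fault_levels):
    """Derive the node status from the highest fault level on the node.

    Assumption: a node with no reported faults is treated as Healthy.
    """
    highest = max(fault_levels, key=PRIORITY.index, default="NotHandleFault")
    return NODE_STATUS[highest]


def scheduling_decision(status):
    """Hypothetical helper that restates the handling policies of Table 2."""
    return {
        # Only healthy nodes receive new jobs.
        "accept_new_jobs": status == "Healthy",
        # Running jobs are moved away only when the node is UnHealthy.
        "reschedule_running_jobs": status == "UnHealthy",
    }


if __name__ == "__main__":
    status = node_status(["NotHandleFault", "PreSeparateFault"])
    print(status, scheduling_decision(status))
```

In this sketch a PreSeparate node keeps its running jobs but receives no new ones, while an UnHealthy node triggers rescheduling.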