Configuration File Description
During resumable training, node hardware faults are handled according to their fault levels. NodeD obtains the fault code of the current fault and handles the fault according to the level configured for that code in NodeDConfiguration.json. The following tables describe the fault levels, their handling policies, and the resulting node status.
NodeDConfiguration.json is a system-level configuration file of NodeD. Do not modify it unless necessary. To change the level of a fault code, use the mindx-dl-node-fault-config file created from NodeDConfiguration.json. For details, see (Optional) Configuring Node Hardware Fault Levels.

| Fault Level | Fault Handling Policy | Description |
|---|---|---|
| NotHandleFault | Handling not required | There is no impact on jobs. |
| PreSeparateFault | If there is a job running on the node, the fault is not handled, and no job is scheduled to the node. | Jobs may be affected. |
| SeparateFault | Job rescheduling | Jobs are affected. |

Note: Fault level priority: NotHandleFault < PreSeparateFault < SeparateFault


| Node Status | Highest Fault Level | Fault Handling Policy | Description |
|---|---|---|---|
| Healthy | NotHandleFault | Handling not required | The node is healthy and training can run on it normally. |
| PreSeparate | PreSeparateFault | If there is a job running on the node, the fault is not handled, and no job is scheduled to the node. | The node is subhealthy, and running jobs may not be affected. If jobs are affected and exit, they are not scheduled to the node again. |
| UnHealthy | SeparateFault | Job rescheduling | The node is faulty and training jobs are affected. Jobs should be moved off the node immediately. |
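
The relationship between fault levels and node status described above can be illustrated with a short Python sketch. It is not NodeD's implementation: the `FaultLevel` enum, the `NODE_STATUS` mapping, and the `node_status` function are hypothetical names introduced only for illustration. The sketch also assumes that a node with no active faults is treated as Healthy. It encodes the priority NotHandleFault < PreSeparateFault < SeparateFault and derives the node status from the highest level among the active faults.

```python
from enum import IntEnum


class FaultLevel(IntEnum):
    """Fault levels ordered by priority (hypothetical helper, not a NodeD type)."""
    NOT_HANDLE_FAULT = 0    # NotHandleFault: no impact on jobs
    PRE_SEPARATE_FAULT = 1  # PreSeparateFault: jobs may be affected
    SEPARATE_FAULT = 2      # SeparateFault: jobs are affected


# Node status that corresponds to the highest fault level on the node.
NODE_STATUS = {
    FaultLevel.NOT_HANDLE_FAULT: "Healthy",
    FaultLevel.PRE_SEPARATE_FAULT: "PreSeparate",
    FaultLevel.SEPARATE_FAULT: "UnHealthy",
}


def node_status(active_levels):
    """Return the node status for the levels of all currently active faults.

    Assumption: an empty list (no active faults) maps to "Healthy".
    """
    highest = max(active_levels, default=FaultLevel.NOT_HANDLE_FAULT)
    return NODE_STATUS[highest]


if __name__ == "__main__":
    faults = [FaultLevel.NOT_HANDLE_FAULT, FaultLevel.PRE_SEPARATE_FAULT]
    print(node_status(faults))
```

Using an ordered enum keeps the priority rule in one place: adding a fault of a higher level raises the node status, and a lower-level fault never lowers it.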