Custom Node Faults
The NodeDConfiguration.json file of NodeD is a system-level configuration file. Do not modify it unless necessary. To change the level of a fault code, use the mindx-dl-node-fault-config file created from NodeDConfiguration.json. For details, see (Optional) Configuring Node Hardware Fault Levels.
| Fault Level | Fault Handling Policy | Description |
|---|---|---|
| NotHandleFault | No action is required. | There is no impact on jobs. |
| PreSeparateFault | If there is a job running on the node, the fault is not handled, and no job is scheduled to the node. | Jobs may be affected. |
| SeparateFault | Job rescheduling | Jobs are affected. |

Note: Fault level priority: NotHandleFault < PreSeparateFault < SeparateFault

| Node Status | Highest Fault Level | Fault Handling Policy | Description |
|---|---|---|---|
| Healthy | NotHandleFault | No action is required. | The node is healthy and training can run on it normally. |
| PreSeparate | PreSeparateFault | If there is a job running on the node, the fault is not handled, and no job is scheduled to the node. | The node is subhealthy; running jobs may not be affected. If a job is affected and exits, no further jobs are scheduled to the node. |
| UnHealthy | SeparateFault | Job rescheduling | The node is faulty and training jobs are affected. Jobs should be migrated off the node immediately. |
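
The relationship between the two tables above can be sketched in a few lines of Python: the node status follows from the highest-priority fault level currently present on the node. This is only an illustration of the documented mapping, not NodeD's implementation. The names `FaultLevel`, `NODE_STATUS`, and `node_status` are hypothetical, and treating a node with no active faults as Healthy is an assumption of this sketch.

```python
from enum import IntEnum


class FaultLevel(IntEnum):
    """Fault levels, ordered by the documented priority (lowest to highest)."""
    NotHandleFault = 0
    PreSeparateFault = 1
    SeparateFault = 2


# Node status for each highest fault level, as listed in the second table.
NODE_STATUS = {
    FaultLevel.NotHandleFault: "Healthy",
    FaultLevel.PreSeparateFault: "PreSeparate",
    FaultLevel.SeparateFault: "UnHealthy",
}


def node_status(active_fault_levels):
    """Return the node status for a list of fault level names.

    Assumption of this sketch: a node with no active faults is Healthy.
    """
    levels = [FaultLevel[name] for name in active_fault_levels]
    highest = max(levels, default=FaultLevel.NotHandleFault)
    return NODE_STATUS[highest]


print(node_status(["NotHandleFault", "PreSeparateFault"]))  # PreSeparate
print(node_status(["PreSeparateFault", "SeparateFault"]))   # UnHealthy
print(node_status([]))                                      # Healthy
```

Using an `IntEnum` lets `max()` pick the highest-priority level directly, which mirrors the ordering NotHandleFault < PreSeparateFault < SeparateFault.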