Configuration Description
Resumable training provides default fault levels and fault handling policies for different fault codes of node hardware faults, processor faults, UnifiedBus interconnect device faults, and public faults. For processor faults, it also provides default fault frequency and duration, and corresponding fault handling policies.
This section describes how to modify fault handling policies. Do not modify them unless otherwise specified.
Fault Level Description
The following table lists the fault levels supported by different types of faults.
| Fault Type | Supported Fault Level |
|---|---|
| Node faults | NotHandleFault, PreSeparateFault, SeparateFault |
| Processor faults | NotHandleFault, RestartRequest, RestartBusiness, FreeRestartNPU, RestartNPU, SeparateNPU, PreSeparateNPU, SubHealthFault |
| UnifiedBus interconnect device faults | NotHandleFault, SubHealthFault, ResetFault, SeparateFault, RestartRequestFault |
| Public faults | NotHandleFault, SeparateNPU, SubHealthFault, PreSeparateNPU |
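The mapping above can be captured in a small lookup, for example to check a customized fault level before it is applied. The following is a minimal sketch and not part of the product: the dictionary keys (`node`, `processor`, `unifiedbus`, `public`) and the function `is_supported_level` are hypothetical names chosen for illustration. Only the fault level names come from the table.

```python
# Hypothetical helper: the keys and the function name are illustrative
# assumptions; the fault level names are taken from the table above.
SUPPORTED_FAULT_LEVELS = {
    "node": {"NotHandleFault", "PreSeparateFault", "SeparateFault"},
    "processor": {
        "NotHandleFault", "RestartRequest", "RestartBusiness", "FreeRestartNPU",
        "RestartNPU", "SeparateNPU", "PreSeparateNPU", "SubHealthFault",
    },
    "unifiedbus": {
        "NotHandleFault", "SubHealthFault", "ResetFault", "SeparateFault",
        "RestartRequestFault",
    },
    "public": {"NotHandleFault", "SeparateNPU", "SubHealthFault", "PreSeparateNPU"},
}


def is_supported_level(fault_type: str, level: str) -> bool:
    """Return True if `level` is listed for `fault_type` in the table above."""
    return level in SUPPORTED_FAULT_LEVELS.get(fault_type, set())
```

A check such as `is_supported_level("node", "RestartNPU")` returns `False`, because RestartNPU is listed only for processor faults.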
The following table describes fault handling policies of each fault level.
| Policy | Description | Rescheduling | Graceful Fault Tolerance |
|---|---|---|---|
| NotHandleFault | If a fault does not affect services, no action is required. | No action is required. | No action is required. |
| RestartRequest | If service execution is affected, the service request needs to be executed again. | Isolate the processor and reschedule the job. NOTE: When an inference job subscribes to fault information, a RestartRequest fault on its inference card does not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled. | Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario. |
| RestartBusiness | If service execution is affected, the service needs to be executed again. | Re-execute the service. | |
| FreeRestartNPU | If service execution is affected, the processor needs to be reset when it is idle. | Reset the processor when it is idle. | |
| RestartNPU | If service execution is affected, the processor needs to be reset immediately. | Stop the training service immediately, reset the processor, and re-execute the service. | |
| SeparateNPU | The fault cannot be rectified, and the processor needs to be isolated. | Isolate the processor and reschedule the job. | |
| SeparateFault | The job will be affected. NOTE: If the level of a UnifiedBus interconnect device fault is SeparateFault, the service fails to run. In this case, you need to replace the component or board. | Job rescheduling is triggered. NOTE: If the UnifiedBus interconnect device is faulty, this fault handling policy stops the current training job, isolates the node, and reschedules the job. | -- |
| RestartRequestFault | If the service fails to run, the service request needs to be executed again. | Stop the current training job, isolate the node, and reschedule the job. | Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario. |
| ResetFault | The service fails to run. | Stop the current training job, isolate the node, and reschedule the job. | -- |
| PreSeparateNPU | Services are not affected for the time being. No job will be scheduled to the processor. | Pre-isolate the processor. | Pre-isolate the processor. |
| PreSeparateFault | The job may be affected. | If the job is running on the node, the system does not rectify the fault and does not schedule other jobs to the node. | -- |
| SubHealthFault | See subHealthyStrategy in the job YAML file (Table 1). | If the processor is subhealthy, rectify the fault based on the YAML configuration. NOTE: If a fault of another severity level occurs on the processor, SubHealthFault does not affect the handling of that fault. | Perform operations based on the policy. |
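The 60-second rule in the RestartRequest note can be read as a simple decision. The sketch below is illustrative only: the function name, its parameters, and the returned strings are assumptions made for this example and do not correspond to any product API. Only the 60-second threshold and the two outcomes come from the table.

```python
# Illustrative sketch of the RestartRequest note under the rescheduling policy.
# Names and return values are hypothetical; the threshold is from the table.
RESTART_REQUEST_GRACE_SECONDS = 60


def restart_request_action(is_subscribed_inference_job: bool,
                           fault_duration_seconds: float) -> str:
    """Decide how a RestartRequest fault is handled under rescheduling."""
    if (is_subscribed_inference_job
            and fault_duration_seconds <= RESTART_REQUEST_GRACE_SECONDS):
        # The fault lasted 60 seconds or less: rescheduling is not triggered.
        return "no_reschedule"
    # Otherwise the processor is isolated and the job is rescheduled.
    return "isolate_and_reschedule"
```

In this sketch, a job that does not subscribe to fault information always falls through to the default behavior for RestartRequest, which is to isolate the processor and reschedule the job.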