Configuration Description

Resumable training provides default fault levels and fault handling policies for the fault codes of node hardware faults, processor faults, UnifiedBus interconnect device faults, and public faults. For processor faults, it also provides default fault frequency and duration settings, along with the corresponding fault handling policies.

This section describes how to modify fault handling policies. Do not modify them unless otherwise specified.

Fault Level Description

The following table lists the fault levels supported by different types of faults.

**Table 1** Fault levels
Fault Type	Supported Fault Level
Node faults	NotHandleFault, PreSeparateFault, SeparateFault
Processor faults	NotHandleFault, RestartRequest, RestartBusiness, FreeRestartNPU, RestartNPU, SeparateNPU, PreSeparateNPU, SubHealthFault
UnifiedBus interconnect device faults	NotHandleFault, SubHealthFault, ResetFault, SeparateFault, RestartRequestFault
Public faults	NotHandleFault, SeparateNPU, SubHealthFault, PreSeparateNPU
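
As a quick way to check a planned configuration change against Table 1, the mapping can be written out as a lookup. This is a minimal sketch, not part of the product: the dictionary keys (`node`, `processor`, `unifiedbus`, `public`) and the helper name `is_supported` are hypothetical and chosen only for illustration. The level names are copied from Table 1.

```python
# Supported fault levels per fault type, transcribed from Table 1.
# The keys are illustrative labels, not identifiers used by the product.
SUPPORTED_LEVELS = {
    "node": {"NotHandleFault", "PreSeparateFault", "SeparateFault"},
    "processor": {
        "NotHandleFault", "RestartRequest", "RestartBusiness",
        "FreeRestartNPU", "RestartNPU", "SeparateNPU",
        "PreSeparateNPU", "SubHealthFault",
    },
    "unifiedbus": {
        "NotHandleFault", "SubHealthFault", "ResetFault",
        "SeparateFault", "RestartRequestFault",
    },
    "public": {
        "NotHandleFault", "SeparateNPU", "SubHealthFault", "PreSeparateNPU",
    },
}


def is_supported(fault_type: str, level: str) -> bool:
    """Return True if `level` is listed for `fault_type` in Table 1."""
    return level in SUPPORTED_LEVELS.get(fault_type, set())
```

For example, `is_supported("node", "RestartNPU")` returns `False`, because RestartNPU is listed only for processor faults.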

The following table describes the fault handling policy for each fault level.

**Table 2** Fault severity and handling policies
Fault Level	Description	Rescheduling	Graceful Fault Tolerance
NotHandleFault	If a fault does not affect services, no action is required.	No action is required.	No action is required.
RestartRequest	If service execution is affected, the service request needs to be executed again.	Isolate the processor and reschedule the job. NOTE: When an inference job subscribes to fault information, a RestartRequest fault on its inference card will not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled.	Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario.
RestartBusiness	If service execution is affected, the service needs to be executed again.		Re-execute the service.
FreeRestartNPU	If service execution is affected, the processor needs to be reset when it is idle.		Reset the processor when it is idle.
RestartNPU	If service execution is affected, the processor needs to be reset immediately.		Stop the training service immediately, reset the processor, and re-execute the service.
SeparateNPU	The fault cannot be rectified, and the processor needs to be isolated.		Isolate the processor and reschedule the job.
SeparateFault	The job will be affected. NOTE: If the level of a UnifiedBus interconnect device fault is SeparateFault, the service fails to run. In this case, replace the component or board.	Job rescheduling is triggered. NOTE: If the UnifiedBus interconnect device is faulty, this fault handling policy stops the current training job, isolates the node, and reschedules the job.	--
RestartRequestFault	If the service fails to run, the service request needs to be executed again.	Stop the current training job, isolate the node, and reschedule the job.	Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario.
ResetFault	The service fails to run.	Stop the current training job, isolate the node, and reschedule the job.	--
PreSeparateNPU	Services are not affected for the time being. No job will be scheduled to the processor.	Pre-isolate the processor.	Pre-isolate the processor.
PreSeparateFault	The job may be affected.	If the job is running on the node, the system does not rectify the fault and does not schedule other jobs to the node.	--
SubHealthFault	See subHealthyStrategy in the job YAML file (Table 1).	If the processor is subhealthy, rectify the fault based on the YAML configuration. NOTE: If a fault of another severity level occurs on the processor, SubHealthFault does not affect the handling of that fault.	Perform operations based on the policy.
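
The 60-second rule in the RestartRequest note can be expressed as a small decision function. This is only a sketch of one reading of the table, not the product's implementation: the function name, parameter names, and constant name are hypothetical, and the behavior for jobs that do not subscribe to fault information is assumed to be the default in the Rescheduling column (isolate the processor and reschedule the job).

```python
# Threshold taken from the RestartRequest note in Table 2.
RESTART_REQUEST_GRACE_SECONDS = 60


def restart_request_triggers_rescheduling(
    subscribed: bool, fault_duration_seconds: float
) -> bool:
    """Decide whether a RestartRequest fault leads to rescheduling.

    subscribed: whether the inference job subscribes to fault information.
    fault_duration_seconds: how long the fault has lasted so far.
    """
    if not subscribed:
        # Assumption: without a subscription, the default rescheduling
        # policy applies (isolate the processor and reschedule the job).
        return True
    # With a subscription, rescheduling happens only if the fault
    # persists for more than 60 seconds.
    return fault_duration_seconds > RESTART_REQUEST_GRACE_SECONDS
```

Under this reading, a subscribed job whose fault clears at exactly 60 seconds is not rescheduled, while one whose fault lasts 61 seconds is.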
