Configuration Description

Resumable training provides default fault levels and fault handling policies for the fault codes of node hardware faults, processor faults, UnifiedBus interconnect device faults, and public faults. For processor faults, it also provides default fault frequency and duration settings, together with the corresponding fault handling policies.

This section describes how to modify fault handling policies. Do not modify them unless otherwise specified.

Fault Level Description

The following table lists the fault levels supported by different types of faults.

Table 1 Fault levels

| Fault Type | Supported Fault Level |
| --- | --- |
| Node faults | NotHandleFault, PreSeparateFault, SeparateFault |
| Processor faults | NotHandleFault, RestartRequest, RestartBusiness, FreeRestartNPU, RestartNPU, SeparateNPU, PreSeparateNPU, SubHealthFault |
| UnifiedBus interconnect device faults | NotHandleFault, SubHealthFault, ResetFault, SeparateFault, RestartRequestFault |
| Public faults | NotHandleFault, SeparateNPU, SubHealthFault, PreSeparateNPU |
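
Because each fault type accepts only the levels listed in Table 1, it can help to check a planned change before editing any configuration. The following is a minimal sketch, not part of the product: the dictionary keys (`node`, `processor`, `unifiedbus`, `public`) and the function name `is_supported_level` are hypothetical names chosen for illustration, while the level names are copied from Table 1.

```python
# Illustrative only: the short keys below are assumptions, not product identifiers.
# The level names are taken from Table 1.
SUPPORTED_FAULT_LEVELS = {
    "node": {"NotHandleFault", "PreSeparateFault", "SeparateFault"},
    "processor": {
        "NotHandleFault", "RestartRequest", "RestartBusiness", "FreeRestartNPU",
        "RestartNPU", "SeparateNPU", "PreSeparateNPU", "SubHealthFault",
    },
    "unifiedbus": {
        "NotHandleFault", "SubHealthFault", "ResetFault",
        "SeparateFault", "RestartRequestFault",
    },
    "public": {"NotHandleFault", "SeparateNPU", "SubHealthFault", "PreSeparateNPU"},
}


def is_supported_level(fault_type: str, level: str) -> bool:
    """Return True if `level` is listed in Table 1 for `fault_type`."""
    if fault_type not in SUPPORTED_FAULT_LEVELS:
        raise ValueError(f"unknown fault type: {fault_type!r}")
    return level in SUPPORTED_FAULT_LEVELS[fault_type]
```

For example, `is_supported_level("node", "RestartNPU")` returns `False`, because RestartNPU is listed only for processor faults.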

The following table describes the fault handling policy for each fault level.

Table 2 Fault severity and handling policies

Policy

Description

Rescheduling

Graceful Fault Tolerance

NotHandleFault

If a fault does not affect services, no action is required.

No action is required.

No action is required.

RestartRequest

If service execution is affected, the service request needs to be executed again.

Isolate the processor and reschedule the job.

NOTE:

When an inference job subscribes to fault information, a RestartRequest fault on its inference card will not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled.

Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario.

RestartBusiness

If service execution is affected, the service needs to be executed again.

Re-execute the service.

FreeRestartNPU

If service execution is affected, the processor needs to be reset when it is idle.

Reset the processor when it is idle.

RestartNPU

If service execution is affected, the processor needs to be reset immediately.

Stop the training service immediately, reset the processor, and re-execute the service.

SeparateNPU

The fault cannot be rectified, and the processor needs to be isolated.

Isolate the processor and reschedule the job.

SeparateFault

The job will be affected.

NOTE:

If a UnifiedBus interconnect device fault is at the SeparateFault level, the service fails to run. In this case, replace the component or board.

Job rescheduling is triggered.

NOTE:

If the UnifiedBus interconnect device is faulty, this fault handling policy stops the current training job, isolates the node, and reschedules the job.

--

RestartRequestFault

If the service fails to run, the service request needs to be executed again.

Stop the current training job, isolate the node, and reschedule the job.

Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario.

ResetFault

The service fails to run.

Stop the current training job, isolate the node, and reschedule the job.

--

PreSeparateNPU

Services are not affected for the time being. No job will be scheduled to the processor.

Pre-isolate the processor.

Pre-isolate the processor.

PreSeparateFault

The job may be affected.

If the job is running on the node, the system does not rectify the fault and does not schedule other jobs to the node.

--

SubHealthFault

See subHealthyStrategy in the job YAML file (Table 1).

If the processor is subhealthy, rectify the fault based on the YAML configuration.

NOTE:

If a fault of another severity level occurs on the processor, SubHealthFault does not affect the handling of that fault.

Perform operations based on the policy.
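
The rescheduling behavior described for RestartRequest in Table 2 depends on whether an inference job subscribes to fault information and on how long the fault lasts. The following is a minimal sketch of that decision only, under stated assumptions: the function name, parameter names, and returned strings are hypothetical, and the 60-second threshold is taken from the NOTE under RestartRequest.

```python
# Threshold from the RestartRequest NOTE in Table 2.
RESTART_REQUEST_GRACE_SECONDS = 60


def restart_request_action(fault_duration_seconds: float,
                           subscribed_inference_job: bool) -> str:
    """Return the rescheduling action for a RestartRequest fault.

    The return values are illustrative labels, not product identifiers.
    """
    # A subscribed inference job is not rescheduled while the fault
    # has lasted 60 seconds or less.
    if (subscribed_inference_job
            and fault_duration_seconds <= RESTART_REQUEST_GRACE_SECONDS):
        return "no_rescheduling"
    # Otherwise the processor is isolated and the job is rescheduled.
    return "isolate_and_reschedule"
```

With this sketch, a subscribed inference job whose fault has lasted exactly 60 seconds is not rescheduled, while any longer duration leads to isolation and rescheduling.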