Custom Interconnect Device Faults

The resumable training function processes interconnect device faults by level. For details about how to change the fault level of a fault code, see (Optional) Configuring Fault Levels of Interconnect Devices.

After obtaining the event ID of a fault from the driver, Ascend Device Plugin classifies the fault into one of the following five levels, based on the fault's impact on devices and services, and uses that level to decide whether to reschedule. For details, see Table 1.

Table 1 Fault levels and handling suggestions

| Fault Level | Description | Rescheduling |
| --- | --- | --- |
| NotHandleFault | Services are not affected. The fault can be automatically rectified. No action is required. | No handling is required. |
| SubHealthFault | The service performance is affected. The cause of a subhealth fault needs to be checked. | When a subhealth fault occurs, rectify the fault based on the subhealth policy specified by subHealthyStrategy in Table 1. |
| RestartRequestFault | If the service fails to run, the service request needs to be executed again. | Stop the current training job, isolate the node, and reschedule the job. |
| ResetFault | The service fails to run. | Stop the current training job, isolate the node, and reschedule the job. |
| SeparateFault | The service fails to run, and the related component or board needs to be replaced. | Stop the current training job, isolate the node, and reschedule the job. |
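
The level-to-action mapping in Table 1 can be sketched as a small lookup. This is an illustrative Python sketch only, not the Ascend Device Plugin implementation: the names FaultLevel, Action, ACTION_BY_LEVEL, action_for_level, and triggers_rescheduling are hypothetical, and the sketch assumes that the rescheduling action listed for RestartRequestFault also applies to ResetFault and SeparateFault. The concrete values of subHealthyStrategy are not modeled here.

```python
from enum import Enum


class FaultLevel(Enum):
    """The five fault levels listed in Table 1."""
    NOT_HANDLE = "NotHandleFault"
    SUB_HEALTH = "SubHealthFault"
    RESTART_REQUEST = "RestartRequestFault"
    RESET = "ResetFault"
    SEPARATE = "SeparateFault"


class Action(Enum):
    """Handling outcomes from the Rescheduling column (member names are hypothetical)."""
    NONE = "No handling is required."
    APPLY_SUBHEALTH_POLICY = "Handle according to subHealthyStrategy."
    RESCHEDULE = "Stop the training job, isolate the node, and reschedule the job."


ACTION_BY_LEVEL = {
    FaultLevel.NOT_HANDLE: Action.NONE,
    FaultLevel.SUB_HEALTH: Action.APPLY_SUBHEALTH_POLICY,
    FaultLevel.RESTART_REQUEST: Action.RESCHEDULE,
    # Assumption: these two levels share the RestartRequestFault action.
    FaultLevel.RESET: Action.RESCHEDULE,
    FaultLevel.SEPARATE: Action.RESCHEDULE,
}


def action_for_level(level_name: str) -> Action:
    """Return the handling action for a level name such as 'ResetFault'.

    Raises ValueError if the name is not one of the five levels.
    """
    return ACTION_BY_LEVEL[FaultLevel(level_name)]


def triggers_rescheduling(level_name: str) -> bool:
    """True if the level leads to job stop, node isolation, and rescheduling."""
    return action_for_level(level_name) is Action.RESCHEDULE
```

For example, triggers_rescheduling("SeparateFault") is true under the assumption above, while "SubHealthFault" is deferred to whatever policy subHealthyStrategy specifies.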