Custom Interconnect Device Faults
The resumable training function processes interconnect device faults by level. For details about how to change the fault level of a fault code, see (Optional) Configuring Fault Levels of Interconnect Devices.
After obtaining the event ID of a fault from the driver, Ascend Device Plugin classifies the fault into one of the following five levels, based on the fault's impact on devices and services, and performs rescheduling accordingly. For details, see Table 1.
| Fault Type | Description | Rescheduling |
|---|---|---|
| NotHandleFault | Services are not affected. The fault can be automatically rectified. No action is required. | No handling is required. |
| SubHealthFault | The service performance is affected. The cause of a subhealth fault needs to be checked. | When a subhealth fault occurs, rectify the fault based on the subhealth policy specified by subHealthyStrategy in Table 1. |
| RestartRequestFault | If the service fails to run, the service request needs to be executed again. | Stop the current training job, isolate the node, and reschedule the job. |
| ResetFault | The service fails to run. | |
| SeparateFault | The service fails to run, and the related component or board needs to be replaced. | |
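
The level-to-handling mapping in Table 1 can be illustrated with a small lookup. The following Python sketch is only an illustration and is not part of Ascend Device Plugin: the names `FaultLevel`, `RESCHEDULING`, and `rescheduling_for` are hypothetical, and the entries for ResetFault and SeparateFault are left as `None` because Table 1 gives no Rescheduling text in those rows.

```python
from enum import Enum
from typing import Optional


class FaultLevel(Enum):
    """The five fault levels listed in Table 1 (hypothetical enum for illustration)."""
    NOT_HANDLE = "NotHandleFault"
    SUB_HEALTH = "SubHealthFault"
    RESTART_REQUEST = "RestartRequestFault"
    RESET = "ResetFault"
    SEPARATE = "SeparateFault"


# Rescheduling behaviour taken from Table 1.
# None marks rows whose Rescheduling cell is empty in the table.
RESCHEDULING = {
    FaultLevel.NOT_HANDLE: "No handling is required.",
    FaultLevel.SUB_HEALTH: "Apply the subhealth policy specified by subHealthyStrategy.",
    FaultLevel.RESTART_REQUEST: (
        "Stop the current training job, isolate the node, and reschedule the job."
    ),
    FaultLevel.RESET: None,
    FaultLevel.SEPARATE: None,
}


def rescheduling_for(level_name: str) -> Optional[str]:
    """Return the Table 1 rescheduling text for a fault level name.

    Raises ValueError if the name is not one of the five levels.
    """
    level = FaultLevel(level_name)
    return RESCHEDULING[level]


if __name__ == "__main__":
    for level in FaultLevel:
        print(f"{level.value}: {rescheduling_for(level.value)}")
```

Keying the lookup on the level name rather than on individual fault codes mirrors the description above: a fault code is first assigned a level, and the level alone determines the handling.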