Configuration File Description
Ascend Device Plugin obtains the fault codes of interconnect devices from SwitchFaultCode.json and handles each fault according to its configured fault level. SwitchFaultCode.json is a system configuration file. Do not modify it unless necessary. To change the fault level of a fault code, modify the mindx-dl-fault-config file, which is created from faultCode.json and SwitchFaultCode.json.
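The following minimal Python sketch illustrates the idea of looking up a fault level by fault code and of changing the level of one code. The JSON layout (level names mapped to lists of codes), the key names, and the codes `0x0001` to `0x0003` are hypothetical placeholders. The actual schema of SwitchFaultCode.json may differ.

```python
import json

# Hypothetical layout: {"<level name>": ["<fault code>", ...]}.
# This is an assumption for illustration, not the real file schema.
EXAMPLE_CONFIG = """
{
  "NotHandleFault": ["0x0001"],
  "SubHealthFault": ["0x0002"],
  "SeparateFault": ["0x0003"]
}
"""


def index_by_code(raw_json):
    """Invert {level: [codes]} into a {code: level} lookup table."""
    table = {}
    for level, codes in json.loads(raw_json).items():
        for code in codes:
            table[code] = level
    return table


def move_code(raw_json, code, new_level):
    """Return a new JSON string with `code` moved to `new_level`."""
    config = json.loads(raw_json)
    for codes in config.values():
        if code in codes:
            codes.remove(code)
    config.setdefault(new_level, []).append(code)
    return json.dumps(config, indent=2)


levels = index_by_code(EXAMPLE_CONFIG)
updated = index_by_code(move_code(EXAMPLE_CONFIG, "0x0001", "SubHealthFault"))
print(levels["0x0001"], "->", updated["0x0001"])
```

In this sketch, changing a fault level amounts to moving a code from one list to another. Whether the real configuration is edited the same way depends on its actual schema.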
The interconnect devices are only used in
Fault Levels in SwitchFaultCode.json
Resumable training handles interconnect device faults by level. For details about how to change the fault level of a fault code, see (Optional) Configuring Fault Levels of Interconnect Devices.
After obtaining the fault codes from the drivers, Ascend Device Plugin classifies the faults into the following levels based on their impact on devices and services, and performs rescheduling accordingly. For details, see Table 1.
| Fault Type | Description | Rescheduling |
|---|---|---|
| NotHandleFault | Services are not affected. The fault can be automatically rectified. No action is required. | No handling is required. |
| SubHealthFault | See subHealthyStrategy in the job YAML file (Table 1). | If a subhealth fault occurs on the processor, rectify the fault based on Job YAML Configuration Example. NOTE: If a fault of another level occurs on the processor, this policy does not affect the handling of the fault. |
| RestartRequestFault | If the service fails to run, the service request needs to be executed again. | Stop the current training job, isolate the node, and reschedule the job. |
| ResetFault | The service fails to run. | |
| SeparateFault | The service fails to run, and the related component or board needs to be replaced. | |
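
As an illustration only, the fault levels above can be modeled as a lookup from level to rescheduling action. This Python sketch is not the Ascend Device Plugin implementation. The names `FaultLevel` and `rescheduling_action` and the strategy value `example-strategy` are hypothetical. Levels whose Rescheduling cell is blank in the table are reported as unspecified rather than guessed.

```python
from enum import Enum


class FaultLevel(Enum):
    NOT_HANDLE = "NotHandleFault"
    SUB_HEALTH = "SubHealthFault"
    RESTART_REQUEST = "RestartRequestFault"
    RESET = "ResetFault"
    SEPARATE = "SeparateFault"


# Only the actions that the table states explicitly are listed here.
RESCHEDULING = {
    FaultLevel.NOT_HANDLE: "No handling is required.",
    FaultLevel.RESTART_REQUEST: (
        "Stop the current training job, isolate the node, "
        "and reschedule the job."
    ),
}


def rescheduling_action(level, sub_healthy_strategy):
    """Return the action for one fault level.

    sub_healthy_strategy stands for the subHealthyStrategy value from the
    job YAML file. It is consulted only for SubHealthFault, so it does not
    change how faults of other levels are handled.
    """
    if level is FaultLevel.SUB_HEALTH:
        return f"Apply subHealthyStrategy={sub_healthy_strategy}."
    return RESCHEDULING.get(level, "Not specified in the table.")


# "example-strategy" is a placeholder, not a documented value.
for level in FaultLevel:
    print(level.value, "->", rescheduling_action(level, "example-strategy"))
```

Passing the strategy only into the SubHealthFault branch mirrors the note in the table: the subhealth policy leaves the handling of every other level unchanged.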