Customized Processor Faults
Fault Levels in faultCode.json
Processor faults are handled based on their levels for resumable training. For details about how to change the level of a fault code, see (Optional) Configuring Processor Fault Levels.
After obtaining the processor fault codes from drivers, Ascend Device Plugin classifies the faults into the following levels based on their impacts on devices and services. For details, see Table 1.
| Fault Handling Policy | Description | Rescheduling | Graceful Fault Tolerance |
|---|---|---|---|
| NotHandleFault | If the fault does not affect services, no action is required. | No action is required. | No action is required. |
| RestartRequest | If service execution is affected, the service request needs to be executed again. | Isolate the corresponding processor, and reschedule the related job. NOTE: When an inference job subscribes to fault information, a RestartRequest fault on its inference card will not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled. | Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario. |
| RestartBusiness | If service execution is affected, the service needs to be executed again. | Re-execute the services. | |
| FreeRestartNPU | If service execution is affected, the processor needs to be reset when it is idle. | Reset the processor after it becomes idle. | |
| RestartNPU | If service execution is affected, the processor needs to be reset immediately. | Stop the training service immediately, reset the processor, and re-execute the service. | |
| SeparateNPU | The fault cannot be rectified, and the processor needs to be isolated. | Isolate the corresponding processor, and reschedule the related job. | |
| PreSeparateNPU | Services are not affected for the time being. No job will be scheduled to the processor. | Pre-isolate the processors. | Pre-isolate the processors. |
| SubHealthFault | See subHealthyStrategy in the job YAML file (Table 1). | If the processor is subhealthy, rectify the fault based on Job YAML Configuration Example. NOTE: If a fault of another level occurs on the processor, this policy does not affect the handling of the fault. | Perform operations based on the policy. |
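The 60-second rule in the RestartRequest note can be illustrated with a minimal Python sketch. The function and constant names are hypothetical and are not part of Ascend Device Plugin; only the threshold and the resulting actions come from the table.

```python
# Threshold taken from the RestartRequest note: faults lasting 60 seconds or
# less do not trigger rescheduling for inference jobs that subscribe to
# fault information.
RESTART_REQUEST_GRACE_SECONDS = 60


def restart_request_triggers_rescheduling(
    job_subscribes_to_faults: bool, fault_seconds: float
) -> bool:
    """Return True if the processor should be isolated and the job rescheduled.

    Hypothetical helper: models only the RestartRequest row of the table.
    """
    if job_subscribes_to_faults and fault_seconds <= RESTART_REQUEST_GRACE_SECONDS:
        # Short-lived fault on a subscribing inference job: no rescheduling.
        return False
    # Otherwise the processor is isolated and the job is rescheduled.
    return True
```

For example, a subscribing inference job with a 45-second fault is left in place, while the same job with a 61-second fault is rescheduled.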
Parameters in faultCustomization.json
If you do not manually modify faultCustomization.json, Ascend Device Plugin rectifies faults based on the default values set in faultCustomization.json.
| Level-1 Parameter | Level-2 Parameter | Description |
|---|---|---|
| GraceTolerance | - | Graceful fault tolerance configuration. NOTE: If GraceTolerance or its sub-parameters do not exist or are outside the value ranges, the default values are used. |
| - | WaitProcessReadCMTime | Duration for waiting for the management process to read the ConfigMap file when the graceful fault tolerance mode is used. The value ranges from 5 to 90, in seconds. The default value is 30. |
| - | WaitDeviceResetTime | Maximum duration for waiting for the processor to restart when the graceful fault tolerance mode is used. The value ranges from 60 to 180, in seconds. The default value is 150. |
| - | WaitFaultSelfHealingTime | Duration for waiting for a RestartBusiness-level fault to recover when the graceful fault tolerance mode is used. The value ranges from 1 to 30, in seconds. The default value is 15. |
| FaultFrequency | - | User-defined fault frequency. When the number of occurrences of a fault in the time window reaches the upper limit, the fault is handled based on the configured fault handling policy. |
| - | EventId | Fault code ID. NOTE: Only one FaultFrequency parameter can be configured for each fault code (EventId). If multiple FaultFrequency parameters are configured, only the first correct one takes effect. |
| - | TimeWindow | Time window. Fault occurrences are counted over the period from (current time - TimeWindow) to the current time. The value ranges from 60 to 864000, in seconds. |
| - | Times | Maximum number of resumable training times supported by a job, that is, the maximum number of times that the fault may occur. The value ranges from 1 to 100. If the number of occurrences of the fault in the time window is greater than or equal to the value of this parameter, the fault is processed and reported according to the policy defined in FaultHandling. |
| - | FaultHandling | Fault handling policy applied after the maximum number of resumable training times is reached. Fault handling policies of different levels can be configured. In addition, PreSeparateNPU and ManuallySeparateNPU are also supported. |
| FaultDuration | - | User-defined fault timeout policy. When the duration of a fault reaches the upper limit, the fault is handled based on the specified fault handling policy. |
| - | EventId | Fault code ID. NOTE: Only one FaultDuration parameter can be configured for each fault code (EventId). If multiple FaultDuration parameters are configured, only the first correct one takes effect. |
| - | FaultTimeout | If the fault duration exceeds the value of this parameter, the fault is handled based on the fault handling policy defined in FaultHandling. The value ranges from 0 to 600, in seconds. The default values are as follows: |
| - | RecoverTimeout | If the fault recovery time exceeds the value of this parameter, a fault recovery message is reported. The value ranges from 0 to 86400, in seconds. The default values are as follows: |
| - | FaultHandling | Fault handling policy applied after the fault duration expires. Fault handling policies of different levels can be configured. PreSeparateNPU is also supported. NOTE: It is recommended that the fault handling policy after the fault duration expires be set to a higher level than the original fault handling policy. Otherwise, the configuration does not take effect. |
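The fallback behavior described for GraceTolerance can be sketched in Python as follows. This is only an illustration: the function name, the in-memory dict layout, and the choice to fall back per sub-parameter are assumptions, while the ranges and defaults are taken from the parameter descriptions.

```python
# (minimum, maximum, default) in seconds, as listed for each sub-parameter.
GRACE_TOLERANCE_SPEC = {
    "WaitProcessReadCMTime": (5, 90, 30),
    "WaitDeviceResetTime": (60, 180, 150),
    "WaitFaultSelfHealingTime": (1, 30, 15),
}


def resolve_grace_tolerance(config: dict) -> dict:
    """Return effective GraceTolerance values, using defaults where needed.

    Hypothetical helper: assumes the configuration was already parsed into a
    dict with an optional "GraceTolerance" mapping.
    """
    section = config.get("GraceTolerance")
    if not isinstance(section, dict):
        section = {}  # missing or malformed section: every value defaults

    resolved = {}
    for name, (low, high, default) in GRACE_TOLERANCE_SPEC.items():
        value = section.get(name)
        is_valid = (
            isinstance(value, int)
            and not isinstance(value, bool)
            and low <= value <= high
        )
        # Assumption: an invalid value only resets that one sub-parameter.
        resolved[name] = value if is_valid else default
    return resolved
```

With this sketch, a configuration that sets WaitDeviceResetTime to 500 keeps its other valid settings and resolves WaitDeviceResetTime to 150.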
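The FaultFrequency rule (count occurrences of one fault code within TimeWindow and apply FaultHandling once the count reaches Times) can be modeled as a sliding-window counter. The class below is a hypothetical sketch, not the plugin's implementation; only the parameter names, value ranges, and the "greater than or equal to" comparison come from the table.

```python
from collections import deque
from typing import Optional


class FaultFrequencyCounter:
    """Sliding-window occurrence counter for a single fault code (EventId)."""

    def __init__(self, time_window: int, times: int, fault_handling: str):
        # Value ranges as documented for TimeWindow and Times.
        if not 60 <= time_window <= 864000:
            raise ValueError("TimeWindow must be in [60, 864000] seconds")
        if not 1 <= times <= 100:
            raise ValueError("Times must be in [1, 100]")
        self.time_window = time_window
        self.times = times
        self.fault_handling = fault_handling
        self._timestamps = deque()

    def record(self, now: float) -> Optional[str]:
        """Record one occurrence at `now` (seconds).

        Returns the FaultHandling policy if the count in the window has
        reached Times, otherwise None.
        """
        self._timestamps.append(now)
        # Keep only occurrences in [now - TimeWindow, now].
        cutoff = now - self.time_window
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.times:
            return self.fault_handling
        return None
```

A counter created with `FaultFrequencyCounter(60, 3, "ManuallySeparateNPU")` returns `None` for the first two occurrences inside one minute and the configured policy on the third.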