Customized Processor Faults

Fault Levels in faultCode.json

Processor faults are handled according to their levels during resumable training. For details about how to change the level of a fault code, see (Optional) Configuring Processor Fault Levels.

After obtaining the processor fault codes from drivers, Ascend Device Plugin classifies the faults into the following levels based on their impacts on devices and services. For details, see Table 1.

Table 1 Fault levels and handling suggestions

Fault Handling Policy

Description

Rescheduling

Graceful Fault Tolerance

NotHandleFault

If the fault does not affect services, no action is required.

No action is required.

No action is required.

RestartRequest

If service execution is affected, the service request needs to be executed again.

Isolate the corresponding processor, and reschedule the related job.

NOTE:

When an inference job subscribes to fault information, a RestartRequest fault on its inference card will not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled.

Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario.

RestartBusiness

If service execution is affected, the service needs to be executed again.

Re-execute the services.

FreeRestartNPU

If service execution is affected, the processor needs to be reset when it is idle.

Reset the processor after it becomes idle.

RestartNPU

If service execution is affected, the processor needs to be reset immediately.

Stop the training service immediately, reset the processor, and re-execute the service.

SeparateNPU

The fault cannot be rectified, and the processor needs to be isolated.

Isolate the corresponding processor, and reschedule the related job.

PreSeparateNPU

Services are not affected for the time being. No job will be scheduled to the processor.

Pre-isolate the processors.

Pre-isolate the processors.

SubHealthFault

See subHealthyStrategy in the job YAML file (Table 1).

If the processor is subhealthy, rectify the fault based on Job YAML Configuration Example.

NOTE:

If a fault of another level occurs on the processor, this policy does not affect the handling of the fault.

Perform operations based on the policy.
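
The 60-second rule in the RestartRequest NOTE in Table 1 can be sketched as follows. This is only an illustration of the decision described there: the function name, the constant name, and the subscribed_to_faults flag are hypothetical and are not part of Ascend Device Plugin. The branch for jobs without a subscription assumes that the standard RestartRequest rescheduling in Table 1 applies.

```python
# Illustrative sketch only. The names below are hypothetical, not plugin APIs.

RESTART_REQUEST_GRACE_SECONDS = 60  # threshold taken from the NOTE in Table 1


def should_reschedule_inference_job(fault_duration_s: float,
                                    subscribed_to_faults: bool) -> bool:
    """Return True if the processor should be isolated and the job rescheduled."""
    if not subscribed_to_faults:
        # Assumption: without a fault subscription, the standard
        # RestartRequest rescheduling behavior from Table 1 applies.
        return True
    # A fault lasting 60 seconds or less does not trigger rescheduling;
    # a fault lasting longer than 60 seconds does.
    return fault_duration_s > RESTART_REQUEST_GRACE_SECONDS
```

For example, a subscribed inference job with a fault lasting exactly 60 seconds is not rescheduled, while one lasting 61 seconds is.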

Parameters in faultCustomization.json

If you do not modify faultCustomization.json manually, Ascend Device Plugin handles faults based on the default values set in the file.
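
The default-value fallback can be illustrated with the GraceTolerance ranges and defaults listed in Table 2. The sketch below is not the plugin's implementation. It assumes that the file has already been parsed into a dictionary with a top-level GraceTolerance object keyed by the sub-parameter names, and that each sub-parameter falls back to its default independently. Both points are assumptions, and resolve_grace_tolerance is a hypothetical helper.

```python
# Illustrative sketch only. The layout of the parsed configuration and the
# per-parameter fallback are assumptions, not confirmed plugin behavior.

# (minimum, maximum, default) in seconds, as listed in Table 2.
GRACE_TOLERANCE_LIMITS = {
    "WaitProcessReadCMTime": (5, 90, 30),
    "WaitDeviceResetTime": (60, 180, 150),
    "WaitFaultSelfHealingTime": (1, 30, 15),
}


def resolve_grace_tolerance(config: dict) -> dict:
    """Return effective GraceTolerance values, using defaults where needed."""
    section = config.get("GraceTolerance")
    if not isinstance(section, dict):
        section = {}  # a missing or malformed section means all defaults
    resolved = {}
    for name, (low, high, default) in GRACE_TOLERANCE_LIMITS.items():
        value = section.get(name)
        is_valid = (isinstance(value, int)
                    and not isinstance(value, bool)
                    and low <= value <= high)
        resolved[name] = value if is_valid else default
    return resolved
```

With an empty configuration, the function returns 30, 150, and 15 for the three parameters. A WaitProcessReadCMTime of 200 is out of range and is replaced by 30.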

Table 2 Parameters in faultCustomization.json

Level-1 Parameter

Level-2 Parameter

Description

GraceTolerance

-

Graceful fault tolerance configuration.

NOTE:

If GraceTolerance or any of its sub-parameters is missing or its value is out of range, the default value is used.

-

WaitProcessReadCMTime

Duration for waiting for the management process to read the ConfigMap file when the graceful fault tolerance mode is used. The value ranges from 5 to 90, in seconds. The default value is 30.

-

WaitDeviceResetTime

Maximum duration for waiting for the processor to restart when the graceful fault tolerance mode is used. The value ranges from 60 to 180, in seconds. The default value is 150.

-

WaitFaultSelfHealingTime

Duration for waiting for a RestartBusiness-level fault to recover when the graceful fault tolerance mode is used. The value ranges from 1 to 30, in seconds. The default value is 15.

FaultFrequency

-

User-defined fault frequency. That is, when the number of occurrences of a fault in the time window reaches the upper limit, the fault is handled based on the configured fault handling policy.

NOTE:
  • If a value of FaultFrequency or its sub-parameters is out of range, the configuration is ignored.
  • If the data format of FaultFrequency or its sub-parameters is incorrect, the default configuration is used.

-

EventId

Fault code ID (event ID).

NOTE:

Only one FaultFrequency parameter can be configured for each fault code (EventId). If multiple FaultFrequency parameters are configured, only the first correct one takes effect.

-

TimeWindow

Time window over which fault occurrences are counted, that is, the period from (current time - TimeWindow) to the current time. The value ranges from 60 to 864000, in seconds.

-

Times

Maximum number of resumable training attempts supported by a job, that is, the maximum number of times that the fault may occur in the time window. The value ranges from 1 to 100. If the number of occurrences of the fault within the time window is greater than or equal to the value of this parameter, the fault is handled and reported according to the policy defined in FaultHandling.

-

FaultHandling

Fault handling policy applied after the maximum number of resumable training attempts is reached. Fault handling policies of different levels can be configured. PreSeparateNPU and ManuallySeparateNPU are also supported.

NOTE:
  • PreSeparateNPU: fault handling policy of foundation models. It pre-isolates processors and determines whether to perform rescheduling based on the actual running status of a training job.
  • ManuallySeparateNPU: fault handling policy that requires manual intervention.
    • If such a policy is used, a message indicating that the processor is unhealthy is reported to Kubernetes, and the processor name is written into device-info-cm.
    • As long as the processor name is saved in this field, the processor is still isolated even if the fault is rectified, until the O&M personnel manually delete the name from this field.
    • This field can be added or modified only by Ascend Device Plugin. O&M personnel can only delete the processor name in it.
    • faultCode.json does not support this policy.

FaultDuration

-

User-defined fault timeout policy. When the duration of a fault reaches the upper limit, the fault is handled based on the specified fault handling policy.

NOTE:
  • If a value of FaultDuration or its sub-parameters is out of range, the configuration is ignored.
  • If the data format of FaultDuration or its sub-parameters is incorrect, the default configuration is used.

-

EventId

Fault code ID.

NOTE:

Only one FaultDuration parameter can be configured for each fault code (EventId). If multiple FaultDuration parameters are configured, only the first correct one takes effect.

-

FaultTimeout

If the fault duration exceeds the value of this parameter, the fault is handled based on the fault handling policy defined in FaultHandling. The value ranges from 0 to 600, in seconds. The default values are as follows:
  • The default value is 20 for the parameter plane network fault whose ID is 81078603.
  • The default value is 30 for the on-chip memory double-bit fault whose ID is 80E01801.
  • The default value is 0 for other faults.

-

RecoverTimeout

If the fault recovery time exceeds the value of this parameter, a fault recovery message is reported. The value ranges from 0 to 86400, in seconds. The default values are as follows:
  • The default value is 60 for the parameter plane network fault whose ID is 81078603. It is not recommended to set this parameter to 0. Ensure that its value is greater than listWatchPeriod. For details about listWatchPeriod, see Table 3.
  • The default value is 0 for other faults.

-

FaultHandling

Fault handling policy applied after the fault duration exceeds FaultTimeout. Fault handling policies of different levels can be configured. PreSeparateNPU is also supported.

NOTE:

Set this policy to a higher level than the fault's original handling policy. Otherwise, the configuration does not take effect.
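
The FaultFrequency rule in Table 2 (count occurrences within TimeWindow and act once the count reaches Times) can be sketched as a sliding-window counter. This is a minimal illustration, not the plugin's implementation. FaultFrequencyCounter and its method are hypothetical names, timestamps are plain seconds, and raising ValueError for out-of-range values is a choice made here so that the caller can ignore the configuration, as the table describes.

```python
# Illustrative sketch only. Class and method names are hypothetical.
from collections import deque


class FaultFrequencyCounter:
    """Counts occurrences of one fault code inside a sliding time window."""

    def __init__(self, time_window_s: int, times: int):
        # Value ranges as listed in Table 2.
        if not 60 <= time_window_s <= 864000:
            raise ValueError("TimeWindow must be between 60 and 864000 seconds")
        if not 1 <= times <= 100:
            raise ValueError("Times must be between 1 and 100")
        self.time_window_s = time_window_s
        self.times = times
        self._timestamps = deque()

    def record(self, now_s: float) -> bool:
        """Record one occurrence at now_s.

        Returns True when the number of occurrences in the window
        [now_s - TimeWindow, now_s] is greater than or equal to Times,
        which is the point at which FaultHandling would apply.
        """
        self._timestamps.append(now_s)
        cutoff = now_s - self.time_window_s
        # Drop occurrences that fall before the start of the window.
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.times
```

With TimeWindow set to 60 and Times set to 3, occurrences at 0, 10, and 20 seconds reach the threshold on the third call. Occurrences at 0, 100, and 130 seconds do not, because the first one has left the window by then.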

NOTE:

  • If both FaultFrequency and FaultDuration are configured for a fault code and the number of times the fault times out within the time window reaches the maximum (Times), the fault is handled based on the most severe of the following policies: the fault's own handling policy, the policy configured in FaultFrequency, and the policy configured in FaultDuration.
  • If both FaultFrequency and FaultDuration are configured for a fault code, the fault frequency increases by one only after the fault times out.
  • For the network fault whose ID is 81078603, the fault handling policy can only be set to NotHandleFault, PreSeparateNPU, or SeparateNPU. If other policies are configured, the default configuration NotHandleFault is used.
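
Two of the rules above can be sketched in a few lines: the restriction on policies for fault 81078603, and the selection of the most severe policy among several candidates. This is a hedged illustration rather than the plugin's logic. Both function names are hypothetical, and because this section does not state the severity order of the policies, the ranking is passed in by the caller instead of being hard-coded.

```python
# Illustrative sketch only. Function names are hypothetical, and the
# severity ranking is supplied by the caller because the order is not
# stated in this section.

NETWORK_FAULT_ID = "81078603"
NETWORK_FAULT_ALLOWED = {"NotHandleFault", "PreSeparateNPU", "SeparateNPU"}


def effective_policy(event_id: str, configured_policy: str) -> str:
    """Apply the restriction on policies for the network fault."""
    if event_id == NETWORK_FAULT_ID and configured_policy not in NETWORK_FAULT_ALLOWED:
        # Any other policy falls back to the default, NotHandleFault.
        return "NotHandleFault"
    return configured_policy


def most_severe(policies: list, severity_rank: dict) -> str:
    """Return the candidate policy with the highest caller-supplied rank."""
    return max(policies, key=lambda policy: severity_rank[policy])
```

For example, effective_policy("81078603", "RestartNPU") yields "NotHandleFault", while the same policy configured for another fault code is returned unchanged.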