Configuration File Description

Resumable training handles processor faults according to the fault level, frequency, and duration.

  • For processor faults of different levels, Ascend Device Plugin looks up the fault code of the current fault in faultCode.json and handles the fault according to the configured fault level.
  • For processor faults of different frequencies and durations, Ascend Device Plugin looks up the fault code of the current fault in faultCustomization.json and handles the fault according to the configured fault frequency and duration.

faultCode.json and faultCustomization.json are system configuration files. Do not modify them unless necessary. If you need to change the fault level of a fault code, modify the mindx-dl-fault-config file, which is created from faultCode.json and faultCustomization.json.

Fault Levels in faultCode.json

Resumable training handles processor faults by level. If you want to change the fault level of a fault code, see (Optional) Configuring Processor Fault Levels.

After obtaining the processor fault codes from the driver, Ascend Device Plugin classifies the faults into the following eight levels based on their impacts on devices and services. For details, see Table 1.

Table 1 Fault handling policy description

Fault Handling Policy | Description | Rescheduling | Graceful Fault Tolerance

NotHandleFault | Faults have no impact on services and do not require handling. | No handling is required. | No handling is required.

RestartRequest | Faults affect service execution. Corresponding service requests need to be re-executed. | Isolate the corresponding processor, and reschedule the related job. | Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario.
    NOTE (Rescheduling): When an inference job subscribes to fault information, a RestartRequest fault on its inference card does not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled.

RestartBusiness | Faults affect service execution. Corresponding services need to be re-executed. | Isolate the corresponding processor, and reschedule the related job. | Re-execute the services.

FreeRestartNPU | Faults affect service execution. Corresponding processors need to be reset when idle. | Isolate the corresponding processor, and reschedule the related job. | Reset the processors after they become idle.

RestartNPU | Faults affect service execution. Corresponding processors need to be reset immediately. | Isolate the corresponding processor, and reschedule the related job. | Stop the training service immediately, reset the processors, and re-execute the service.

SeparateNPU | Faults cannot be rectified. Corresponding processors need to be isolated. | Isolate the corresponding processor, and reschedule the related job. | Isolate the corresponding processor, and reschedule the related job.

PreSeparateNPU | Services are not affected temporarily. Jobs will not be scheduled to the processor. | Pre-isolate the processors. | Pre-isolate the processors.

SubHealthFault | See subHealthyStrategy in the job YAML file (Table 1). | If a subhealth fault occurs on the processor, rectify the fault based on YAML configurations. | Perform operations based on the policy.
    NOTE (Rescheduling): If a fault of another level occurs on the processor, this policy does not affect the handling of that fault.

  • Stop involved training processes before resetting processors. Otherwise, the reset will fail.
  • If Ascend Device Plugin receives, through subscription, a fault code that it does not recognize (one not saved in faultCode.json), it handles the fault based on the handling suggestion provided by the subscription interface. If the fault level reported by the subscription interface is info or minor, the fault is handled as NotHandleFault. For any other fault level, the fault is handled as SeparateNPU.
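
The fallback rule for unrecognized fault codes can be sketched as follows. This is a minimal illustration, not the Ascend Device Plugin implementation: the function name resolve_policy, the dictionary known_policies (standing in for the contents of faultCode.json), the fault code EXAMPLE01, and the level string "major" are all hypothetical. Only the level names info and minor and the two resulting policies come from the description above.

```python
from typing import Mapping

# Subscription levels that the text above maps to NotHandleFault.
LOW_SEVERITY_LEVELS = {"info", "minor"}


def resolve_policy(fault_code: str,
                   subscribed_level: str,
                   known_policies: Mapping[str, str]) -> str:
    """Return the handling policy for a fault code.

    known_policies is a hypothetical stand-in for the code-to-level
    mapping configured in faultCode.json.
    """
    if fault_code in known_policies:
        # Recognized code: use the configured fault level.
        return known_policies[fault_code]
    if subscribed_level in LOW_SEVERITY_LEVELS:
        # Unrecognized code reported as info or minor.
        return "NotHandleFault"
    # Unrecognized code with any other reported level.
    return "SeparateNPU"


# Hypothetical mapping; EXAMPLE01 is not a real fault code.
known = {"EXAMPLE01": "RestartNPU"}
print(resolve_policy("EXAMPLE01", "major", known))
print(resolve_policy("UNLISTED", "minor", known))
print(resolve_policy("UNLISTED", "major", known))
```

Treating every unrecognized, higher-level fault as SeparateNPU is the conservative choice described above: the processor is isolated rather than left in service with an unknown fault.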

Fault Frequencies and Durations

Resumable training also handles processor faults based on the fault frequency and duration. Some hardware faults can recur within a training job, so that the job is interrupted and rescheduled repeatedly. Having identified the fault codes of these faults, the cluster scheduling components provide the initial configuration file faultCustomization.json to escalate their fault levels.

Initial Configurations and Fault Types

The faultCustomization.json file provides only initial configurations for escalating the severity of identifiable hardware faults.

If the following fault occurs three times within 24 hours, its fault level is escalated to ManuallySeparateNPU so that it can be handled manually. For details, see Parameters in faultCustomization.json.

The following uses the fault HBMC Ca Parity (fault code: 80E18005) as an example of escalating the fault level to ManuallySeparateNPU:
  "FaultFrequency": [
    {
      "EventId": [
        "80C98000", "80B78000", "80B58000", "80A18008", "80A38008", "80A58008", "80B98000", "80B98008", "80BB8000",
        "80BB8008", "80BD8000", "80BD8008", "80C78008", "80C98008", "80CB8008", "80CD8008", "80CF8008", "80D98008",
        "80DF8008", "80DE1801", "80E01801", "80E18008", "80E38008", "80E39200", "80E3A202", "80E3A203", "80E78000",
        "80E78008", "80F18000", "80F18008", "80F38008", "80F78008", "81318008", "81338008", "813B8008", "81478008",
        "81578008", "815F8008", "81938008", "81958008", "81978008"
      ],
      "TimeWindow": 86400,
      "Times": 2,
      "FaultHandling": "ManuallySeparateNPU"
    },
    {
      "EventId": ["80E18005"],
      "TimeWindow": 86400,
      "Times": 3,
      "FaultHandling": "ManuallySeparateNPU"
    }
  ],
  • When ManuallySeparateNPU is used, the processor remains isolated even after the fault is rectified, and you must manually recover the forcibly isolated processor. For details, see Step 8.
  • In addition to identifiable hardware faults, the faultCustomization.json file contains the following types of faults:
    • Faults that do not need to be handled: Faults of this type do not affect training jobs or devices, and there is no initial configuration to escalate the fault level.
    • Faults of an uncertain type: It is hard to determine whether these faults are caused by hardware or software, yet they affect training jobs. There is no initial configuration to escalate the fault level. You are advised to manually configure the maximum number of resumable training times supported by a job, and the fault handling policy to apply when that maximum is reached. For details, see (Optional) Configuring Processor Fault Frequencies and Durations.
    • Software configuration faults: Faults of this type are uncommon, and there is no initial configuration to escalate the fault level. You are advised to check whether the software version is correct.
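
The frequency-based escalation configured by TimeWindow, Times, and FaultHandling can be sketched as a sliding-window counter. This is an illustrative sketch under stated assumptions, not the Ascend Device Plugin implementation: the class names FrequencyRule and FrequencyTracker and the method record are hypothetical, and timestamps are plain seconds supplied by the caller. The escalation condition (occurrences within the window greater than or equal to Times) follows the parameter descriptions in this section.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class FrequencyRule:
    """One FaultFrequency entry (hypothetical Python mirror of the JSON)."""
    time_window: int     # TimeWindow, in seconds
    times: int           # Times, the occurrence threshold
    fault_handling: str  # FaultHandling, the escalated policy


@dataclass
class FrequencyTracker:
    """Counts occurrences of one fault code inside a sliding time window."""
    rule: FrequencyRule
    stamps: Deque[float] = field(default_factory=deque, init=False)

    def record(self, now: float) -> Optional[str]:
        """Record one occurrence at time `now` (seconds).

        Returns the escalated policy once the number of occurrences in
        [now - time_window, now] reaches `times`, otherwise None.
        """
        self.stamps.append(now)
        cutoff = now - self.rule.time_window
        # Discard occurrences that have fallen out of the window.
        while self.stamps and self.stamps[0] < cutoff:
            self.stamps.popleft()
        if len(self.stamps) >= self.rule.times:
            return self.rule.fault_handling
        return None


# Values taken from the 80E18005 entry in the example above.
tracker = FrequencyTracker(FrequencyRule(86400, 3, "ManuallySeparateNPU"))
for t in (0, 3600, 7200):
    print(t, tracker.record(t))
```

With these values, the third occurrence within 24 hours returns ManuallySeparateNPU. If the same three occurrences were spread over more than 24 hours, the oldest would drop out of the window and no escalation would be returned.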

Parameters in faultCustomization.json

If you do not manually modify faultCustomization.json, Ascend Device Plugin handles faults based on the default values set in faultCustomization.json.

Table 2 Parameters in faultCustomization.json

Level-1 Parameter | Level-2 Parameter | Description

GraceTolerance | - | Graceful fault tolerance configuration.
    NOTE: If GraceTolerance or its sub-parameters do not exist or are outside the value ranges, the default values are used.

- | WaitProcessReadCMTime | Duration for waiting for the management process to read the ConfigMap file when the graceful fault tolerance mode is used. The value ranges from 5 to 90, in seconds. The default value is 30.

- | WaitDeviceResetTime | Maximum duration for waiting for the processor to restart when the graceful fault tolerance mode is used. The value ranges from 60 to 180, in seconds. The default value is 150.

- | WaitFaultSelfHealingTime | Duration for waiting for a RestartBusiness-level fault to recover when the graceful fault tolerance mode is used. The value ranges from 1 to 30, in seconds. The default value is 15.

FaultFrequency | - | User-defined fault frequency. When the number of occurrences of a fault in the time window reaches the upper limit, the fault is handled based on the configured fault handling policy.
    NOTE:
      • If a value of FaultFrequency or its sub-parameters is outside the value range, the configuration is ignored.
      • If the data format of FaultFrequency or its sub-parameters is incorrect, the default configuration is used.

- | EventId | Fault code.
    NOTE: Only one FaultFrequency parameter can be configured for each fault code (EventId). If multiple FaultFrequency parameters are configured, only the first correct one takes effect.

- | TimeWindow | Time window. Faults are counted over the period from (current time - TimeWindow) to the current time. The value ranges from 60 to 864,000, in seconds.

- | Times | Maximum number of resumable training times supported by a job, that is, the maximum number of times that a fault occurs. The value ranges from 1 to 100. If the number of occurrences of the fault within the time window is greater than or equal to the value of this parameter, the fault is processed and reported according to the policy defined in FaultHandling.

- | FaultHandling | Fault handling policy applied after the maximum number of resumable training times is reached. Fault handling policies of different levels can be configured.
    NOTE:
      • PreSeparateNPU: fault handling policy of foundation models. It pre-isolates processors and determines whether to perform rescheduling based on the actual running status of a training job.
      • ManuallySeparateNPU: fault handling policy that requires manual intervention.
        • If such a policy is used, a message is reported to Kubernetes indicating that the processor is unhealthy, and the processor name is written into device-info-cm.
        • As long as the processor name is saved in this field, the processor remains isolated even if the fault is rectified, until the O&M personnel manually delete the name from this field. For details, see Step 8.
        • This field can be added or modified only by Ascend Device Plugin. O&M personnel can only delete the processor name in it.
        • faultCode.json does not support this policy.

FaultDuration | - | User-defined fault timeout policy. When the duration of a fault reaches the upper limit, the fault is handled based on the specified fault handling policy.
    NOTE:
      • If a value of FaultDuration or its sub-parameters is outside the value range, the configuration is ignored.
      • If the data format of FaultDuration or its sub-parameters is incorrect, the default configuration is used.

- | EventId | Fault code ID.
    NOTE: Only one FaultDuration parameter can be configured for each fault code (EventId). If multiple FaultDuration parameters are configured, only the first correct one takes effect.

- | FaultTimeout | If the fault duration exceeds the value of this parameter, the fault is handled based on the policy defined in FaultHandling. The value ranges from 0 to 600, in seconds. The default values are as follows:
      • The default value is 20 for the parameter plane network fault whose ID is 81078603.
      • The default value is 0 for other faults.

- | RecoverTimeout | If the fault recovery time exceeds the value of this parameter, a fault recovery message is reported. The value ranges from 0 to 86400, in seconds. The default values are as follows:
      • The default value is 60 for the parameter plane network fault whose ID is 81078603.
      • The default value is 0 for other faults.

- | FaultHandling | Fault handling policy applied after the fault duration expires. Fault handling policies of different levels can be configured.
    NOTE: It is recommended that the fault handling policy after the fault duration expires be set to a higher level than the original policy. Otherwise, the configuration does not take effect.
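
The fallback to default values for the GraceTolerance parameters can be sketched as follows. This is a minimal sketch, not the Ascend Device Plugin implementation: the function name effective_grace_tolerance and the table GRACE_TOLERANCE_SPEC are hypothetical, and falling back parameter by parameter (rather than for the whole GraceTolerance block) is an assumption. The ranges and defaults are those listed in Table 2.

```python
# (minimum, maximum, default) in seconds, as listed in Table 2.
GRACE_TOLERANCE_SPEC = {
    "WaitProcessReadCMTime": (5, 90, 30),
    "WaitDeviceResetTime": (60, 180, 150),
    "WaitFaultSelfHealingTime": (1, 30, 15),
}


def effective_grace_tolerance(config: dict) -> dict:
    """Return the GraceTolerance values that would take effect.

    Assumption: each sub-parameter falls back to its own default when it
    is missing, not an integer, or outside its value range.
    """
    section = config.get("GraceTolerance")
    if not isinstance(section, dict):
        section = {}  # GraceTolerance missing or malformed: all defaults
    result = {}
    for name, (low, high, default) in GRACE_TOLERANCE_SPEC.items():
        value = section.get(name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        result[name] = value if is_int and low <= value <= high else default
    return result


# WaitDeviceResetTime is out of range and WaitFaultSelfHealingTime is
# missing, so both fall back to their defaults.
print(effective_grace_tolerance(
    {"GraceTolerance": {"WaitProcessReadCMTime": 10,
                        "WaitDeviceResetTime": 500}}
))
```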

Note

  • If both FaultFrequency and FaultDuration are configured for a fault code, and the number of times the fault times out within the time window reaches the configured maximum, the fault is handled based on the handling policy with the highest severity. The candidate policies are the fault's own handling policy and the fault handling policies configured in FaultFrequency and FaultDuration.
  • If both FaultFrequency and FaultDuration are configured for a fault code, the fault frequency increases by one only after the fault times out.
  • For the network fault whose ID is 81078603, the fault handling policy can only be set to NotHandleFault, PreSeparateNPU, or SeparateNPU. If other policies are configured, NotHandleFault is used by default.
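
The special handling of the network fault 81078603 can be sketched as follows. This is an illustration only: the function names effective_policy and default_timeouts are hypothetical and are not part of Ascend Device Plugin. The allowed policies, the NotHandleFault fallback, and the default FaultTimeout and RecoverTimeout values come from this section.

```python
NETWORK_FAULT_ID = "81078603"

# Policies that may be configured for the network fault.
NETWORK_FAULT_ALLOWED = {"NotHandleFault", "PreSeparateNPU", "SeparateNPU"}


def effective_policy(event_id: str, configured_policy: str) -> str:
    """Return the policy that takes effect for a configured FaultHandling."""
    if event_id == NETWORK_FAULT_ID and configured_policy not in NETWORK_FAULT_ALLOWED:
        # Any other policy for this fault falls back to NotHandleFault.
        return "NotHandleFault"
    return configured_policy


def default_timeouts(event_id: str) -> dict:
    """Return the default FaultTimeout and RecoverTimeout, in seconds."""
    if event_id == NETWORK_FAULT_ID:
        return {"FaultTimeout": 20, "RecoverTimeout": 60}
    return {"FaultTimeout": 0, "RecoverTimeout": 0}


print(effective_policy("81078603", "RestartNPU"))
print(effective_policy("81078603", "PreSeparateNPU"))
print(default_timeouts("81078603"))
```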