Configuration File Description
Resumable training can handle processor faults based on their level, frequency, and duration.
- For processor faults of different levels, Ascend Device Plugin obtains the fault code of the current fault from faultCode.json and handles the fault according to the configured fault level.
- For processor faults of different frequencies and durations, Ascend Device Plugin obtains the fault code of the current fault from faultCustomization.json and handles the fault according to the configured fault frequency and duration.
faultCode.json and faultCustomization.json are system configuration files. Do not modify them unless otherwise required. If you need to change the fault level of a fault code, modify mindx-dl-fault-config, which is created from faultCode.json and faultCustomization.json.
- For details about the fault code of each fault, see Processor Fault Code Reference Documents.
- For details about the processor fault levels that can be configured, see Fault Levels.
- For details about the processor fault frequencies and durations, see Fault Frequencies and Durations.
Fault Levels in faultCode.json
Resumable training handles processor faults by level. If you want to change the fault level of a fault code, see (Optional) Configuring Processor Fault Levels.
After obtaining the processor fault codes from the driver, Ascend Device Plugin classifies the faults into the following eight levels based on their impacts on devices and services. For details, see Table 1.
Fault Handling Policy |
Description |
Rescheduling |
Graceful Fault Tolerance |
|---|---|---|---|
NotHandleFault |
Faults have no impact on services and do not require handling. |
No handling is required. |
No handling is required. |
RestartRequest |
Faults affect service execution. Corresponding service requests need to be re-executed. |
Isolate the corresponding processor, and reschedule the related job. NOTE:
When an inference job subscribes to fault information, a RestartRequest fault on its inference card will not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled. |
Re-execute inference requests in the inference scenario, or re-execute training services in the training scenario. |
RestartBusiness |
Faults affect service execution. Corresponding services need to be re-executed. |
Re-execute the services. |
|
FreeRestartNPU |
Faults affect service execution. Corresponding processors need to be reset when idle. |
Reset the processors after they become idle. |
|
RestartNPU |
Faults affect service execution. Corresponding processors need to be reset immediately. |
Stop the training service immediately, reset the processors, and re-execute the service. |
|
SeparateNPU |
Faults cannot be rectified. Corresponding processors need to be isolated. |
Isolate the corresponding processor, and reschedule the related job. |
|
PreSeparateNPU |
Services are not affected temporarily. Jobs will not be scheduled to the processor. |
Pre-isolate the processors. |
Pre-isolate the processors. |
SubHealthFault |
See subHealthyStrategy in the job YAML file (Table 1). |
If a subhealth fault occurs on the processor, rectify the fault based on YAML configurations. NOTE:
If a fault of another level occurs on the processor, this policy does not affect the handling of the fault. |
Perform operations based on the policy. |
- Stop involved training processes before resetting processors. Otherwise, the reset will fail.
- If Ascend Device Plugin receives an unrecognized fault code (one that is not saved in faultCode.json) through subscription, it handles the fault based on the handling suggestion provided by the subscription interface. If the fault level reported by the subscription interface is info or minor, the fault is handled as NotHandleFault. For any other fault level, the fault is handled as SeparateNPU.
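The fallback rule for unrecognized fault codes can be sketched as follows. This is a minimal illustration, not Ascend Device Plugin code: the function name `resolve_fault_level`, the placeholder codes `CODE_A` and `CODE_B`, the severity names other than info and minor, and the case-insensitive comparison are all assumptions made for the example.

```python
from typing import Mapping

# Reported severities that the text above maps to NotHandleFault
# when the fault code is not found in faultCode.json.
LOW_SEVERITIES = {"info", "minor"}


def resolve_fault_level(fault_code: str,
                        reported_severity: str,
                        configured_levels: Mapping[str, str]) -> str:
    """Return the handling level for a fault code (hypothetical helper)."""
    # A recognized code uses the level configured for it.
    if fault_code in configured_levels:
        return configured_levels[fault_code]
    # Unrecognized code: info or minor faults are not handled.
    # Case-insensitive matching is an assumption of this sketch.
    if reported_severity.lower() in LOW_SEVERITIES:
        return "NotHandleFault"
    # Unrecognized code with any other severity: isolate the processor.
    return "SeparateNPU"


# CODE_A and CODE_B are placeholders, not real fault codes.
configured = {"CODE_A": "RestartNPU"}
print(resolve_fault_level("CODE_A", "major", configured))
print(resolve_fault_level("CODE_B", "minor", configured))
print(resolve_fault_level("CODE_B", "critical", configured))
```

The conservative default (SeparateNPU) reflects the rule stated above: a fault that is neither configured nor reported as info or minor is treated as one that cannot be rectified.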
Fault Frequencies and Durations
Resumable training handles processor faults based on the fault frequency and duration. Some hardware faults may occur repeatedly in a training job. As a result, the training job is interrupted and rescheduled repeatedly. After obtaining the fault codes of these faults, the cluster scheduling components provide the initial configuration file faultCustomization.json to escalate the fault levels.
- For relationships between the initial configurations and fault types provided by faultCustomization.json, see Initial Configurations and Fault Types.
- For the default configurations (default values) of faultCustomization.json, see Table 2.
- For details about how to change the fault frequency and duration, see (Optional) Configuring Processor Fault Frequencies and Durations.
Initial Configurations and Fault Types
The faultCustomization.json file provides only initial configurations for escalating the severity of identifiable hardware faults.
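If one of the following faults occurs within 24 hours (TimeWindow of 86400 seconds) at least as many times as the configured Times value, its fault level is escalated to ManuallySeparateNPU, which requires manual intervention. For details, see Parameters in faultCustomization.json.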
"FaultFrequency": [
{
"EventId": [
"80C98000","80B78000","80B58000","80A18008","80A38008","80A58008","80B98000","80B98008","80BB8000",
"80BB8008","80BD8000","80BD8008","80C78008","80C98008","80CB8008","80CD8008","80CF8008","80D98008",
"80DF8008","80DE1801","80E01801","80E18008","80E38008","80E39200","80E3A202","80E3A203","80E78000",
"80E78008","80F18000","80F18008","80F38008","80F78008","81318008","81338008","813B8008","81478008",
"81578008","815F8008","81938008","81958008","81978008"
],
"TimeWindow": 86400,
"Times": 2,
"FaultHandling": "ManuallySeparateNPU"
},
{
"EventId": ["80E18005"],
"TimeWindow": 86400,
"Times": 3,
"FaultHandling": "ManuallySeparateNPU"
}
],
- When ManuallySeparateNPU is used, the processor is still isolated after the fault is rectified. In this case, you need to manually recover the processor that is forcibly isolated. For details, see Step 8.
- In addition to identifiable hardware faults, the faultCustomization.json file contains the following types of faults:
  - Faults that do not need to be handled: This type of fault does not affect training jobs or devices, and there is no initial configuration to escalate the fault level.
  - Faults of uncertain type: It is hard to determine whether these faults are caused by hardware or software, but they affect training jobs. There is no initial configuration to escalate the fault level. You are advised to manually configure the maximum number of resumable training times supported by a job and the fault handling policy to apply when that maximum is reached. For details, see (Optional) Configuring Processor Fault Frequencies and Durations.
  - Software configuration faults: This fault type is uncommon, and there is no initial configuration to escalate the fault level. You are advised to check whether the software version is correct.
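The frequency-based escalation described in this section can be pictured as a sliding-window counter. The sketch below is illustrative only and is not how Ascend Device Plugin is implemented: the class name `FaultFrequencyTracker` is hypothetical, occurrences are counted for a single processor, and the start of the time window is treated as inclusive. The rule used in the example is the 80E18005 entry from the initial configuration shown above.

```python
from collections import defaultdict, deque
from typing import Optional


class FaultFrequencyTracker:
    """Hypothetical sliding-window counter driven by FaultFrequency rules."""

    def __init__(self, rules):
        # Map each EventId to the first rule that lists it, so a later
        # duplicate rule for the same code is ignored. Validation of
        # whether a rule is "correct" is not modeled here.
        self._rule_by_event = {}
        for rule in rules:
            for event_id in rule["EventId"]:
                self._rule_by_event.setdefault(event_id, rule)
        # Timestamps (in seconds) of recent occurrences per fault code.
        self._timestamps = defaultdict(deque)

    def record(self, event_id: str, now: float) -> Optional[str]:
        """Record one occurrence; return the escalated policy or None."""
        rule = self._rule_by_event.get(event_id)
        if rule is None:
            return None  # no frequency rule configured for this code
        window = self._timestamps[event_id]
        window.append(now)
        # Drop occurrences older than (now - TimeWindow).
        while window and window[0] < now - rule["TimeWindow"]:
            window.popleft()
        # Escalate once the count reaches the configured Times value.
        if len(window) >= rule["Times"]:
            return rule["FaultHandling"]
        return None


rules = [{
    "EventId": ["80E18005"],
    "TimeWindow": 86400,
    "Times": 3,
    "FaultHandling": "ManuallySeparateNPU",
}]
tracker = FaultFrequencyTracker(rules)
for second in (0.0, 3600.0, 7200.0):
    print(second, tracker.record("80E18005", second))
```

With this rule, the first two occurrences return None and the third occurrence inside the 24-hour window returns ManuallySeparateNPU.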
Parameters in faultCustomization.json
If you do not manually modify faultCustomization.json, Ascend Device Plugin handles faults based on the default values set in faultCustomization.json.
| Level-1 Parameter | Level-2 Parameter | Description |
|---|---|---|
| GraceTolerance | - | Graceful fault tolerance configuration. NOTE: If GraceTolerance and its sub-parameters do not exist or exceed the value ranges, the default values are used. |
| - | WaitProcessReadCMTime | Duration for waiting for the management process to read the ConfigMap file when the graceful fault tolerance mode is used. The value ranges from 5 to 90, in seconds. The default value is 30. |
| - | WaitDeviceResetTime | Maximum duration for waiting for the processor to restart when the graceful fault tolerance mode is used. The value ranges from 60 to 180, in seconds. The default value is 150. |
| - | WaitFaultSelfHealingTime | Duration for waiting for a RestartBusiness-level fault to recover when the graceful fault tolerance mode is used. The value ranges from 1 to 30, in seconds. The default value is 15. |
| FaultFrequency | - | User-defined fault frequency. That is, when the number of occurrences of a fault in the time window reaches the upper limit, the fault is handled based on the configured fault handling policy. NOTE: |
| - | EventId | Fault code. NOTE: Only one FaultFrequency parameter can be configured for each fault code (EventId). If multiple FaultFrequency parameters are configured, only the first correct one takes effect. |
| - | TimeWindow | Time window. Faults are counted over the period from (current time - TimeWindow) to the current time. The value ranges from 60 to 864,000, in seconds. |
| - | Times | Maximum number of resumable training times supported by a job, that is, maximum number of times that a fault occurs. The value ranges from 1 to 100. If the number of occurrences of the fault in a specific time window is greater than or equal to the value of this parameter, the fault is processed and reported according to the policy defined in FaultHandling. |
| - | FaultHandling | Fault handling policies after the maximum number of resumable training times is reached. Fault handling policies of different levels can be configured. NOTE: |
| FaultDuration | - | User-defined fault timeout policy. When the duration of a fault reaches the upper limit, the fault is handled based on the specified fault handling policy. NOTE: |
| - | EventId | Fault code ID. NOTE: Only one FaultDuration parameter can be configured for each fault code (EventId). If multiple FaultDuration parameters are configured, only the first correct one takes effect. |
| - | FaultTimeout | If the fault duration exceeds the value of this parameter, the fault is handled based on the policy defined in FaultHandling. The value ranges from 0 to 600, in seconds. The default values are as follows: |
| - | RecoverTimeout | If the fault recovery time exceeds the value of this parameter, a fault recovery message is reported. The value ranges from 0 to 86400, in seconds. The default values are as follows: |
| - | FaultHandling | Fault handling policies after the fault duration expires. Fault handling policies of different levels can be configured. NOTE: It is recommended that the fault handling policy after the fault duration expires be set to a higher level than the original policy. Otherwise, the configuration does not take effect. |
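The GraceTolerance fallback rule (missing or out-of-range values revert to their defaults) can be sketched as below. The ranges and defaults are taken from the parameter descriptions above. The function name `effective_grace_tolerance`, the layout of the parsed configuration as a Python dict, and the choice to fall back per parameter rather than for the whole section are assumptions of this sketch, not documented behavior.

```python
# (minimum, maximum, default) in seconds, from the parameter descriptions.
GRACE_TOLERANCE_SPEC = {
    "WaitProcessReadCMTime": (5, 90, 30),
    "WaitDeviceResetTime": (60, 180, 150),
    "WaitFaultSelfHealingTime": (1, 30, 15),
}


def effective_grace_tolerance(config: dict) -> dict:
    """Return the GraceTolerance values that would take effect (hypothetical)."""
    section = config.get("GraceTolerance")
    if not isinstance(section, dict):
        section = {}  # section missing or malformed: use all defaults
    result = {}
    for name, (low, high, default) in GRACE_TOLERANCE_SPEC.items():
        value = section.get(name)
        # Accept only integers inside the documented range
        # (bool is excluded because it is a subclass of int in Python).
        in_range = (isinstance(value, int)
                    and not isinstance(value, bool)
                    and low <= value <= high)
        result[name] = value if in_range else default
    return result


custom = {"GraceTolerance": {"WaitProcessReadCMTime": 60,
                             "WaitDeviceResetTime": 500}}
print(effective_grace_tolerance(custom))
```

In this example, WaitProcessReadCMTime keeps the configured 60, WaitDeviceResetTime falls back to 150 because 500 is outside 60 to 180, and WaitFaultSelfHealingTime falls back to 15 because it is not set.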