Configuration File Description
Resumable training handles public faults by level. ClusterD obtains the fault code of the current fault and handles the fault based on the fault level configured in publicFaultConfiguration.json. If ClusterD receives an unidentified fault code (one that is not saved in the configuration file), it discards the fault.
publicFaultConfiguration.json is the system configuration file of public faults. Do not modify it unless otherwise specified. To change the severity or sender of a public fault, write the custom configuration file publicCustomization.json in /user1/mindx-dl/clusterd. The file path can be configured as follows.
- Inside the container, publicCustomization.json is stored in /user1/mindx-dl/clusterd, which is also the default host path. The in-container path cannot be modified and must not contain soft links.
- You can configure the host path as required by modifying the host mount path of the config-clusterd volume in the YAML file for starting ClusterD.
- If there are multiple master nodes, you are advised to synchronize the latest publicCustomization.json file to every master node. Otherwise, the custom fault configuration file is lost if ClusterD is scheduled to another master node after a restart.
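
Because the configuration path must not contain soft links, it can help to check a candidate directory before placing publicCustomization.json there. The following is a minimal Python sketch of such a check. The function name `find_soft_links` is hypothetical, and the check ClusterD itself performs may differ.

```python
import os
from pathlib import Path
from typing import List


def find_soft_links(path: str) -> List[str]:
    """Return every component of `path` (the path itself included) that is a soft link.

    Hypothetical helper for illustration; it does not reproduce ClusterD's own validation.
    """
    # abspath normalizes the path without resolving soft links,
    # so linked components are still visible to is_symlink().
    absolute = Path(os.path.abspath(path))
    components = [absolute, *absolute.parents]
    # Report components from the root downwards.
    return [str(c) for c in reversed(components) if c.is_symlink()]


# An empty list means that no component of the path is a soft link.
# find_soft_links("/user1/mindx-dl/clusterd/publicCustomization.json")
```

Components that do not exist are reported as not being soft links, so run the check after the directory has been created.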
| Fault Level | Fault Handling Policy | Rescheduling |
|---|---|---|
| NotHandleFault | No handling is required. | No handling is required. |
| SeparateNPU | Faults cannot be rectified. Corresponding processors need to be isolated. | Isolate the corresponding processor, and reschedule the related job. |
| SubHealthFault | See subHealthyStrategy in the job YAML file (Table 1). | If the processor is subhealthy, rectify the fault based on Job YAML Configuration Example. NOTE: If a fault of another severity level occurs on the processor, SubHealthFault does not affect the handling of that fault. |
| PreSeparateNPU | Services are not affected for the time being. No job will be scheduled to the processor. | Pre-isolate the processor. |

| Parameter | Description |
|---|---|
| | Configurations related to public fault codes |
| publicFaultResource | Configurations related to public fault senders |

| Parameter | Description |
|---|---|
| NotHandleFaultCodes | Fault code whose fault level is NotHandleFault. |
| SubHealthFaultCodes | Fault code whose fault level is SubHealthFault. |
| SeparateNPUCodes | Fault code whose fault level is SeparateNPU. |
| PreSeparateNPUCodes | Fault code whose fault level is PreSeparateNPU. |

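
The four code lists above map each fault code to a level, and a code that appears in none of them is discarded. The lookup can be sketched in Python as follows. This is an illustration only, not ClusterD code: `build_level_index`, `resolve_level`, and the in-memory `sample_config` dictionary are hypothetical, the real JSON file may nest these lists differently, and rejecting a code listed under two levels is an assumption of this sketch. The sample codes are taken from the known public faults listed in this document.

```python
from typing import Dict, List, Optional

# Parameter name -> fault level, as listed in the parameter table.
LEVEL_BY_PARAMETER = {
    "NotHandleFaultCodes": "NotHandleFault",
    "SubHealthFaultCodes": "SubHealthFault",
    "SeparateNPUCodes": "SeparateNPU",
    "PreSeparateNPUCodes": "PreSeparateNPU",
}


def build_level_index(config: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the per-level code lists into a code -> level lookup."""
    index: Dict[str, str] = {}
    for parameter, level in LEVEL_BY_PARAMETER.items():
        for code in config.get(parameter, []):
            # Assumption of this sketch: a code may belong to one level only.
            if code in index and index[code] != level:
                raise ValueError(f"fault code {code} is listed under two levels")
            index[code] = level
    return index


def resolve_level(index: Dict[str, str], code: str) -> Optional[str]:
    """Return the configured level, or None for an unidentified code (to be discarded)."""
    return index.get(code)


# Hypothetical sample data; the layout of the real file may differ.
sample_config = {
    "NotHandleFaultCodes": ["200001006"],
    "SubHealthFaultCodes": ["010001001", "210001007"],
    "SeparateNPUCodes": ["220001001"],
    "PreSeparateNPUCodes": [],
}
index = build_level_index(sample_config)
print(resolve_level(index, "220001001"))  # SeparateNPU
print(resolve_level(index, "999999999"))  # None: not configured, so the fault is discarded
```

Building the inverted index once keeps each lookup a single dictionary access, which suits a component that receives many fault reports.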
Fault Code Description
The fault code of a public fault consists of nine decimal digits, which are described as follows.

| Digit | Description | Value |
|---|---|---|
| 1 | Fault type | 0: processor fault; 1: node fault; 2: network fault; 3: storage fault |
| 2 | Default fault level | 0: NotHandleFault; 1: SubHealthFault; 2: SeparateNPU |
| 3 and 4 | Reserved for extension | 00 for now |
| 5 | Whether the fault code in the sixth to ninth digits is user-defined (used to avoid conflicts) | 0: defined in the release package; 1: user-defined |
| 6-9 | Decimal fault code | Example: 1001 |

Examples:
- 0100 01001: processor fault with fault code 1001 at the SubHealthFault level, which is defined in the release package.
- 1000 11002: node fault with fault code 1002 at the NotHandleFault level, which is user-defined.
- 2200 01003: network fault with fault code 1003 at the SeparateNPU level, which is defined in the release package.

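
The layout described above can be decoded mechanically. The following minimal Python sketch splits a fault code into its documented fields. It is not ClusterD's implementation: the names `PublicFaultCode` and `parse_fault_code` are hypothetical, and the value tables are copied from the description above.

```python
from dataclasses import dataclass

# Value tables copied from the fault code description.
FAULT_TYPES = {
    "0": "processor fault",
    "1": "node fault",
    "2": "network fault",
    "3": "storage fault",
}
DEFAULT_LEVELS = {"0": "NotHandleFault", "1": "SubHealthFault", "2": "SeparateNPU"}


@dataclass(frozen=True)
class PublicFaultCode:
    """Decoded fields of a public fault code (hypothetical helper type)."""
    fault_type: str
    default_level: str
    reserved: str
    user_defined: bool
    code: str


def parse_fault_code(raw: str) -> PublicFaultCode:
    """Split a public fault code into fault type, default level, origin, and fault code."""
    digits = raw.replace(" ", "")  # accept the spaced spelling such as "0100 01001"
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected nine decimal characters, got {raw!r}")
    if digits[0] not in FAULT_TYPES:
        raise ValueError(f"unknown fault type {digits[0]!r}")
    if digits[1] not in DEFAULT_LEVELS:
        raise ValueError(f"unknown default fault level {digits[1]!r}")
    if digits[4] not in "01":
        raise ValueError(f"unknown origin flag {digits[4]!r}")
    return PublicFaultCode(
        fault_type=FAULT_TYPES[digits[0]],
        default_level=DEFAULT_LEVELS[digits[1]],
        reserved=digits[2:4],           # "00" for now; returned as-is, not validated
        user_defined=digits[4] == "1",
        code=digits[5:9],
    )


parsed = parse_fault_code("0100 01001")
print(parsed.fault_type, parsed.default_level, parsed.user_defined, parsed.code)
# processor fault SubHealthFault False 1001
```

The reserved positions are returned without validation so that the sketch keeps working if they are assigned a meaning later.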
Known Public Faults
| Fault Code | Fault Description | Default Fault Level |
|---|---|---|
| 010001001 | Optical link contamination (processor fault) | SubHealthFault |
| 210001007 | Optical link contamination (network fault) | SubHealthFault |
| 220001001 | NPU HCCS network fault | SeparateNPU |
| 010001004 | Loose optical link (processor fault) | SubHealthFault |
| 210001008 | Loose optical link (network fault) | SubHealthFault |
| 310001005 | DPC client failure | SubHealthFault |
| 200001006 | Suspected optical link subhealth | NotHandleFault |
| 210001009 | Optical module subhealth | SubHealthFault |
| 220001002 | Non-existent backup subrack resources used for scheduling in backup SuperPoD scenarios | SeparateNPU |
| 220001003 | Backup subrack resource port fault | SeparateNPU |
| 220001004 | Job ID conflict of the backup subrack | SeparateNPU |
| 220001005 | Invalid NetMind | SeparateNPU |
| 220001006 | Potential invalid ports on the backup subrack link | SeparateNPU |
| 220001007 | Optical link adjustment failure | SeparateNPU |
| 200001010 | Slow network detected/recovered in a node (slow network fault) | NotHandleFault |
| 200001011 | Inter-node slow network detected/recovered in a SuperPoD (slow network fault) | NotHandleFault |
| 200001012 | Slow network not caused by a card fault (slow network fault) | NotHandleFault |
| 110001010 | Slow node fault | SubHealthFault |
| 100001011 | Deterioration recovered (slow node fault) | NotHandleFault |
| 110001020 | Abnormal DPC process of shared storage | SubHealthFault |
| 110001021 | Insufficient DPC memory of shared storage | SubHealthFault |