Configuration File Description

Resumable training handles public faults by level. ClusterD obtains the fault code of the current fault and handles the fault according to the fault level configured in publicFaultConfiguration.json. If ClusterD receives an unidentified fault code (one that is not listed in the configuration file), it discards the fault.

publicFaultConfiguration.json is the system configuration file for public faults. Do not modify it unless otherwise specified. To change the severity or sender of a public fault, write a custom configuration file named publicCustomization.json in /user1/mindx-dl/clusterd. The file path can be configured as follows.

  • In the container, publicCustomization.json is stored in /user1/mindx-dl/clusterd, which is also the default host path. The in-container path cannot be modified and must not contain soft links.
  • You can configure the host path as required by modifying the host mount path of the config-clusterd volume in the YAML file for starting ClusterD.
  • If there are multiple master nodes, you are advised to keep the latest publicCustomization.json file synchronized on every master node. Otherwise, the custom fault configuration is lost if ClusterD is scheduled to a different master node after a restart.
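
The soft-link restriction above can be checked before ClusterD is started. The following is a minimal sketch, assuming the restriction applies to every component of the path; the helper name path_has_symlink is hypothetical and is not part of ClusterD.

```python
import os


def path_has_symlink(path: str) -> bool:
    """Return True if the path or any of its parent directories is a symbolic link."""
    current = os.path.abspath(path)  # normalizes the path without resolving links
    while True:
        if os.path.islink(current):
            return True
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return False
        current = parent
```

For example, path_has_symlink("/user1/mindx-dl/clusterd/publicCustomization.json") could be run on each master node against the host path that is mounted into the container.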

Table 1 Fault handling policy description

| Fault Level | Fault Handling Policy | Rescheduling |
|---|---|---|
| NotHandleFault | No handling is required. | No handling is required. |
| SeparateNPU | Faults cannot be rectified. Corresponding processors need to be isolated. | Isolate the corresponding processor, and reschedule the related job. |
| SubHealthFault | See subHealthyStrategy in the job YAML file (Table 1). | If the processor is subhealthy, rectify the fault based on Job YAML Configuration Example. NOTE: If a fault of another severity level occurs on the processor, SubHealthFault does not affect the handling of that fault. |
| PreSeparateNPU | Services are not affected for the time being. No job will be scheduled to the processor. | Pre-isolate the processor. |

Table 2 publicFaultConfiguration.json fields

| Parameter | Description |
|---|---|
| publicFaultCode | Configurations related to public fault codes |
| publicFaultResource | Configurations related to public fault senders |

Table 3 publicFaultCode fields

| Parameter | Description |
|---|---|
| NotHandleFaultCodes | Fault code whose fault level is NotHandleFault. |
| SubHealthFaultCodes | Fault code whose fault level is SubHealthFault. |
| SeparateNPUCodes | Fault code whose fault level is SeparateNPU. |
| PreSeparateNPUCodes | Fault code whose fault level is PreSeparateNPU. |
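
The level lookup described at the start of this section can be sketched as follows. This is an illustration only: the field names come from Tables 2 and 3, but the list-of-strings layout of each field, the sample content, and the helper lookup_level are assumptions, not the actual file format or ClusterD code.

```python
import json
from typing import Optional

# Hypothetical excerpt. Only the field names are taken from Tables 2 and 3;
# the list-of-strings layout is an assumption made for this sketch.
SAMPLE_CONFIG = """
{
  "publicFaultCode": {
    "NotHandleFaultCodes": ["200001006"],
    "SubHealthFaultCodes": ["010001001"],
    "SeparateNPUCodes": ["220001001"],
    "PreSeparateNPUCodes": []
  }
}
"""

# Maps each publicFaultCode field to the fault level it represents (Table 3).
FIELD_TO_LEVEL = {
    "NotHandleFaultCodes": "NotHandleFault",
    "SubHealthFaultCodes": "SubHealthFault",
    "SeparateNPUCodes": "SeparateNPU",
    "PreSeparateNPUCodes": "PreSeparateNPU",
}


def lookup_level(config: dict, fault_code: str) -> Optional[str]:
    """Return the configured fault level, or None for an unidentified code."""
    code_section = config.get("publicFaultCode", {})
    for field, level in FIELD_TO_LEVEL.items():
        if fault_code in code_section.get(field, []):
            return level
    return None  # not listed in the configuration: the fault would be discarded


config = json.loads(SAMPLE_CONFIG)
```

With this sample, lookup_level(config, "220001001") yields SeparateNPU, while a code that appears in none of the lists yields None, which corresponds to the discarded case.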

Fault Code Description

The fault code of a public fault consists of nine decimal digits, which are described as follows.

Table 4 Fault code description

| Digit | Description | Value |
|---|---|---|
| 1 | Fault type | 0: processor fault; 1: node fault; 2: network fault; 3: storage fault |
| 2 | Default fault level | 0: NotHandleFault; 1: SubHealthFault; 2: SeparateNPU |
| 3 and 4 | Reserved for extension | 00 for now |
| 5 | Whether the fault code in the sixth to ninth digits is user-defined (used to avoid conflicts) | 0: defined in the release package; 1: user-defined |
| 6-9 | Decimal fault code | Example: 1001 |

Example:

0100 01001: processor fault with fault code 1001 at the SubHealthFault level, which is defined in the release package.

1000 11002: node fault with fault code 1002 at the NotHandleFault level, which is user-defined.

2200 01003: network fault with fault code 1003 at the SeparateNPU level, which is defined in the release package.
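
The positional layout in Table 4 can be decoded mechanically. The sketch below is not ClusterD code: the names decode_fault_code and DecodedFaultCode are hypothetical, the reserved third and fourth positions are ignored, and values not listed in Table 4 are rejected as an assumption of this example.

```python
from dataclasses import dataclass

# Value mappings copied from Table 4.
FAULT_TYPES = {
    "0": "processor fault",
    "1": "node fault",
    "2": "network fault",
    "3": "storage fault",
}
DEFAULT_LEVELS = {
    "0": "NotHandleFault",
    "1": "SubHealthFault",
    "2": "SeparateNPU",
}


@dataclass
class DecodedFaultCode:
    fault_type: str
    default_level: str
    user_defined: bool
    number: str  # positions 6 to 9, kept as a string to preserve leading zeros


def decode_fault_code(code: str) -> DecodedFaultCode:
    """Decode a nine-position public fault code; spaces are ignored."""
    digits = code.replace(" ", "")
    if len(digits) != 9 or not (digits.isascii() and digits.isdecimal()):
        raise ValueError(f"expected nine decimal digits, got {code!r}")
    if digits[0] not in FAULT_TYPES:
        raise ValueError(f"unknown fault type {digits[0]!r}")
    if digits[1] not in DEFAULT_LEVELS:
        raise ValueError(f"unknown default fault level {digits[1]!r}")
    if digits[4] not in ("0", "1"):
        raise ValueError(f"unknown user-defined flag {digits[4]!r}")
    return DecodedFaultCode(
        fault_type=FAULT_TYPES[digits[0]],
        default_level=DEFAULT_LEVELS[digits[1]],
        user_defined=digits[4] == "1",
        number=digits[5:],
    )
```

Calling decode_fault_code("1000 11002") returns a node fault at the NotHandleFault level with user_defined set to True and number "1002", matching the second example above.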

Known Public Faults

Table 5 Known public faults

| Fault Code | Fault Description | Default Fault Level |
|---|---|---|
| 010001001 | Optical link contamination (processor fault) | SubHealthFault |
| 210001007 | Optical link contamination (network fault) | SubHealthFault |
| 220001001 | NPU HCCS network fault | SeparateNPU |
| 010001004 | Loose optical link (processor fault) | SubHealthFault |
| 210001008 | Loose optical link (network fault) | SubHealthFault |
| 310001005 | DPC client failure | SubHealthFault |
| 200001006 | Suspected optical link subhealth | NotHandleFault |
| 210001009 | Optical module subhealth | SubHealthFault |
| 220001002 | Non-existent backup subrack resources used for scheduling in backup SuperPoD scenarios | SeparateNPU |
| 220001003 | Backup subrack resource port fault | SeparateNPU |
| 220001004 | Job ID conflict of the backup subrack | SeparateNPU |
| 220001005 | Invalid NetMind | SeparateNPU |
| 220001006 | Potential invalid ports on the backup subrack link | SeparateNPU |
| 220001007 | Optical link adjustment failure | SeparateNPU |
| 200001010 | Slow network detected/recovered in a node (slow network fault) | NotHandleFault |
| 200001011 | Inter-node slow network detected/recovered in a SuperPoD (slow network fault) | NotHandleFault |
| 200001012 | Slow network not caused by a card fault (slow network fault) | NotHandleFault |
| 110001010 | Slow node fault | SubHealthFault |
| 100001011 | Deterioration recovered (slow node fault) | NotHandleFault |
| 110001020 | Abnormal DPC process of shared storage | SubHealthFault |
| 110001021 | Insufficient DPC memory of shared storage | SubHealthFault |