Configuration File Description

Resumable training ignores collateral faults caused by special faults in associated scenarios. ClusterD obtains the special faults and processes them based on the associated fault policies configured in the relationFaultCustomization.json and faultDuration.json files.

relationFaultCustomization.json and faultDuration.json are system configuration files. Do not modify them unless necessary.

Table 1 relationFaultCustomization.json file description

| Parameter | Description | Value |
|---|---|---|
| TriggerFault | Collateral fault code. Currently, fault codes configured in faultCode.json and SwitchFaultCode.json are supported. | String |
| RelationFaults | List of faults to be associated, which can be one or more fault codes. Currently, fault codes configured in faultCode.json and SwitchFaultCode.json are supported. | String list |
| FaultStrategy | Processing policy of a job when the associated fault is successfully matched. Separate: job isolation. SubHealth: job subhealth. | String |

Note:

When a fault listed in RelationFaults occurs, ClusterD adds it to the queue of fault codes to be processed. If the fault specified by TriggerFault occurs within the interval configured by TimeOutInterval, the job is processed based on the configured FaultStrategy. If TimeOutInterval elapses without a match, the queued fault is handled according to its type: an interconnect device fault is processed using the SubHealth policy, and a processor fault or parameter plane network fault is ignored.
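The decision flow in this note can be sketched in Python. This is only an illustration, not ClusterD code: the class names, the resolve function, the "Pending" and "Ignore" labels, and the fault codes are all hypothetical. It also assumes that the interval starts when the RelationFaults fault is queued and that the faultDuration.json entry belongs to the queued fault code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RelationRule:
    """Hypothetical in-memory form of one relationFaultCustomization.json entry."""
    trigger_fault: str          # TriggerFault
    relation_faults: List[str]  # RelationFaults
    fault_strategy: str         # FaultStrategy: "Separate" or "SubHealth"


@dataclass
class DurationRule:
    """Hypothetical in-memory form of one faultDuration.json entry."""
    fault_code: str             # FaultCode
    fault_type: str             # FaultType: "faultDevice" or "faultSwitch"
    time_out_interval: int      # TimeOutInterval, in seconds


def resolve(queued_code: str, queued_at: float, trigger_seen_at: Optional[float],
            now: float, relation: RelationRule, duration: DurationRule) -> Optional[str]:
    """Return the outcome for a queued fault, or None if the rule does not cover it."""
    if queued_code not in relation.relation_faults:
        return None
    # Assumption: the window opens when the fault is queued.
    deadline = queued_at + duration.time_out_interval
    if trigger_seen_at is not None and queued_at <= trigger_seen_at <= deadline:
        return relation.fault_strategy  # matched within TimeOutInterval
    if now <= deadline:
        return "Pending"                # still waiting for TriggerFault
    # Timed out: interconnect device faults go to SubHealth, others are ignored.
    return "SubHealth" if duration.fault_type == "faultSwitch" else "Ignore"
```

For example, with a 60-second TimeOutInterval and a fault queued at t=100, a TriggerFault seen at t=130 yields the configured FaultStrategy, while no TriggerFault by t=161 yields SubHealth for a faultSwitch entry.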

Table 2 faultDuration.json file description

| Parameter | Description | Value |
|---|---|---|
| FaultCode | Fault code. Currently, fault codes configured in faultCode.json and SwitchFaultCode.json are supported. | String |
| FaultType | Fault type. faultDevice: processor or parameter plane network fault. faultSwitch: interconnect device fault. | String |
| TimeOutInterval | Maximum association time of a fault code, in seconds. | Integer |
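If the file does need to be edited, a quick check of the three fields against the table above may help catch mistakes. The following Python sketch is illustrative only: the validate_duration_entries helper and the placeholder fault code are hypothetical, and the top-level layout of faultDuration.json is assumed here to be a JSON array of objects, which may not match the shipped file.

```python
import json

ALLOWED_FAULT_TYPES = {"faultDevice", "faultSwitch"}

# Assumed layout: a JSON array of objects. The fault code is a placeholder.
SAMPLE = """
[
  {"FaultCode": "PLACEHOLDER_CODE", "FaultType": "faultSwitch", "TimeOutInterval": 60}
]
"""


def validate_duration_entries(text: str) -> list:
    """Parse the entries and return a list of problems (empty if none are found)."""
    problems = []
    for i, entry in enumerate(json.loads(text)):
        if not isinstance(entry.get("FaultCode"), str):
            problems.append(f"entry {i}: FaultCode must be a string")
        if entry.get("FaultType") not in ALLOWED_FAULT_TYPES:
            problems.append(f"entry {i}: FaultType must be faultDevice or faultSwitch")
        interval = entry.get("TimeOutInterval")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(interval, bool) or not isinstance(interval, int):
            problems.append(f"entry {i}: TimeOutInterval must be an integer (seconds)")
    return problems


if __name__ == "__main__":
    print(validate_duration_entries(SAMPLE))
```

The check does not verify that FaultCode actually appears in faultCode.json or SwitchFaultCode.json; that lookup would need those files.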