A Processor Fault Causes Training Jobs to Be Interrupted and Rescheduled Repeatedly

Symptom

A processor fails repeatedly. As a result, the training job is interrupted and rescheduled repeatedly.

Cause Analysis

When a processor becomes faulty, resumable training makes the training process exit and reschedules the job. The processor then returns to the healthy state through self-healing, so it can be selected for training again in subsequent job scheduling. However, there is a high probability that the fault will recur on the same processor and interrupt the training again.

Solution

To prevent a repeatedly failing processor from being reused for training, configure the maximum number of processor faults and the fault handling level that applies when the maximum number is reached. For details, see Configuring Processor Fault Frequencies and Durations.
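
The following Python sketch illustrates the general idea behind a fault frequency limit. It is not the product's implementation or configuration format. The class name `FaultFrequencyTracker`, the parameters `max_faults` and `window_seconds`, and the processor ID `npu-0` are all hypothetical, and the sketch assumes that faults are counted per processor within a sliding time window.

```python
from collections import defaultdict, deque


class FaultFrequencyTracker:
    """Hypothetical tracker that counts faults per processor in a sliding window."""

    def __init__(self, max_faults, window_seconds):
        self.max_faults = max_faults          # assumed threshold for escalation
        self.window_seconds = window_seconds  # assumed counting window
        self._faults = defaultdict(deque)     # processor ID -> fault timestamps

    def record_fault(self, processor_id, timestamp):
        """Record a fault; return True if the fault limit has been reached."""
        history = self._faults[processor_id]
        history.append(timestamp)
        # Discard faults that are older than the counting window.
        while history and timestamp - history[0] > self.window_seconds:
            history.popleft()
        return len(history) >= self.max_faults


tracker = FaultFrequencyTracker(max_faults=3, window_seconds=3600)
# Three faults within one hour on the same (hypothetical) processor.
decisions = [tracker.record_fault("npu-0", t) for t in (0, 600, 1200)]
print(decisions)
```

In this sketch, the third fault within the window reaches the limit, which is the point at which a stricter handling level (for example, keeping the processor out of scheduling) would take effect instead of returning the processor to service after self-healing.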