A Processor Fault Causes Training Jobs to Be Interrupted and Rescheduled Repeatedly
Symptom
A processor becomes faulty repeatedly. As a result, the training job is interrupted and rescheduled repeatedly.
Cause Analysis
When a processor becomes faulty, resumable training makes the training process exit and reschedules the job. The processor then returns to the healthy state through self-healing, so it is used for training again in subsequent job scheduling. However, there is a high probability that the fault recurs on the same processor and interrupts the training again.
Solution
Configure the maximum number of processor faults and the fault handling level to apply when that maximum is reached. For details, see Configuring Processor Fault Frequencies and Durations.
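To illustrate the idea behind a maximum fault count, the following is a minimal Python sketch of counting faults per processor within a time window and flagging a processor once the limit is reached. It is not the product's implementation or configuration format: the class name FaultFrequencyTracker, the parameters max_faults and window_seconds, and the processor ID "processor-0" are all hypothetical and chosen only for illustration. The actual settings are described in the referenced topic.

```python
from collections import deque


class FaultFrequencyTracker:
    """Hypothetical sketch: count faults per processor in a sliding time window."""

    def __init__(self, max_faults, window_seconds):
        self.max_faults = max_faults          # assumed limit on faults per window
        self.window_seconds = window_seconds  # assumed length of the window
        self._faults = {}                     # processor ID -> deque of fault timestamps

    def record_fault(self, processor_id, timestamp):
        """Record one fault; return True if the fault limit has been reached."""
        history = self._faults.setdefault(processor_id, deque())
        history.append(timestamp)
        # Discard faults that are older than the window.
        while history and timestamp - history[0] > self.window_seconds:
            history.popleft()
        return len(history) >= self.max_faults


# Usage: three faults within one hour reach the assumed limit of 3.
tracker = FaultFrequencyTracker(max_faults=3, window_seconds=3600)
results = [tracker.record_fault("processor-0", t) for t in (0, 600, 1200)]
print(results)
```

In this sketch, a processor that reaches the limit would be handled at a different fault level instead of being returned to the scheduler after self-healing, which is what breaks the interrupt-and-reschedule loop described in the cause analysis.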