Feature Introduction

When a training node becomes faulty, the system isolates the faulty resource (the affected chip or node), then automatically reschedules and retrains the job that was running when the fault occurred. The training script must be adapted to support this. This feature is called resumable training, and it is supported by MindX DL and ModelArts. Resumable training provides two functions: rescheduling and retraining. Retraining includes a basic function, "fault recovery", and an advanced function, "dying gasp". Fault recovery resumes training from periodically saved checkpoints. Dying gasp saves the parameter state held in memory when the fault occurs, which captures the progress made since the last periodic checkpoint and reduces the amount of training progress that is lost.
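The difference between the two retraining functions can be illustrated with a minimal, framework-independent sketch. It is not the MindX DL or ModelArts interface: every name in it (`save_checkpoint`, `load_checkpoint`, `train`, `SimulatedFault`, `fail_at`, `dying_gasp`) is hypothetical, the fault is simulated by raising an exception, and the "model" is a single number. It only shows why saving the in-memory state at fault time loses less progress than falling back to the last periodic checkpoint.

```python
import json
import os
import tempfile


class SimulatedFault(Exception):
    """Hypothetical stand-in for a chip or node fault reported by the platform."""


def save_checkpoint(path, state):
    # Write to a temporary file first, then rename, so an interrupted write
    # cannot leave a half-written checkpoint behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # Fault recovery: resume from the saved state if one exists,
    # otherwise start from scratch.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, save_every, fail_at=None, dying_gasp=False):
    state = load_checkpoint(path)
    try:
        while state["step"] < total_steps:
            if fail_at is not None and state["step"] == fail_at:
                raise SimulatedFault(f"fault after step {state['step']}")
            state["weight"] += 0.5  # stand-in for one optimizer update
            state["step"] += 1
            if state["step"] % save_every == 0:
                save_checkpoint(path, state)  # periodic checkpoint
    except SimulatedFault:
        if dying_gasp:
            # Dying gasp: persist the in-memory state before the job stops.
            save_checkpoint(path, state)
        raise
    return state


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    for use_gasp in (False, True):
        ckpt = os.path.join(workdir, f"ckpt_gasp_{use_gasp}.json")
        try:
            train(ckpt, total_steps=10, save_every=5, fail_at=7, dying_gasp=use_gasp)
        except SimulatedFault:
            pass
        resumed_from = load_checkpoint(ckpt)["step"]
        train(ckpt, total_steps=10, save_every=5)  # rescheduled job resumes
        print(f"dying_gasp={use_gasp}: resumed from step {resumed_from}")
```

In this sketch the fault happens after step 7 with a checkpoint interval of 5. Without dying gasp the rescheduled job resumes from step 5 and repeats two steps; with dying gasp it resumes from step 7. In a real deployment the save at fault time would have to be triggered by the platform's fault notification and complete before the resource is isolated, which this sketch does not model.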