Resumable Training
Function Highlights
When a fault occurs during a training job, either the job is rescheduled to healthy devices or the faulty processor is automatically recovered, so that training can continue.
- Fault detection: Ascend Device Plugin, Volcano, ClusterD, and NodeD are used to detect job faults.
- Fault handling: After a fault occurs, it is handled based on the reported fault information. Two fault handling modes are supported:
  - Rescheduling mode: After a fault occurs, the job is rescheduled to other healthy devices.
  - Graceful fault tolerance mode: If a processor becomes faulty during training, the system attempts to recover the faulty processor automatically.
- Training recovery: After a training job is rescheduled, the checkpoint that is automatically saved before the fault occurs is used to resume training.
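The training recovery step can be illustrated with a generic checkpoint-and-resume loop. The following is a minimal sketch only, not the API of TaskD, MindIO, or any other component listed on this page. The names `save_checkpoint`, `load_checkpoint`, `train`, and the `fail_at` parameter are hypothetical and exist purely for illustration. It assumes checkpoints are written periodically to a path that is still reachable after the job is rescheduled.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write the checkpoint atomically so a fault mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step: readers see the old or the new
    # checkpoint, never a partially written one.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path, total_steps, save_every, fail_at=None):
    """Toy training loop that resumes from the last saved checkpoint.

    fail_at is a hypothetical knob that simulates a device fault.
    """
    step, state = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated device fault")
        step += 1
        state = {"loss": 1.0 / step}  # stand-in for a real training step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(path, step, state)
    return step, state
```

In this sketch, a run that fails between checkpoints loses only the steps taken since the last save. Calling `train` again with the same path, as a rescheduled job would, picks up from the saved step instead of starting from step 0.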
Required Components
- Volcano
- Ascend Operator
- Ascend Device Plugin
- Ascend Docker Runtime
- NodeD
- ClusterD
- TaskD
- (Optional) MindIO ACP
- (Optional) MindIO TFT
Instructions
- Refer to Installation and Deployment for component installation.
- Refer to Resumable Training for feature usage.
- TaskD must be installed in a container. For details, see Image Creation.
- For details about MindIO ACP and its installation procedure, see Checkpoint Saving, Loading, and Optimization.
- For details about MindIO TFT and its installation procedure, see Fault Recovery Acceleration.