Resumable Training

Function Highlights

When a fault occurs during a training job, either the job is rescheduled to healthy devices to continue training, or the faulty processor is automatically recovered.

  • Fault detection: Ascend Device Plugin, Volcano, ClusterD, and NodeD are used to detect job faults.
  • Fault handling: After a fault occurs, it is handled based on the reported fault information. Two fault handling modes are supported:
    • Rescheduling mode: After a fault occurs, jobs are rescheduled to other healthy devices.
    • Graceful fault tolerance mode: If a processor is faulty during training, the system attempts to automatically recover the faulty processor.
  • Training recovery: After a training job is rescheduled, training resumes from the checkpoint that was automatically saved before the fault occurred.
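
The training recovery step can be illustrated with a minimal sketch. It is not the implementation used by the components listed on this page. It only shows the general pattern of saving checkpoints periodically and resuming from the most recent one after a restart. All names here (save_checkpoint, load_latest_checkpoint, train, the step_*.json file layout, and the fail_at parameter used to simulate a fault) are hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, step, state):
    """Write a checkpoint for `step`, using a temp file plus rename so a
    crash mid-write never leaves a partial checkpoint behind."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step_{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)  # atomic on the same filesystem
    return final_path


def load_latest_checkpoint(ckpt_dir):
    """Return the newest checkpoint as a dict, or None if none exists."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(
        n for n in os.listdir(ckpt_dir)
        if n.startswith("step_") and n.endswith(".json")
    )
    if not names:
        return None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        return json.load(f)


def train(ckpt_dir, total_steps, save_every, fail_at=None):
    """Toy training loop that resumes from the latest checkpoint.

    `fail_at` is a hypothetical knob that simulates a fault at a given step.
    """
    ckpt = load_latest_checkpoint(ckpt_dir)
    step = ckpt["step"] if ckpt else 0
    state = ckpt["state"] if ckpt else {"loss_sum": 0.0}
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated fault")
        state["loss_sum"] += 1.0  # stand-in for one real training step
        step += 1
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step, state
```

In this sketch, a job that fails partway through loses only the steps since its last saved checkpoint. Calling train again with the same checkpoint directory, as a rescheduled job would, continues from that checkpoint instead of from step 0.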

Required Components

  • Volcano
  • Ascend Operator
  • Ascend Device Plugin
  • Ascend Docker Runtime
  • NodeD
  • ClusterD
  • TaskD
  • (Optional) MindIO ACP
  • (Optional) MindIO TFT

Instructions

  1. Refer to Installation and Deployment for component installation.
  2. Refer to Resumable Training for feature usage.
  3. TaskD must be installed in a container. For details, see Image Creation.
  4. For details about MindIO ACP and its installation procedure, see Checkpoint Saving, Loading, and Optimization.
  5. For details about MindIO TFT and its installation procedure, see Fault Recovery Acceleration.