Elastic Training
This section describes the elastic training feature of Resilience Controller, which has reached the end of life. Related documents will be deleted on the 30th of September, 2026. For details about the latest elastic training capabilities, see Elastic Training.
Function Highlights
If a training node is faulty, the cluster scheduling components isolate the faulty node and reset the number of job replicas based on the preset job scale and the number of available nodes in the current cluster. This allows for rescheduling and retraining (note that script adaptation is required).
Required Components
- Ascend Device Plugin
- Ascend Docker Runtime
- Ascend Operator
- Volcano
- NodeD
- Resilience Controller
- ClusterD
Instructions
- Refer to Installation and Deployment for component installation.
- Refer to Elastic Training for feature usage.
Parent topic: Basic Scheduling