Elastic Training

This section describes the elastic training feature of Resilience Controller, which has reached the end of life. Related documents will be deleted on the 30th of September, 2026. For details about the latest elastic training capabilities, see Elastic Training.

Function Highlights

If a training node is faulty, the cluster scheduling components isolate the faulty node and reset the number of job replicas based on the preset job scale and the number of available nodes in the current cluster. This allows for rescheduling and retraining (note that script adaptation is required).

Required Components

Ascend Device Plugin
Ascend Docker Runtime
Ascend Operator
Volcano
NodeD
Resilience Controller
ClusterD

Instructions

Refer to Installation and Deployment for component installation.
Refer to Elastic Training for feature usage.

Parent topic: Basic Scheduling