Resilience Controller

Resilience Controller has reached end of life (EOL), and its documentation will be removed on September 30, 2026. For details about the latest elastic training capabilities, see Elastic Training.

Application Scenario

If a training job encounters a fault and there are not enough healthy resources to replace the faulty ones, you can enable dynamic scale-in so that the job keeps running on fewer resources. Once sufficient resources become available, dynamic scale-out can restore the training job. Resilience Controller provides this dynamic scaling while training jobs are running.
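The scenario above can be sketched as a small decision function. This is an illustrative sketch only, not the Resilience Controller implementation: the function name `choose_action`, its parameters, and the action strings are hypothetical, and resources are simplified to plain NPU counts.

```python
def choose_action(required_npus: int, healthy_in_job: int, spare_healthy: int) -> str:
    """Pick an action for a training job (hypothetical helper).

    required_npus:  NPUs the job originally requested
    healthy_in_job: NPUs currently held by the job that are still healthy
    spare_healthy:  healthy NPUs in the cluster that the job could take over
    """
    missing = required_npus - healthy_in_job
    if missing <= 0:
        # Nothing is faulty, so the job already runs at full size.
        return "none"
    if spare_healthy >= missing:
        # Enough healthy resources exist to replace the faulty ones
        # (or to scale a previously scaled-in job back out).
        return "restore"
    # Not enough healthy resources: keep training on fewer NPUs.
    return "scale-in"


if __name__ == "__main__":
    print(choose_action(required_npus=8, healthy_in_job=6, spare_healthy=1))
```

In this sketch, "restore" covers both replacing faulty resources and scaling a reduced job back out, because the text above describes both as returning the job to normal once enough resources are available.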

Component Function

Resilience Controller provides elastic scale-in for training jobs. When hardware used by a training job becomes faulty, the faulty hardware is removed from the job so that training can continue.

Upstream and Downstream Dependencies

Resilience Controller is a Kubernetes plugin and must be installed in the Kubernetes cluster. It supports only Volcano Jobs, so Volcano must also be installed in the cluster. At run time, Resilience Controller interacts only with Kubernetes, as shown in the following figure.

Figure 1 Upstream and downstream dependencies

The MindCluster cluster scheduling components write information such as NPU device information, node status, and scheduling configuration to ConfigMaps through Kubernetes.
Resilience Controller reads the NodeInfo field in the ConfigMaps whose names start with mindx-dl-nodeinfo- in the mindx-dl namespace to obtain the node heartbeat status.
Resilience Controller reads the DeviceInfoCfg field in the ConfigMaps whose names start with mindx-dl-deviceinfo- in the kube-system namespace to obtain the NPU health status.
Resilience Controller reads the grace-over-time field in the ConfigMap named volcano-scheduler in the volcano-system namespace to obtain the timeout for graceful pod deletion during rescheduling.
Resilience Controller treats all nodes in the cluster that carry the label nodeDEnable=on as the scheduling resource pool.
Resilience Controller obtains all Volcano Job pods in the cluster and reads huawei.com/AscendReal to obtain the NPU list used by the pods.
Resilience Controller reads a Volcano Job and obtains fields such as fault-scheduling, elastic-scheduling, minReplicas, and phase to determine whether the Volcano Job supports elastic training.
When a device or node is faulty, Resilience Controller creates a Volcano Job with half of the required NPU resources based on the number of replicas of the original Volcano Job and the cluster resources.
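
The interactions listed above can be illustrated with a minimal sketch that operates on plain dictionaries shaped like Kubernetes objects. It is not the Resilience Controller source code: the function names are hypothetical, no Kubernetes client is used, and the rule that the halved request must fit into the currently available NPUs is an assumption about how "cluster resources" are taken into account.

```python
from typing import Any, Dict, List, Optional

# Values taken from the description above.
NODE_POOL_LABEL = ("nodeDEnable", "on")
NODEINFO_NS, NODEINFO_PREFIX = "mindx-dl", "mindx-dl-nodeinfo-"
DEVICEINFO_NS, DEVICEINFO_PREFIX = "kube-system", "mindx-dl-deviceinfo-"


def scheduling_pool(nodes: List[Dict[str, Any]]) -> List[str]:
    """Return the names of nodes labeled nodeDEnable=on."""
    key, value = NODE_POOL_LABEL
    return [
        n["metadata"]["name"]
        for n in nodes
        if (n["metadata"].get("labels") or {}).get(key) == value
    ]


def configmaps_with_prefix(
    configmaps: List[Dict[str, Any]], namespace: str, prefix: str
) -> List[Dict[str, Any]]:
    """Return ConfigMaps in `namespace` whose names start with `prefix`."""
    return [
        cm
        for cm in configmaps
        if cm["metadata"].get("namespace") == namespace
        and cm["metadata"]["name"].startswith(prefix)
    ]


def scaled_in_npus(required_npus: int, available_npus: int) -> Optional[int]:
    """Halve the NPU request of the original job.

    Assumption: returns None when the request cannot be halved or when
    even the halved request does not fit into the available NPUs.
    """
    half = required_npus // 2
    if half < 1 or half > available_npus:
        return None
    return half


if __name__ == "__main__":
    nodes = [
        {"metadata": {"name": "node-a", "labels": {"nodeDEnable": "on"}}},
        {"metadata": {"name": "node-b", "labels": {}}},
    ]
    print(scheduling_pool(nodes))
    print(scaled_in_npus(required_npus=16, available_npus=8))
```

In a real cluster, the node and ConfigMap dictionaries would come from the Kubernetes API. They are hard-coded here so that the sketch runs without cluster access.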
