Before You Start

This section describes the elastic training feature of Resilience Controller, which has reached its end of life. The related documents will be removed on September 30, 2026. For details about the latest elastic training capabilities, see Elastic Training.

If a hardware fault occurs and no backup device is available, the cluster scheduling components isolate the faulty node, reset the number of job replicas based on the preset job scale and the number of available nodes in the current cluster, and perform rescheduling and retraining (script adaptation is required).
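
The replica reset described above can be pictured with a small sketch. The rule used here (one replica per healthy node, capped at the preset job scale and bounded below by a minimum) is an assumption made for illustration; the source does not state the exact rule that Resilience Controller applies. The names `reset_replicas`, `preset_replicas`, and `min_replicas` are hypothetical.

```python
from typing import Iterable, Optional, Set


def reset_replicas(
    preset_replicas: int,
    cluster_nodes: Iterable[str],
    faulty_nodes: Set[str],
    min_replicas: int = 1,
) -> Optional[int]:
    """Return a replica count for the rescheduled job, or None if it cannot run.

    Hypothetical rule (an assumption, not the documented algorithm):
    isolate the faulty nodes, then use one replica per healthy node,
    capped at the preset job scale.
    """
    # Isolate the faulty nodes: only healthy nodes remain schedulable.
    healthy = [node for node in cluster_nodes if node not in faulty_nodes]
    replicas = min(preset_replicas, len(healthy))
    # Below the minimum scale the job cannot be rescheduled.
    return replicas if replicas >= min_replicas else None


nodes = ["node-1", "node-2", "node-3", "node-4"]
print(reset_replicas(4, nodes, {"node-3"}))  # 3 healthy nodes, so prints 3
```

Because the replica count can change after a fault, the training script has to cope with a different number of workers on restart, which is why script adaptation is required.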

Prerequisites

Ensure that a corresponding storage scheme has been configured in the environment. For example, to use the network file system (NFS), perform operations described in Installing NFS.
Depending on the actual environment, directory isolation may be required for the NFS. The NFS random read/write performance must be sufficient to save the entire checkpoint file within 15 minutes. A professional storage server is recommended. The figure below illustrates NFS performance requirements.
To use the elastic training feature in the CLI scenario, ensure that the following components have been installed.
- Ascend Device Plugin
- Ascend Docker Runtime
- Volcano (Only Volcano can be used as the scheduler to allow for elastic training.)
- Ascend Operator
- NodeD
- Resilience Controller
- ClusterD
If these components are not installed, install them by referring to Installation and Deployment.
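
For the storage prerequisite above, the 15-minute checkpoint window translates into a minimum sustained write throughput. The sketch below is simple arithmetic under stated assumptions: the function name `min_write_throughput_mb_s` and the 90 GB checkpoint size are hypothetical, 1 GB is taken as 1024 MB, and the result is only a lower bound because real checkpoint writes involve random I/O and concurrent writers.

```python
def min_write_throughput_mb_s(checkpoint_size_gb: float,
                              window_minutes: float = 15.0) -> float:
    """Minimum sustained write throughput (MB/s) needed to save a checkpoint
    of the given size within the time window (15 minutes by default)."""
    if checkpoint_size_gb <= 0 or window_minutes <= 0:
        raise ValueError("size and window must be positive")
    size_mb = checkpoint_size_gb * 1024       # assumption: 1 GB = 1024 MB
    return size_mb / (window_minutes * 60)    # window converted to seconds


# Hypothetical 90 GB checkpoint: 90 * 1024 / 900 = 102.4 MB/s
print(f"{min_write_throughput_mb_s(90):.1f} MB/s")
```

Comparing this figure with the measured write throughput of the NFS mount gives a quick indication of whether the storage can meet the requirement.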

Usage Modes

The elastic training feature can be used in either of the following modes:

- Use on the CLI: Install cluster scheduling components and use elastic training on the CLI.
- Use after integration: Integrate the cluster scheduling components into an existing third-party AI platform or an AI platform developed based on the cluster scheduling components.

Instructions

- Resource monitoring can be used together with all features in training scenarios.
- Multiple training jobs can run in a cluster at the same time, and each job can use different features.
- If a training node managed by the cluster scheduling components becomes faulty (for example, a network or processor fault on a node where Ascend AI processors are installed and NodeD is enabled), the cluster scheduling components isolate the faulty node and reset the number of job replicas based on the preset job scale and the number of available nodes in the current cluster. The job is then rescheduled and retrained (script adaptation is required).
- The rescheduling function is implemented by Kubernetes and Volcano or other schedulers.

For details, see Table 1.

**Table 1** Instructions
Scenario	Description
Environment requirements	Ensure that the system time is synchronized across all nodes in the Kubernetes cluster. Otherwise, the program may make incorrect determinations.
Environment requirements	It is recommended that the IP address used to check the connectivity between NPUs be set to the IP address of the router.
Troubleshooting	If a fault occurs during training on a single server with multiple processors, recovery preferentially uses the original job specifications, and the recovered job specifications follow the eight-processor, four-processor, dual-processor, or single-processor recovery policy.
Troubleshooting	While Resilience Controller is rescheduling a job, it does not handle any new fault of that job.
Troubleshooting	If cluster resources are limited and multiple jobs are faulty at the same time, rescheduling can still be triggered. As a result, jobs may remain in the Pending state due to insufficient resources.
Feature description	This feature does not apply to virtual instances.
Feature description	Currently, this feature supports only distributed training vcjobs that use data parallelism or hybrid parallelism across servers and processors.
Feature description	This feature supports only device fault detection and server network fault detection. Device faults include the service re-execution, processor hot reset, and processor isolation errors reported by the DCMI. Server network faults include device network faults detected by hccn_tool. The detection of server network faults depends on the node status reporting mechanism of NodeD. If NodeD is installed incorrectly or the network between nodes is disconnected, fault detection is affected.
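
The single-server recovery rule in Table 1 can be read as: try the original job specification first, then fall back through the eight-, four-, dual-, and single-processor sizes. The sketch below is one interpretation of that rule, not the Resilience Controller implementation. The function name `pick_recovery_size` is hypothetical, and choosing the largest listed size that fits the healthy processors is an assumption that the table does not state.

```python
from typing import Optional

# Processor counts named in the recovery policy, largest first.
ALLOWED_SIZES = (8, 4, 2, 1)


def pick_recovery_size(original: int, healthy: int) -> Optional[int]:
    """Pick the processor count for the recovered job on a single server.

    original: processor count in the original job specification.
    healthy:  processors still available after the fault.
    Returns None when no listed size fits.
    """
    if original <= 0:
        raise ValueError("original must be positive")
    # Prefer the original job specification when it still fits.
    if original <= healthy:
        return original
    # Assumed fallback: the largest smaller size that fits.
    for size in ALLOWED_SIZES:
        if size < original and size <= healthy:
            return size
    return None


print(pick_recovery_size(8, 8))  # 8: original specification is kept
print(pick_recovery_size(8, 7))  # 4: one faulty processor, fall back to four
print(pick_recovery_size(8, 0))  # None: nothing left to schedule on
```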

Supported Products

Elastic training is supported by Atlas 800 training servers.

Usage Process

For details about how to use elastic training on the CLI, see Figure 1.

Figure 1 Usage process
