Elastic Agent
Elastic Agent has reached end of life (EOL), and its documentation will be removed on December 30, 2026. The process-level recovery capability will be provided by TaskD.
Application Scenario
Software and hardware faults can occur while a foundation model is being trained, and they can interrupt training jobs. Elastic Agent is a binary package deployed on compute nodes that manages training jobs on Ascend devices.
Component Function
- Manages training processes on Ascend devices in the PyTorch framework: it stops or restarts training processes when a software or hardware fault occurs.
- Connects to the control plane of the Kubernetes cluster and manages training jobs through it.
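The stop-or-restart behavior described above can be illustrated with a minimal sketch. It is not Elastic Agent's implementation: the function name `run_with_restarts` and the `max_restarts` parameter are hypothetical, and the sketch assumes that a fault shows up as a non-zero exit code of the training process.

```python
import subprocess
import sys


def run_with_restarts(cmd, max_restarts=3):
    """Run cmd and relaunch it after a non-zero exit, up to max_restarts times.

    Hypothetical helper for illustration only.
    Returns (launch_count, last_exit_code).
    """
    launches = 0
    while True:
        launches += 1
        # Block until the child process exits and collect its exit code.
        exit_code = subprocess.call(cmd)
        # Stop on success, or once the restart budget is used up.
        if exit_code == 0 or launches > max_restarts:
            return launches, exit_code


if __name__ == "__main__":
    # Stand-in for a training process that always fails.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(run_with_restarts(failing, max_restarts=2))
```

A real agent would also need to tell recoverable faults from unrecoverable ones and coordinate restarts across nodes; the sketch covers only the local restart loop.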
Upstream and Downstream Dependencies
Figure 1 Upstream and downstream dependencies


- The MindCluster cluster scheduling components use Kubernetes to record details such as the device status and training job status in a ConfigMap (reset-config-<job name>) and map that information into the container.
- Elastic Agent reads the device status and training job status for the current training container from the ConfigMap.
- Elastic Agent connects to the Kubernetes cluster control plane to manage training processes.
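As a rough illustration of the flow above, the following sketch reads status information from a file mapped into the container. Everything beyond the `reset-config-` naming pattern is an assumption: the file is assumed to contain JSON, the `devices` and `status` fields are a hypothetical schema, and the helper names are invented for this example. The real ConfigMap layout may differ.

```python
import json
from pathlib import Path


def config_map_name(job_name):
    # Naming pattern taken from the text: reset-config-<job name>.
    return f"reset-config-{job_name}"


def load_status(path):
    """Parse the mapped status file; return None if it is missing or not valid JSON."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def faulty_devices(status):
    """Return the ids of devices whose status is not "healthy".

    Hypothetical schema: {"devices": [{"id": 0, "status": "healthy"}, ...]}
    """
    if not status:
        return []
    return [d["id"] for d in status.get("devices", []) if d.get("status") != "healthy"]
```

An agent built along these lines would poll the file and stop or restart training processes whenever `faulty_devices` returns a non-empty list.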