Elastic Agent

Elastic Agent has reached end of life (EOL), and its documentation will be removed on December 30, 2026. TaskD will provide the process-level recovery capability instead.

Application Scenario

Software and hardware faults can occur while a foundation model is being trained, interrupting training jobs. To handle such faults, Elastic Agent is provided as a binary package that is deployed on compute nodes and manages training jobs on Ascend devices.

Component Function

  • Manage Ascend device processes in the PyTorch framework. That is, stop or restart training processes when a software or hardware fault occurs.
  • Connect to the Kubernetes cluster control plane and manage training jobs through it.
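
The stop-or-restart behavior described above can be illustrated with a minimal sketch. This is not the Elastic Agent implementation; the function name `supervise` and the `max_restarts` parameter are hypothetical, and a non-zero exit code is assumed to stand in for a software or hardware fault.

```python
import subprocess
import sys
from typing import List


def supervise(cmd: List[str], max_restarts: int = 3) -> int:
    """Run cmd and restart it after a non-zero exit, up to max_restarts times.

    Hypothetical helper for illustration only. Returns the number of
    restarts performed before the process exited normally.
    """
    restarts = 0
    while True:
        # Launch the training process and block until it exits.
        exit_code = subprocess.call(cmd)
        if exit_code == 0:
            return restarts  # training finished normally
        if restarts >= max_restarts:
            # Give up: the restart budget is exhausted.
            raise RuntimeError(
                f"process failed with exit code {exit_code} "
                f"after {restarts} restarts"
            )
        restarts += 1


if __name__ == "__main__":
    # A trivial command that exits cleanly stands in for a training script.
    print(supervise([sys.executable, "-c", "pass"]))
```

A real agent would also need to stop sibling processes on the same node and coordinate with the cluster control plane before restarting; the sketch covers only the local retry loop.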

Upstream and Downstream Dependencies

Figure 1 Upstream and downstream dependencies
  • The MindCluster cluster scheduling components record details such as the device status and training job status in a ConfigMap (reset-config-<Job name>) through Kubernetes and map the information to the container.
  • From this ConfigMap, Elastic Agent obtains information such as the device status and training job status used by the current training container.
  • Elastic Agent connects to the Kubernetes cluster control plane to manage training processes.
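
The data flow above (status written to a ConfigMap, mapped into the container, then read by the agent) can be sketched as follows. The JSON layout, the field names `devices`, `rank`, and `status`, and both function names are assumptions made for illustration; the actual content of the reset-config ConfigMap is not described here.

```python
import json
from typing import List


def faulty_ranks(config_text: str) -> List[int]:
    """Return the ranks whose status is not "healthy".

    config_text is the content of a status file mapped into the container.
    The schema used here is hypothetical.
    """
    data = json.loads(config_text)
    return [
        device["rank"]
        for device in data.get("devices", [])
        if device.get("status") != "healthy"
    ]


def decide_action(config_text: str) -> str:
    """Map the reported device status to an action for the training processes."""
    return "restart" if faulty_ranks(config_text) else "continue"


if __name__ == "__main__":
    sample = json.dumps(
        {
            "devices": [
                {"rank": 0, "status": "healthy"},
                {"rank": 1, "status": "fault"},
            ]
        }
    )
    print(faulty_ranks(sample), decide_action(sample))
```

Keeping the parsing step separate from the decision step means the policy can be changed without changing how the mapped file is read.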