Overall Architecture

In the event of a training job fault within a Kubernetes cluster, resumable training allows the system to detect the fault, handle or isolate the faulty resources, reassign resources according to the training job requirements, and restart the training job using checkpoints saved periodically or prior to the fault, thereby reducing downtime.
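
The checkpoint-based restart described above can be illustrated with a minimal sketch. It is a toy loop rather than the MindCluster implementation: the function names (save_checkpoint, load_checkpoint, train), the JSON checkpoint format, and the simulated fault are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write the step and state to a temp file, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)  # rename so readers never see a half-written file


def load_checkpoint(path):
    """Return (step, state), or (0, 0.0) if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, 0.0
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path, total_steps, save_every, fail_at=None):
    """Toy training loop that resumes from the last saved checkpoint.

    fail_at (hypothetical) simulates a fault by raising at that step.
    """
    step, state = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated fault at step {step}")
        state += 1.0  # stand-in for one training iteration
        step += 1
        if step % save_every == 0:
            save_checkpoint(path, step, state)
    return step, state


# Demo: a fault at step 7 loses only the work done since the checkpoint at step 6.
ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt_path, total_steps=10, save_every=3, fail_at=7)
except RuntimeError:
    pass
resumed_step, resumed_state = load_checkpoint(ckpt_path)
final_step, final_state = train(ckpt_path, total_steps=10, save_every=3)
```

In this sketch the restarted job picks up at step 6 instead of step 0, which is the downtime reduction the feature aims for. A shorter save interval reduces the lost iterations at the cost of more frequent writes.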

Overall Architecture of Resumable Training

Figure 1 shows the architecture of resumable training.

Figure 1 Overall architecture

The functions of each component are as follows:

Ascend Device Plugin: fault detection component, which manages NPU resources, reports NPU faults and NPU network faults, and performs NPU hot resets.
NodeD: fault detection component, which reports node health status, faults in node hardware (including CPUs, memory, and processors), UnifiedBus network faults, and DPC shared storage faults.
Volcano: fault handling component, which provides the capability of rescheduling faulty jobs.
Ascend Operator: generates environment variables for distributed training jobs running in different AI frameworks and provides the RankTable information required for collective communication in static networking.
ClusterD: obtains all data reported by Ascend Device Plugin and NodeD in a cluster, collates the data, and sends it to Volcano.
TaskD: communicates with the control center of the Kubernetes training cluster to complete training recovery, monitors the status of training and inference jobs on Ascend devices, and controls the training status.
MindIO TTP: verifies the integrity and consistency of intermediate status data following a fault during foundation model training, creates a dying gasp checkpoint, and utilizes the checkpoint to restore training, minimizing the iteration loss caused by the fault.
Training model code: must be adapted to support resumable training.
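
To give a feel for the kind of information Ascend Operator supplies for collective communication in static networking, the sketch below assigns consecutive global rank IDs to processors across pods. The field names and layout (pod_name, local_id, device_ip, rank_id, world_size) are hypothetical and simplified. They are not the actual RankTable format, and the IP addresses are placeholders.

```python
def build_rank_table(pods):
    """Assign a global rank ID to every processor, pod by pod.

    pods: list of (pod_name, [device_ip, ...]) tuples in a fixed order.
    Returns a plain dict using an illustrative, non-official layout.
    """
    table = {"pod_count": len(pods), "pods": []}
    rank = 0
    for pod_name, device_ips in pods:
        devices = []
        for local_id, ip in enumerate(device_ips):
            devices.append({"local_id": local_id, "device_ip": ip, "rank_id": rank})
            rank += 1
        table["pods"].append({"pod_name": pod_name, "devices": devices})
    table["world_size"] = rank  # total number of ranks across all pods
    return table


# Demo: two pods with two processors each (placeholder addresses).
rank_table = build_rank_table([
    ("worker-0", ["192.0.2.10", "192.0.2.11"]),
    ("worker-1", ["192.0.2.20", "192.0.2.21"]),
])
```

After a job is rescheduled to different devices, such a table has to be regenerated, because the device addresses behind each rank may have changed.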

E2E Process

Resumable training is triggered by a fault. Training resumes after three phases: fault detection, fault recovery, and training recovery.

Figure 2 E2E process

The description of each step is as follows:

1. The device status is queried in polling mode. Ascend Device Plugin obtains the NPU status from the DCMI, as well as the information reported by NodeD, including node health status, node hardware faults, and UnifiedBus network faults. ClusterD collates all fault information, determines the final fault status, and reports it to Volcano.
2. After a node or processor fault is detected, the faulty node or processor is isolated to prevent jobs from being scheduled to it again.
3. The training process is stopped and the training container exits.
4. For a node or processor fault, the system reschedules the training job to a healthy device and restarts the training container. During rescheduling, nodes that did not cause the rescheduling are preferentially selected.
5. The training script is run to restart the training process.
6. O&M personnel determine whether a hot reset can be performed based on the node or processor fault type.
7. A hot reset is performed to restore the device to a healthy state.
8. The recovered device is automatically added to the cluster again.
9. For an unrecoverable device, the O&M monitoring system reports the fault.
10. The unrecoverable device is manually repaired or replaced offline.

If resumable training is triggered by a service plane fault, only steps 3 to 5 are required.
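
The way the fault type narrows the set of steps can be sketched as follows. The names FaultPlane, job_recovery_steps, and device_recovery_steps are hypothetical. Splitting steps 6 to 10 into a hot-reset path (steps 7 and 8) and a manual-repair path (steps 9 and 10) is an interpretation of the step descriptions, not something the process definition states explicitly.

```python
from enum import Enum


class FaultPlane(Enum):
    SERVICE = "service plane"
    HARDWARE = "node or processor"


def job_recovery_steps(plane):
    """Return the step numbers needed to get the job training again."""
    if plane is FaultPlane.SERVICE:
        # Service plane fault: only stop, reschedule/restart, and rerun the script.
        return [3, 4, 5]
    # Node or processor fault: detection and isolation come first.
    return [1, 2, 3, 4, 5]


def device_recovery_steps(hot_reset_possible):
    """Return the step numbers for bringing a faulty device back (assumed split)."""
    if hot_reset_possible:
        return [6, 7, 8]   # decide, hot reset, device rejoins the cluster
    return [6, 9, 10]      # decide, fault is reported, manual repair or replacement


# Demo values for the two fault planes and the two device-recovery outcomes.
service_steps = job_recovery_steps(FaultPlane.SERVICE)
hardware_steps = job_recovery_steps(FaultPlane.HARDWARE)
reset_steps = device_recovery_steps(hot_reset_possible=True)
repair_steps = device_recovery_steps(hot_reset_possible=False)
```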

Component Call Process

Figure 3 shows the component call process for resumable training.

Figure 3 Architectures and principles

The description of each step is as follows:

1. Ascend Device Plugin detects and reports faults and health status.
2. NodeD updates the node hardware fault information so that Volcano can accurately determine the node fault type.
3. ClusterD determines whether the processor is healthy based on the information provided by Ascend Device Plugin.
4. ClusterD obtains the fault information reported by NodeD.
5. ClusterD summarizes the collected processor and node information and saves it to a ConfigMap.
6. Volcano obtains the device information of the entire cluster. If a device used by the job has fault information, Volcano schedules the job to other healthy devices.
7. Volcano selects nodes and processors based on affinity rules and, after creating the pods, schedules the training jobs to nodes that meet the requirements.
8. Ascend Device Plugin allocates processors based on the processor IDs that Volcano specifies on the pod and writes the processor IP addresses to the containers.
9. Before a container starts, Ascend Docker Runtime automatically mounts NPU-related devices as well as files and directories such as driver .so files.
10. Ascend Operator writes the environment variables required by training jobs (such as collective communication and training configurations) into the containers. It also obtains the processor information of the containers running training jobs and automatically generates the collective communication information required by distributed training jobs.
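
The reporting and scheduling part of this flow (collecting per-node reports, summarizing them into a cluster view, and picking healthy devices) can be sketched as below. This is a simplified model, not the actual ClusterD or Volcano logic: the NodeReport type, its fields, the summarize and schedule functions, and the "first node by name" selection rule are all assumptions. Affinity rules are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class NodeReport:
    name: str
    node_healthy: bool                                    # stands in for NodeD's report
    faulty_processors: set = field(default_factory=set)   # stands in for Ascend Device Plugin's report
    total_processors: int = 8


def summarize(reports):
    """ClusterD-like step: merge reports into {node name: healthy processor IDs}."""
    view = {}
    for r in reports:
        if not r.node_healthy:
            view[r.name] = []  # a faulty node offers no usable processors
        else:
            view[r.name] = [p for p in range(r.total_processors)
                            if p not in r.faulty_processors]
    return view


def schedule(view, processors_needed):
    """Volcano-like step: pick the first node (by name) with enough healthy processors."""
    for node in sorted(view):
        if len(view[node]) >= processors_needed:
            return node, view[node][:processors_needed]
    return None  # no node can currently host the job


# Demo: node-a has one faulty processor, node-b is unhealthy, node-c is fully healthy.
cluster_view = summarize([
    NodeReport("node-a", node_healthy=True, faulty_processors={3}),
    NodeReport("node-b", node_healthy=False),
    NodeReport("node-c", node_healthy=True),
])
full_node_job = schedule(cluster_view, processors_needed=8)
half_node_job = schedule(cluster_view, processors_needed=4)
```

In the demo, a job that needs a full node of eight processors can only land on node-c, while a four-processor job still fits on node-a by skipping the faulty processor.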

Conditions of Use

Before enabling resumable training, install the required components by referring to Required Component.
Resumable training is an advanced feature of the MindCluster cluster scheduling components. Before enabling it, make preparations by referring to Preparing Kubernetes and Shared Storage.
