Application Scenario

As neural networks and datasets continue to grow in scale, a single server is no longer sufficient to handle large training workloads. To meet this demand, multiple servers equipped with additional AI processors are deployed to form high-density clusters capable of long-term distributed training. However, as the number of hardware devices increases, so does the likelihood of faults, leading to more frequent training interruptions. Ensuring high cluster availability has therefore become a critical challenge.

Reducing the cost of fault recovery is essential to improving availability. At present, recovery often requires manual inspection of hardware and software exceptions, a process that is both labor-intensive and time-consuming. Isolating faulty devices and restarting training jobs further delays progress, diminishing overall efficiency.

Resumable training addresses these issues with the following functions, which handle faults automatically during training, shorten recovery time, and improve cluster stability and availability.

Key Functions

Function	Description	Configuration Procedure
Fault Detection	Detects more than 20 types of software faults and more than 90 types of hardware faults in real time in training scenarios. For function details and principles, see Fault Detection.	(Optional) Configuring Fault Detection Levels
Fault Handling	Automatically isolates faulty devices without manual intervention. For function details and principles, see Fault Handling.	Configuring Fault Handling Policies
Training Recovery	Restores the training state at the finest granularity allowed by the custom training recovery policy, minimizing the time needed to restart training. For function details and principles, see Training Recovery.	Configuring Training Recovery
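
The general idea behind training recovery, resuming from the most recently saved state instead of restarting from scratch, can be illustrated with a minimal sketch. This is not the product's implementation or API: the function names (save_checkpoint, load_checkpoint, train), the JSON checkpoint file, and the fail_at parameter used to simulate a fault are all hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write the current step and state to disk via a temporary file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, state) from the checkpoint, or a fresh start if none exists."""
    if not os.path.exists(path):
        return 0, {"loss_sum": 0.0}
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path, total_steps, save_every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint.

    fail_at (hypothetical) simulates a hardware fault by raising at that step.
    """
    step, state = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated fault at step {step}")
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for one training step
        step += 1
        if step % save_every == 0:
            save_checkpoint(path, step, state)
    return step, state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        train(ckpt, total_steps=10, save_every=2, fail_at=5)
    except RuntimeError as err:
        print("interrupted:", err)
    print("resuming from step", load_checkpoint(ckpt)[0])
    print("finished at step", train(ckpt, total_steps=10, save_every=2)[0])
```

In this sketch, the work lost to a fault is bounded by the checkpoint interval (save_every), which is the trade-off a recovery policy typically tunes: more frequent saves cost more I/O but shorten the rerun after a fault.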

Application Scenario

Scenario Type	Main Service	Service Benefit
AI training	Monitors compute, network, and storage device resources, checks the health of AI environments, and diagnoses AI job faults.	Monitor the overall cluster environment resources. Improve the job success rate of AI training services. Reduce the time for handling and rectifying AI job training faults.

Small-scale model jobs have short training durations (less than 1 hour) and seldom encounter hardware faults, so resumable training is not recommended for them.
This feature does not apply to computing power virtualization scenarios.
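
The reasoning behind the first note can be made concrete with a rough estimate. The sketch below assumes that device faults are independent and occur at a constant rate, so the probability of at least one fault during a job is 1 - exp(-N * T / MTBF). The function name and the mean time between failures (MTBF) figure of 50,000 hours per device are illustrative assumptions, not measured values for any hardware.

```python
import math


def fault_probability(num_devices, job_hours, device_mtbf_hours):
    """Probability of at least one device fault during a job.

    Assumes independent devices with exponentially distributed time to failure.
    """
    expected_faults = num_devices * job_hours / device_mtbf_hours
    return 1.0 - math.exp(-expected_faults)


if __name__ == "__main__":
    ASSUMED_MTBF_HOURS = 50_000  # hypothetical per-device figure, for illustration only
    small_job = fault_probability(8, 1, ASSUMED_MTBF_HOURS)
    large_job = fault_probability(1024, 720, ASSUMED_MTBF_HOURS)
    print(f"8 devices, 1 hour: {small_job:.6f}")
    print(f"1024 devices, 30 days: {large_job:.6f}")
```

Under these assumptions, a one-hour job on a handful of devices is very unlikely to be interrupted, while a month-long job on a large cluster almost certainly is, which is where automated recovery pays off.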
