Application Scenario
As neural networks and datasets continue to grow in scale, a single server is no longer sufficient to handle large training workloads. To meet this demand, multiple servers equipped with additional AI processors are deployed to form high-density clusters capable of long-term distributed training. However, as the number of hardware devices increases, so does the likelihood of faults, leading to more frequent training interruptions. Ensuring high cluster availability has therefore become a critical challenge.
Reducing the cost of fault recovery is essential to improving availability. At present, recovery often requires manual inspection of hardware and software exceptions, a process that is both labor-intensive and time-consuming. Isolating faulty devices and restarting training jobs further delays progress, diminishing overall efficiency.
Resumable training addresses these issues with the functions described below, which automatically handle faults during training, minimize recovery time, and significantly improve cluster stability and availability.
Key Functions
| Function | Description | Configuration Procedure |
|---|---|---|
| Fault Detection | Detects more than 20 software faults and 90 hardware faults in real time during training. For function details and principles, see Fault Detection. | |
| Fault Handling | Automatically isolates faulty devices without manual intervention. For function details and principles, see Fault Handling. | |
| Training Recovery | Restores the training state at the finest granularity allowed by the custom training recovery policy, minimizing the time needed to restart training. For function details and principles, see Training Recovery. | |
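
The three functions above form a detect, isolate, and resume cycle. The following is a minimal sketch of that cycle under stated assumptions, not the product's actual implementation: the `DeviceFault` exception, the `run_with_recovery` supervisor, the device names, and the JSON checkpoint format are all hypothetical and exist only for illustration. A simulated fault interrupts training, the supervisor drops the faulty device, and training resumes from the last saved checkpoint instead of from step 0.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write the latest training state to disk via an atomic rename."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


class DeviceFault(Exception):
    """Hypothetical stand-in for a fault reported by fault detection."""

    def __init__(self, device):
        super().__init__(f"fault on {device}")
        self.device = device


def train(path, total_steps, faults, save_every=10):
    """Train up to total_steps, resuming from the checkpoint at `path`.

    `faults` maps a step number to the device that fails at that step.
    Each simulated fault is consumed once, so the retry can proceed.
    """
    step, state = load_checkpoint(path)
    while step < total_steps:
        if step in faults:
            raise DeviceFault(faults.pop(step))
        step += 1
        state = {"loss": 1.0 / step}  # placeholder for a real training step
        if step % save_every == 0:
            save_checkpoint(path, step, state)
    return step


def run_with_recovery(path, total_steps, devices, faults):
    """Supervisor loop: isolate the faulty device, then resume training."""
    healthy = list(devices)
    restarts = 0
    while True:
        try:
            final_step = train(path, total_steps, faults)
            return final_step, healthy, restarts
        except DeviceFault as fault:
            healthy.remove(fault.device)  # fault handling: isolate the device
            restarts += 1                 # training recovery: loop and resume


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    print(run_with_recovery(ckpt, 50, ["npu0", "npu1", "npu2"], {25: "npu1"}))
```

In this sketch the fault at step 25 costs only the five steps since the checkpoint at step 20. A shorter checkpoint interval reduces lost work at the price of more frequent writes, which is the trade-off a recovery policy would tune.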
Application Scenario
| Scenario Type | Main Service | Service Benefit |
|---|---|---|
| AI training | Monitors compute, network, and storage device resources, checks the health of AI environments, and diagnoses AI job faults. | |
- Small-scale model jobs have a short training duration (less than 1 hour) and rarely encounter hardware faults, so resumable training is not recommended for them.
- This feature does not apply to computing power virtualization scenarios.