Recovery Time (MindSpore)
This section describes the optimizations that shorten the recovery time of resumable training on MindSpore. They cover three areas: fault detection time, training rollback and checkpoint loading time, and cache building time.
Fault Detection Time
A parameter plane network fault in a cluster does not necessarily affect a training job, so the cluster scheduling components do not forcibly interrupt the job when one occurs. If the fault does affect a training job, the network timeout mechanism of collective communication is triggered, but the cluster scheduling components can detect the fault and trigger resumable training only after the default waiting period of 30 minutes. To solve this problem, MindSpore provides a watchdog fault detection function that determines whether training jobs are affected and reduces the fault detection time. For details, see Table 1.
Table 1 Watchdog fault detection

| Function | Watchdog fault detection |
|---|---|
| Function Highlights | When training starts, a monitoring thread starts with it and continuously checks for communication exceptions and task execution exceptions. When a fault is detected, an exception is thrown immediately, the training process is terminated, and rescheduling is triggered. |
| Instructions | Requires MindSpore 2.4 or later. |
| Key Operation | Watchdog fault detection is enabled by default in MindSpore and needs no manual configuration. To disable it, add the fields shown below this table to the model configuration file. |

    ...
    context:
      ascend_config:
        hccl_watchdog: False
    ...
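The monitoring-thread idea described in Table 1 can be illustrated with a small, self-contained Python sketch. This is not MindSpore's implementation: the `Watchdog` class, its `report` method, and the toy `train` loop are hypothetical names used only to show how a monitor thread can surface a fault within milliseconds instead of waiting for a long communication timeout.

```python
import queue
import threading
import time


class Watchdog:
    """Hypothetical monitor thread that surfaces faults without a long timeout."""

    def __init__(self, poll_interval=0.01):
        self.errors = queue.Queue()          # faults reported by communication/task code
        self.fault = None                    # first fault seen by the monitor
        self.stop_event = threading.Event()  # set when training must stop
        self._poll_interval = poll_interval
        self._thread = threading.Thread(target=self._monitor, daemon=True)

    def start(self):
        self._thread.start()

    def report(self, exc):
        """Called by (hypothetical) communication or task code when something fails."""
        self.errors.put(exc)

    def _monitor(self):
        # Poll for reported faults until asked to stop.
        while not self.stop_event.is_set():
            try:
                self.fault = self.errors.get(timeout=self._poll_interval)
            except queue.Empty:
                continue
            self.stop_event.set()  # signal the training loop to terminate


def train(watchdog, steps, fail_at=None):
    """Toy training loop; returns the number of completed steps."""
    watchdog.start()
    done = 0
    for step in range(steps):
        if watchdog.stop_event.is_set():
            break
        if step == fail_at:
            watchdog.report(RuntimeError("communication fault"))
            # Stand-in for a collective operation that would otherwise block.
            watchdog.stop_event.wait(timeout=1.0)
            continue
        time.sleep(0.001)  # stand-in for one training step
        done += 1
    if watchdog.fault is not None:
        # Fail fast so that a scheduler could reschedule the job.
        raise RuntimeError(f"watchdog detected fault: {watchdog.fault}")
    watchdog.stop_event.set()  # normal completion: stop the monitor thread
    return done
```

In this sketch the training loop exits as soon as the monitor sets `stop_event`, which mirrors the "throw an exception quickly, terminate, and reschedule" behavior described in the table.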
Training Rollback and Checkpoint Loading Time
- Asynchronous checkpoint saving: A training job periodically saves checkpoint files to record parameter information. After a fault is rectified, training rolls back to the most recently saved checkpoint file and resumes from there. Each checkpoint save costs some training time, so the saving interval is usually set large to maintain training efficiency. However, a larger saving interval means that more training progress is lost to rollback after each fault. To solve this problem, MindIO ACP is introduced to save checkpoints asynchronously. For details, see Table 2.
Table 2 Asynchronous checkpoint saving

| Function | Asynchronous checkpoint saving |
|---|---|
| Function Highlights | After checkpoints are obtained from the NPU, they are written to storage asynchronously. This minimizes the training time lost to each checkpoint save and the time each save takes, thereby reducing the training rollback time. |
| Instructions | Requires cluster scheduling components and MindIO components of version 6.0.RC2 or later. |
| Key Operation | For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization. |
- Efficient checkpoint recovery: During training rollback and recovery, checkpoints must be loaded from storage. Due to the large volume of checkpoint data, directly reading and loading checkpoints from storage takes considerable time. To solve this problem, MindIO ACP is introduced for efficient checkpoint recovery. For details, see Table 3.
Table 3 Efficient checkpoint recovery

| Function | Efficient checkpoint recovery |
|---|---|
| Function Highlights | The latest checkpoint is kept in memory. During fault recovery, the checkpoint can be read directly from memory, reducing the checkpoint reading time. |
| Instructions | Requires cluster scheduling components and MindIO components of version 6.0.RC2 or later. |
| Key Operation | For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization. |
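The two ideas above (writing checkpoints in the background, and keeping the latest checkpoint in memory for fast recovery) can be sketched in plain Python. This is only an illustration of the technique and does not reflect the MindIO ACP API or implementation: the `AsyncCheckpointer` class, its methods, and the `ckpt_<step>.pkl` file naming are all hypothetical.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Toy asynchronous checkpoint saver that keeps the latest snapshot in memory."""

    def __init__(self, directory):
        self.directory = directory
        self.latest = None   # (step, state) of the most recent snapshot, held in memory
        self._writer = None  # background thread writing the current snapshot

    def save(self, step, state):
        # Take the snapshot synchronously (stand-in for copying parameters off the device).
        snapshot = copy.deepcopy(state)
        self.wait()  # allow at most one write in flight
        self.latest = (step, snapshot)
        self._writer = threading.Thread(target=self._write, args=(step, snapshot))
        self._writer.start()  # training continues while the file is written

    def _write(self, step, snapshot):
        path = os.path.join(self.directory, f"ckpt_{step}.pkl")
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)  # atomic rename, so readers never see a partial file

    def wait(self):
        """Block until the in-flight write, if any, has finished."""
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def load_latest(self):
        # Fast path: recover from memory if the snapshot is still available.
        if self.latest is not None:
            return self.latest
        # Slow path: fall back to the newest checkpoint file in storage.
        steps = [int(name[5:-4]) for name in os.listdir(self.directory)
                 if name.startswith("ckpt_") and name.endswith(".pkl")]
        if not steps:
            return None
        step = max(steps)
        with open(os.path.join(self.directory, f"ckpt_{step}.pkl"), "rb") as f:
            return step, pickle.load(f)


if __name__ == "__main__":
    saver = AsyncCheckpointer(tempfile.mkdtemp())
    saver.save(100, {"weights": [0.1, 0.2]})
    saver.wait()
    print(saver.load_latest())
```

Because `save` returns as soon as the snapshot is taken, the training loop is blocked only for the copy and not for the storage write, which is what makes a shorter saving interval affordable.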
Cache Building Time
During resumable training, the computational graph must be built again, which takes a long time in foundation model scenarios. To solve this problem, MindSpore can save a graph compilation cache file during the first build. During fault recovery, the cache is read directly from storage to reduce the graph building time. For details, see Table 4.
Table 4 Graph building cache

| Function | Graph building cache |
|---|---|
| Function Highlights | During graph compilation, the cache file stored on the storage device is loaded to help reduce compilation time. |
| Instructions | Requires MindSpore 2.3.0 or later. |
| Key Operation | Add the environment variables shown below this table to the shell startup script for training (for example, train_start.sh). |

    export MS_COMPILER_CACHE_ENABLE=1   # Enable graph compilation cache.
    export MS_COMPILER_CACHE_PATH=xxx   # Set the graph compilation cache path.
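If the training script is started from a Python launcher rather than directly from the shell, the same two variables can be passed through the child process environment. The following is a minimal sketch under that assumption: `build_training_env` is a hypothetical helper, and the cache directory is a placeholder that should point to storage that remains available after the job is rescheduled.

```python
import os


def build_training_env(cache_dir, base_env=None):
    """Return a copy of the environment with the graph compilation cache enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["MS_COMPILER_CACHE_ENABLE"] = "1"           # enable the graph compilation cache
    env["MS_COMPILER_CACHE_PATH"] = str(cache_dir)  # directory that holds the cache files
    return env


# Example (not executed here): launch the startup script with the cache enabled.
# The cache path below is a placeholder.
# import subprocess
# subprocess.run(["bash", "train_start.sh"],
#                env=build_training_env("/path/to/cache"), check=True)
```

Building a copy of the environment, instead of mutating `os.environ`, keeps the setting scoped to the training process that is launched.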