Recovery Time (MindSpore)

This section describes optimizations that shorten the recovery time of resumable training on MindSpore. They target three phases: fault detection time, training rollback and checkpoint loading time, and cache building time.

Fault Detection Time

A parameter plane network fault in a cluster does not necessarily affect a training job, so the cluster scheduling components do not forcibly interrupt the job when one occurs. If the fault does affect a training job, the network timeout mechanism of collective communication is triggered, but the cluster scheduling components detect the fault and trigger resumable training only after the default waiting period of 30 minutes. To shorten this delay, MindSpore provides a watchdog fault detection function that determines whether training jobs are affected. For details, see Table 1.

Table 1 Description of watchdog fault detection

Function

Watchdog fault detection

Function Highlights

When training starts, a monitoring thread starts at the same time and continuously checks for communication exceptions and task execution exceptions. When a fault is detected, an exception is thrown immediately, the training process is terminated, and rescheduling is triggered.

Instructions

Only MindSpore 2.4 and later versions are supported.

Key Operation

Watchdog fault detection is enabled by default in MindSpore, so no manual configuration is needed. To disable it, set the hccl_watchdog field to False in the model configuration file, as follows.

...
context:
  ascend_config:
    hccl_watchdog: False    
...
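
The monitoring behavior described in Table 1 can be illustrated with a minimal Python sketch. This is not MindSpore's watchdog implementation: the Watchdog class, the check callable, and the train function are hypothetical names used only to show the idea of a background thread that polls for faults and lets the training loop fail fast instead of waiting for a communication timeout.

```python
import threading
import time


class Watchdog:
    """Hypothetical monitor: polls a health check and records the first failure."""

    def __init__(self, check, interval_s=0.01):
        self._check = check              # callable returning None or an error string
        self._interval_s = interval_s
        self._stop = threading.Event()
        self.failure = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            error = self._check()
            if error is not None:
                self.failure = error     # remember the fault for the training loop
                return
            self._stop.wait(self._interval_s)


def train(steps, watchdog, step_fn):
    """Run up to `steps` steps, aborting as soon as the watchdog reports a fault."""
    watchdog.start()
    try:
        for step in range(steps):
            if watchdog.failure is not None:
                raise RuntimeError(f"fault detected at step {step}: {watchdog.failure}")
            step_fn(step)
    finally:
        watchdog.stop()
```

In this sketch the fault is noticed within one polling interval plus one training step, which mirrors the goal of reducing detection time; in a real job the raised exception would terminate the process so that the scheduler can reschedule it.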

Training Rollback and Checkpoint Loading Time

  • Asynchronous checkpoint saving: A training job periodically saves checkpoint files to preserve parameter information. After a fault is rectified, training resumes from the most recently saved checkpoint file. Each checkpoint save costs some training time, so the saving interval is usually set large to keep training efficient. However, a larger saving interval means that more training progress is lost to rollback each time a fault occurs. To solve this problem, MindIO ACP is introduced to save checkpoints asynchronously. For details, see Table 2.
    Table 2 Asynchronous checkpoint saving

    Function

    Asynchronous checkpoint saving

    Function Highlights

    After checkpoints are obtained from the NPU, they are written to storage asynchronously. This minimizes the training time lost and the time spent on storage for each checkpoint save, thereby reducing the training rollback time.

    Instructions

    Only cluster scheduling components and MindIO components of 6.0.RC2 and later are supported.

    Key Operation

    For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization.

  • Efficient checkpoint recovery: During training rollback and recovery, checkpoints must be loaded from storage. Due to the large volume of checkpoint data, directly reading and loading checkpoints from storage takes considerable time. To solve this problem, MindIO ACP is introduced for efficient checkpoint recovery. For details, see Table 3.
    Table 3 Efficient checkpoint recovery

    Function

    Efficient checkpoint recovery

    Function Highlights

    The latest checkpoint is stored in memory. During fault recovery, the checkpoint can be read directly from memory, reducing the checkpoint reading time.

    Instructions

    Only cluster scheduling components and MindIO components of 6.0.RC2 and later are supported.

    Key Operation

    For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization.
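
The two ideas in Tables 2 and 3 can be sketched together in plain Python. This is an illustration under stated assumptions, not the MindIO ACP API: AsyncCheckpointer and its methods are hypothetical, pickle stands in for the real checkpoint format, and the in-memory copy here lives in the training process itself, whereas a real system would need it to survive a process failure.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot in memory, persist in a background thread."""

    def __init__(self, directory):
        self._path = os.path.join(directory, "latest.ckpt")
        self._latest = None              # (step, state) kept in memory for fast recovery
        self._writer = None
        os.makedirs(directory, exist_ok=True)

    def save(self, step, state):
        # Taking the snapshot is the only part that blocks the training loop.
        snapshot = copy.deepcopy(state)
        self.wait()                      # allow at most one write in flight
        self._latest = (step, snapshot)
        self._writer = threading.Thread(target=self._write, args=(step, snapshot))
        self._writer.start()

    def _write(self, step, snapshot):
        tmp_path = self._path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump((step, snapshot), f)
        os.replace(tmp_path, self._path)  # rename so readers never see a partial file

    def wait(self):
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def load(self):
        # Prefer the in-memory copy; fall back to storage when none is available.
        if self._latest is not None:
            return self._latest
        with open(self._path, "rb") as f:
            return pickle.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ckpt_dir:
        ckpt = AsyncCheckpointer(ckpt_dir)
        ckpt.save(step=100, state={"weights": [0.1, 0.2]})
        ckpt.wait()
        print(ckpt.load())
```

Because save returns as soon as the snapshot is taken, the saving interval can be shortened without stalling training for the full write, and load avoids a storage read whenever the in-memory copy is still available.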

Cache Building Time

During resumable training, the computational graph must be built again, which takes a long time in foundation model scenarios. To solve this problem, MindSpore can store a graph compilation cache file the first time the graph is built. During fault recovery, the cache is read directly from storage to reduce the graph building time. For details, see Table 4.

Table 4 Graph building cache

Function

Graph building cache

Function Highlights

During graph compilation, the cache file stored on the storage device is loaded to help reduce compilation time.

Instructions

Only MindSpore 2.3.0 and later versions are supported.

Key Operation

Add the following environment variables to the shell startup script (for example, train_start.sh) for training:
export MS_COMPILER_CACHE_ENABLE=1 # Enable graph compilation cache.
export MS_COMPILER_CACHE_PATH=xxx # Set the graph compilation cache path.
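
When the training script is launched from Python rather than from a shell script, the same two variables can be prepared programmatically. The following is a minimal sketch: compile_cache_env is a hypothetical helper, and only the variable names MS_COMPILER_CACHE_ENABLE and MS_COMPILER_CACHE_PATH come from this section. It assumes the variables must be set before the training process starts.

```python
import os
import tempfile


def compile_cache_env(cache_dir, base_env=None):
    """Return a copy of the environment with the graph compilation cache enabled."""
    env = dict(os.environ if base_env is None else base_env)
    os.makedirs(cache_dir, exist_ok=True)            # make sure the cache path exists
    env["MS_COMPILER_CACHE_ENABLE"] = "1"            # enable graph compilation cache
    env["MS_COMPILER_CACHE_PATH"] = os.path.abspath(cache_dir)
    return env


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        env = compile_cache_env(cache_dir, base_env={})
        print(env["MS_COMPILER_CACHE_ENABLE"])
```

The returned dictionary could then be passed as the env argument of subprocess.run when starting the training script. For the cache to help during fault recovery, the cache path should presumably be on storage that the rescheduled job can still reach.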