Recovery Time (PyTorch)

This section describes optimizations that shorten the recovery time of resumable training on PyTorch. They target four phases: fault detection, collective communication initialization, training rollback and checkpoint loading, and operator building.

Fault Detection Time

A parameter plane network fault in a cluster does not necessarily affect a training job, so the cluster scheduling components do not forcibly interrupt the job. When such a fault does affect a training job, the network timeout mechanism of collective communication is triggered, and only after the default waiting period of 30 minutes can the cluster scheduling components detect the fault and trigger resumable training. To shorten this delay, the PyTorch Adapter plugin (torch_npu) provides a watchdog fault detection function that determines whether training jobs are affected and reduces the fault detection time. For details, see Table 1.

Table 1 Description of watchdog fault detection

Function

Watchdog fault detection

Function Highlights

When training is started, a monitoring thread is started at the same time to continuously obtain communication exceptions and job execution exceptions. After a fault is detected, an exception is quickly thrown, the training process is terminated, and rescheduling is triggered.

Instructions

Only PyTorch 1.11.0, 2.1.0, and later versions are supported. The version of torch_npu must be later than 6.0.RC1.

Key Operation

In PyTorch 2.1.0 and later versions, watchdog fault detection is enabled by default. You do not need to manually configure environment variables.

(Optional) To disable watchdog fault detection, modify the following environment variables in the shell startup script, for example, train_start.sh.

...
# env for breakpoint ckpt
export RESUME_MODE_ENABLE=1

export HCCL_ASYNC_ERROR_HANDLING=0               # For details about this environment variable, see Table 5.
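
The monitoring-thread idea in Table 1 can be illustrated with a minimal Python sketch. It is not the torch_npu implementation: the `Watchdog` class, its `report_fault` method, and the poll interval are hypothetical names chosen for illustration. The sketch only shows how a thread started alongside training can surface a fault promptly instead of waiting for a communication timeout.

```python
import queue
import threading


class Watchdog:
    """Hypothetical monitor thread; not the torch_npu implementation."""

    def __init__(self, on_fault, poll_interval=0.05):
        self._faults = queue.Queue()
        self._on_fault = on_fault            # callback invoked once when a fault is seen
        self._poll_interval = poll_interval  # seconds between checks of the stop flag
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        """Start monitoring; call this when training starts."""
        self._thread.start()

    def report_fault(self, exc):
        """Called by communication or execution code when it observes an exception."""
        self._faults.put(exc)

    def stop(self):
        """Stop monitoring and wait for the thread to exit."""
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            try:
                exc = self._faults.get(timeout=self._poll_interval)
            except queue.Empty:
                continue
            self._on_fault(exc)  # for example, log the fault and terminate the process
            return
```

In a real job the callback would terminate the training process so that the scheduler can reschedule it; here that decision is left to the caller.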

Collective Communication Initialization Time

Parallel Store multi-thread link setup optimization: When PyTorch creates communication groups, TCP Store is used for information exchange. As the job scale increases, the information processing performance of the native TCP Store degrades, so creating communication groups takes longer. To solve this problem, torch_npu provides Parallel Store, an optimized store built on the native TCP Store. For details, see Table 2.

Table 2 Parallel Store description

Function

Parallel Store

Function Highlights

Link setup requests are processed by multiple threads, which reduces both the waiting time in the link setup request queue and the overall link setup time.

Instructions

For PyTorch 1.11.0, the version of torch_npu must be later than 6.0.RC1.

For PyTorch 2.1.0 or later, the version of torch_npu must be later than 6.0.RC3.

Key Operation

In the shell script (for example, train_start.sh) for starting training, change the torchrun command to torch_npu_run. For example:

Change

torchrun train.py --train_parameter=xxx ....

To

torch_npu_run train.py --train_parameter=xxx ....
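
To see why handling link setup requests with several threads shortens both queue waiting time and total setup time, consider a toy queueing model. This is only an illustration under simplifying assumptions (all requests arrive at once, each takes the same time, and workers serve them in FIFO order). It does not describe how Parallel Store is implemented, and the function names are hypothetical.

```python
def average_wait(num_requests, service_time, num_workers):
    """Average time a request waits before a worker starts serving it.

    Toy model: all requests arrive at time 0, each takes `service_time`
    seconds, and `num_workers` threads serve them in FIFO order.
    """
    if num_requests <= 0 or num_workers <= 0:
        raise ValueError("num_requests and num_workers must be positive")
    total_wait = 0.0
    for i in range(num_requests):
        # Request i is served in round i // num_workers.
        total_wait += (i // num_workers) * service_time
    return total_wait / num_requests


def total_setup_time(num_requests, service_time, num_workers):
    """Time until the last request finishes in the same toy model."""
    if num_requests <= 0 or num_workers <= 0:
        raise ValueError("num_requests and num_workers must be positive")
    rounds = -(-num_requests // num_workers)  # ceiling division
    return rounds * service_time
```

In this model, moving from one worker to four for eight requests cuts the total setup time to a quarter. Real gains depend on network and store behavior that the model ignores.
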
  • Performance optimization of native HCCL link setup: PyTorch sets up a link between NPUs after the collective communication information is exchanged on the NPU. As the job scale increases, the link setup time increases significantly. To solve this problem, CANN optimizes the performance of the native HCCL link setup. For details, see Table 3.
    Table 3 Description of native HCCL link setup performance optimization

    Function

    Native HCCL link setup performance optimization

    Function Highlights

    Multiple threads complete the collective communication information negotiation asynchronously, which reduces both the negotiation time and the overall link setup time.

    Instructions

    Only CANN 8.0.RC2 and later versions are supported.

    Key Operation

    None

  • Link setup optimization in RankTable mode: Ascend Operator provides the function of generating a collective communication configuration file (RankTable file, also called hccl.json) for PyTorch. Links can be set up in RankTable mode to shorten cluster communication link setup time. For details, see Table 4.
    Table 4 Link setup for collective communication in RankTable mode

    Function

    Link setup in RankTable mode

    Function Highlights

    Ascend Operator is used to generate a collective communication configuration file for PyTorch tasks to shorten cluster communication link setup time.

    Instructions

    The version of torch_npu must be later than 6.0.RC3.

    Key Operation

    1. By default, the startup YAML file mounts the parent directory of the hccl.json file. You can change the directory as required.
      volumes:
             - name: ranktable-dir
               hostPath:
                 path: /user/mindx-dl/ranktable  # The host directory must be in the shared directory.
                 type: DirectoryOrCreate
      Run the following commands to create a mount path for the hccl.json file in the host directory and change the owner:
      mkdir -m 777 /user/mindx-dl/ranktable/Namespace_where_the_job_is_running.Job_name
      chown 9000:9000 /user/mindx-dl/ranktable/Namespace_where_the_job_is_running.Job_name
      For example:
      mkdir -m 777 /user/mindx-dl/ranktable/default.pytorch-test
      chown 9000:9000 /user/mindx-dl/ranktable/default.pytorch-test
    2. Add the following environment variable to the training script:
      export RANK_TABLE_FILE=/user/mindx-dl/ranktable/hccl.json
    3. Modify the training YAML file and add the following settings:
            volumeMounts:
            - name: ranktable
              mountPath: /user/mindx-dl/ranktable

            volumes:
            - name: ranktable
              hostPath:
                path: /user/mindx-dl/ranktable/Namespace_where_the_job_is_running.Job_name  # Actual path of the hccl.json file in the host directory
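
The directory preparation in step 1 can also be scripted. The following Python sketch mirrors the `mkdir -m 777` and `chown 9000:9000` commands above. The helper names `ranktable_host_dir` and `prepare_ranktable_dir` are hypothetical, and the sketch assumes a Linux host and the `<namespace>.<job name>` directory naming shown in the example.

```python
import os


def ranktable_host_dir(namespace, job_name, root="/user/mindx-dl/ranktable"):
    """Return <root>/<namespace>.<job_name>, the per-job directory from step 1."""
    if not namespace or not job_name:
        raise ValueError("namespace and job_name must be non-empty")
    return os.path.join(root, f"{namespace}.{job_name}")


def prepare_ranktable_dir(namespace, job_name, root="/user/mindx-dl/ranktable",
                          uid=9000, gid=9000, change_owner=True):
    """Create the per-job directory with mode 777 and optionally change its owner."""
    path = ranktable_host_dir(namespace, job_name, root)
    os.makedirs(path, exist_ok=True)
    os.chmod(path, 0o777)         # makedirs honors the umask, so set the mode explicitly
    if change_owner:
        os.chown(path, uid, gid)  # needs sufficient privileges, like the chown command
    return path
```

For example, `prepare_ranktable_dir("default", "pytorch-test")` targets the same path as the example commands in step 1.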

Training Rollback and Checkpoint Loading Time

  • Asynchronous checkpoint saving: A training job periodically saves checkpoint files to record parameter information. After a fault is rectified, training rolls back to the most recently saved checkpoint file and resumes from there. Each checkpoint save takes time away from training, so the saving interval is usually kept large to preserve training efficiency. However, the larger the interval, the more training progress is lost to rollback on each fault. To solve this problem, MindIO ACP is introduced to save checkpoints asynchronously. For details, see Table 5.
    Table 5 Asynchronous checkpoint saving

    Function

    Asynchronous checkpoint saving

    Function Highlights

    After checkpoints are obtained from the NPU, they are written to storage asynchronously. This minimizes the training time lost to each checkpoint save and allows a shorter saving interval, thereby reducing the training rollback time.

    Instructions

    Only cluster scheduling components and MindIO components of 6.0.RC2 and later are supported.

    Key Operation

    For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization.
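
The general pattern behind asynchronous checkpoint saving can be sketched in plain Python. This is not the MindIO ACP interface: the `AsyncCheckpointSaver` class is hypothetical, and `pickle` stands in for whatever serialization a real job uses. The sketch copies the state in memory first, so training can continue while a background thread writes the copy to storage.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointSaver:
    """Hypothetical helper; the real MindIO ACP interface is not shown here."""

    def __init__(self):
        self._writer = None

    def save(self, state, path):
        """Snapshot `state` in memory, then write it to `path` in the background."""
        self.wait()                      # allow only one write in flight
        snapshot = copy.deepcopy(state)  # training may mutate `state` after this returns
        self._writer = threading.Thread(
            target=self._write, args=(snapshot, path), daemon=True
        )
        self._writer.start()

    def wait(self):
        """Block until the pending write, if any, has finished."""
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    @staticmethod
    def _write(snapshot, path):
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)  # a crash mid-write never leaves a partial file at `path`
```

Only the in-memory copy blocks training. Because that copy is usually much faster than the write to storage, checkpoints can be taken more often, which is what shortens the rollback after a fault.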

  • Efficient checkpoint recovery: During training rollback and recovery, checkpoints must be loaded from storage. Due to the large volume of checkpoint data, directly reading and loading checkpoints from storage takes considerable time. To solve this problem, MindIO ACP is introduced for efficient checkpoint recovery. For details, see Table 6.
    Table 6 Efficient checkpoint recovery

    Function

    Efficient checkpoint recovery

    Function Highlights

    MindIO stores the latest checkpoint in memory. During fault recovery, the checkpoint can be read directly from memory, reducing checkpoint reading time.

    Instructions

    Only cluster scheduling components and MindIO components of 6.0.RC2 and later are supported.

    Key Operation

    For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization.
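
The lookup order described in Table 6 (memory first, storage as fallback) can be sketched as follows. The `CheckpointCache` class is hypothetical and is not the MindIO API. It keeps the cache inside the same process only to show the idea, whereas a real deployment would need the cached copy to outlive the failed training process.

```python
import pickle


class CheckpointCache:
    """Hypothetical in-memory cache of the latest checkpoint; not the MindIO API."""

    def __init__(self):
        self._latest_path = None
        self._latest_bytes = None

    def save(self, state, path):
        """Persist `state` to `path` and keep the serialized bytes in memory."""
        data = pickle.dumps(state)
        with open(path, "wb") as f:
            f.write(data)
        self._latest_path, self._latest_bytes = path, data

    def load(self, path):
        """Return (state, source): read from memory when possible, else from storage."""
        if path == self._latest_path and self._latest_bytes is not None:
            return pickle.loads(self._latest_bytes), "memory"
        with open(path, "rb") as f:
            return pickle.load(f), "storage"
```

The second element of the returned tuple is there only to make the chosen path visible. A recovery routine would normally use just the state.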

Operator Building Time

If an operator needs to be re-executed during resumable training, building the operator takes a long time. To solve this problem, you can select the operator binary or operator building cache to reduce the building time. For details, see Table 7 and Table 8.

The operator binary and the operator building cache are mutually exclusive. Enable only one of them.

Table 7 Operator binary description

Function

Operator binary

Function Highlights

The preset operator binary is loaded in advance so that the operator can be executed without being built.

Instructions

Only CANN 8.0.RC2 and later versions are supported.

Key Operation

In the Python startup script, add the operator binary configuration command to enable operator binary.

torch.npu.set_compile_mode(jit_compile=False)

Table 8 Operator building cache description

Function

Operator building cache

Function Highlights

During operator building, the operator building cache file saved on the storage device is loaded, which can reduce the building time.

Instructions

Only CANN 8.0.RC2 and later versions are supported.

Key Operation

  1. In the Python startup script, add the following command to enable the operator building cache.
    torch.npu.set_compile_mode(jit_compile=True)
  2. Add the following environment variables to the shell startup script (for example, train_start.sh) for training:
    export ASCEND_CACHE_PATH=xxx    # Add a shared storage path.
    export ASCEND_MAX_OP_CACHE_SIZE=-1    # Enable this environment variable when using shared storage to solve the problem of resource preemption when multiple nodes read the shared storage cache.
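
Because the two options are mutually exclusive, it can help to derive both the `jit_compile` flag and the environment variables from a single setting. The helper below is a hypothetical sketch and not part of torch_npu. It only collects the values given in Table 7 and Table 8 and does not call `torch.npu.set_compile_mode` itself.

```python
def operator_build_settings(mode, cache_path=None):
    """Return (jit_compile, env) for one of the two mutually exclusive modes.

    mode: "binary" for the operator binary, "cache" for the operator building cache.
    cache_path: shared storage path; required only in "cache" mode.
    """
    if mode == "binary":
        return False, {}
    if mode == "cache":
        if not cache_path:
            raise ValueError("cache mode needs a shared storage path")
        env = {
            "ASCEND_CACHE_PATH": cache_path,
            "ASCEND_MAX_OP_CACHE_SIZE": "-1",
        }
        return True, env
    raise ValueError(f"unknown mode: {mode!r}")
```

The returned flag is the value to pass to `torch.npu.set_compile_mode(jit_compile=...)` in the Python startup script, and the returned variables are the ones to export in the shell startup script as described above.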