Recovery Time (PyTorch)

This section describes optimizations that shorten the recovery time of resumable training on PyTorch. They cover four phases: fault detection, collective communication initialization, training rollback and checkpoint loading, and operator building.

Fault Detection Time

A parameter plane network fault in a cluster does not necessarily affect a training job, so the cluster scheduling components do not forcibly interrupt the job. When such a fault does affect a training job, the network timeout mechanism of collective communication is triggered, and the cluster scheduling components can detect the fault and trigger resumable training only after the default waiting period of 30 minutes. To shorten this wait, the PyTorch Adapter plugin (torch_npu) provides a watchdog fault detection function that determines whether training jobs are affected and reduces the fault detection time. For details, see Table 1.

**Table 1** Description of watchdog fault detection
Function	Watchdog fault detection
Function Highlights	When training starts, a monitoring thread starts with it and continuously checks for communication exceptions and job execution exceptions. When a fault is detected, an exception is thrown promptly, the training process is terminated, and rescheduling is triggered.
Instructions	Only PyTorch 1.11.0, 2.1.0, and later versions are supported. The version of torch_npu must be later than 6.0.RC1.
Key Operation	In PyTorch 2.1.0 and later versions, watchdog fault detection is enabled by default. You do not need to manually configure environment variables. (Optional) To disable watchdog fault detection, modify the following environment variables in the shell startup script, for example, train_start.sh.
    ...
    # env for breakpoint ckpt
    export RESUME_MODE_ENABLE=1
    export HCCL_ASYNC_ERROR_HANDLING=0    # For details about this environment variable, see Table 5.
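
The monitoring-thread idea in Table 1 can be illustrated with a minimal, generic sketch. This is not the torch_npu implementation: the `Watchdog` class, the `check` callable, and the `train` helper are hypothetical names introduced only for illustration. The sketch assumes a health check that returns `None` while the job is healthy and an error string once a fault is seen.

```python
import threading
import time


class Watchdog:
    """Hypothetical watchdog: polls a health check from a background thread."""

    def __init__(self, check, interval_s=0.01):
        self._check = check            # callable returning None or an error string
        self._interval_s = interval_s
        self._stop = threading.Event()
        self.fault = None              # first fault reported by the check
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Poll until stopped or until the first fault is observed.
        while not self._stop.is_set():
            error = self._check()
            if error is not None:
                self.fault = error
                self._stop.set()
                return
            self._stop.wait(self._interval_s)

    def raise_if_faulted(self):
        # Called from the training loop so the job fails fast instead of
        # waiting for a communication timeout.
        if self.fault is not None:
            raise RuntimeError(f"watchdog detected fault: {self.fault}")


def train(steps, watchdog, step_fn):
    """Run step_fn for each step, aborting once the watchdog reports a fault."""
    watchdog.start()
    try:
        for step in range(steps):
            watchdog.raise_if_faulted()
            step_fn(step)
    finally:
        watchdog.stop()
```

In this sketch the training loop checks the watchdog before every step, so the time to notice a fault is bounded by the polling interval plus one step rather than by the communication timeout.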

Collective Communication Initialization Time

Parallel Store multi-thread link setup optimization: When PyTorch creates communication groups, TCP Store is used for information exchange. As the job scale increases, the information processing performance of the native TCP Store degrades, so creating communication groups takes longer. To solve this problem, torch_npu provides Parallel Store, an optimized store built on the native TCP Store. For details, see Table 2.

**Table 2** Parallel Store description
Function	Parallel Store
Function Highlights	Link setup requests are processed by multiple threads, which reduces both the waiting time in the link setup request queue and the overall link setup time.
Instructions	For PyTorch 1.11.0, the version of torch_npu must be later than 6.0.RC1. For PyTorch 2.1.0 or later, the version of torch_npu must be later than 6.0.RC3.
Key Operation	In the shell script (for example, train_start.sh) for starting training, change the torchrun command to torch_npu_run. For example, change
    torchrun train.py --train_parameter=xxx ....
to
    torch_npu_run train.py --train_parameter=xxx ....

Performance optimization of native HCCL link setup: PyTorch sets up a link between NPUs after the collective communication information is exchanged on the NPU. As the job scale increases, the link setup time increases significantly. To solve this problem, CANN optimizes the performance of the native HCCL link setup. For details, see Table 3.

**Table 3** Description of native HCCL link setup performance optimization
Function	Native HCCL link setup performance optimization
Function Highlights	Multiple threads complete the collective communication information negotiation asynchronously, which reduces both the negotiation time and the overall link setup time.
Instructions	Only CANN 8.0.RC2 and later versions are supported.
Key Operation	None

Link setup optimization in RankTable mode: Ascend Operator provides the function of generating a collective communication configuration file (RankTable file, also called hccl.json) for PyTorch. Links can be set up in RankTable mode to shorten cluster communication link setup time. For details, see Table 4.

**Table 4** Link setup for collective communication in RankTable mode
Function	Link setup in RankTable mode
Function Highlights	Ascend Operator is used to generate a collective communication configuration file for PyTorch tasks to shorten cluster communication link setup time.
Instructions	The version of torch_npu must be later than 6.0.RC3.
Key Operation	By default, the startup YAML file mounts the parent directory of the hccl.json file. You can change the directory as required.
    volumes:
    - name: ranktable-dir
      hostPath:
        path: /user/mindx-dl/ranktable    # The host directory must be in the shared directory.
        type: DirectoryOrCreate
Run the following commands to create a mount path for the hccl.json file in the host directory and change the owner:
    mkdir -m 777 /user/mindx-dl/ranktable/Namespace_where_the_job_is_running.Job_name
    chown 9000:9000 /user/mindx-dl/ranktable/Namespace_where_the_job_is_running.Job_name
For example:
    mkdir -m 777 /user/mindx-dl/ranktable/default.pytorch-test
    chown 9000:9000 /user/mindx-dl/ranktable/default.pytorch-test
Add the following environment variable to the training script:
    export RANK_TABLE_FILE=/user/mindx-dl/ranktable/hccl.json
Modify the training YAML file and add the following settings:
    volumeMounts:
    - name: ranktable
      mountPath: /user/mindx-dl/ranktable
    volumes:
    - name: ranktable
      hostPath:
        path: /user/mindx-dl/ranktable/Namespace_where_the_job_is_running.Job_name    # Actual path of the hccl.json file in the host directory
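
Because a missing or unreadable RankTable file only shows up when link setup starts, a training script may want to check it early. The following is a minimal sketch under stated assumptions: `ranktable_host_dir` and `load_rank_table` are hypothetical helpers, not part of torch_npu or Ascend Operator, and the check only confirms that the file named by `RANK_TABLE_FILE` exists and parses as JSON, since the hccl.json schema is not described in this section.

```python
import json
import os
import posixpath


def ranktable_host_dir(root, namespace, job_name):
    """Build the per-job host directory: <root>/<namespace>.<job_name>."""
    return posixpath.join(root, f"{namespace}.{job_name}")


def load_rank_table(env=None):
    """Return the parsed RankTable file named by RANK_TABLE_FILE.

    Only checks that the file exists and is valid JSON; the content
    of hccl.json is not validated.
    """
    env = os.environ if env is None else env
    path = env.get("RANK_TABLE_FILE")
    if not path:
        raise RuntimeError("RANK_TABLE_FILE is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"RankTable file not found: {path}")
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For the example job above, `ranktable_host_dir("/user/mindx-dl/ranktable", "default", "pytorch-test")` yields the same directory that the mkdir and chown commands operate on.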

Training Rollback and Checkpoint Loading Time

Asynchronous checkpoint saving: A training job periodically saves checkpoint files that record parameter information. After a fault is rectified, training is rolled back to the most recently saved checkpoint file and resumes from there. Each checkpoint save takes time away from training, so the saving interval is usually set large to preserve training efficiency. However, a larger saving interval means that more training progress is lost in the rollback after each fault. To solve this problem, MindIO ACP is introduced to asynchronously save checkpoints. For details, see Table 5.

**Table 5** Asynchronous checkpoint saving
Function	Asynchronous checkpoint saving
Function Highlights	After checkpoints are obtained from the NPU, they are written to storage asynchronously. This minimizes the training time lost to each checkpoint save and allows a shorter saving interval, thereby reducing the training rollback time.
Instructions	Only cluster scheduling components and MindIO components of 6.0.RC2 and later are supported.
Key Operation	For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization.
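
The general pattern behind asynchronous saving can be sketched in a few lines. This is a generic illustration, not the MindIO ACP API: `AsyncCheckpointSaver` is a hypothetical class, and the sketch assumes the training state is a plain picklable Python object. Training blocks only while the state is copied in memory; the write to storage runs in a background thread.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointSaver:
    """Hypothetical saver: snapshot in memory, write to storage in the background."""

    def __init__(self):
        self._writer = None

    def save(self, state, path):
        # Wait for the previous write so at most one write is in flight.
        self.wait()
        # The blocking part: copy the state so training can keep mutating it.
        snapshot = copy.deepcopy(state)
        self._writer = threading.Thread(target=self._write, args=(snapshot, path))
        self._writer.start()

    @staticmethod
    def _write(snapshot, path):
        # Write to a temporary file and rename it, so an interrupted write
        # never leaves a truncated checkpoint at `path`.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)

    def wait(self):
        # Block until the in-flight write, if any, has finished.
        if self._writer is not None:
            self._writer.join()
            self._writer = None
```

Because the blocking portion is limited to the in-memory copy, checkpoints can be taken more often, which shortens the span of training that has to be repeated after a fault.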

Efficient checkpoint recovery: During training rollback and recovery, checkpoints must be loaded from storage. Due to the large volume of checkpoint data, directly reading and loading checkpoints from storage takes considerable time. To solve this problem, MindIO ACP is introduced for efficient checkpoint recovery. For details, see Table 6.

**Table 6** Efficient checkpoint recovery
Function	Efficient checkpoint recovery
Function Highlights	MindIO stores the latest checkpoint in memory. During fault recovery, the checkpoint can be read directly from memory, reducing checkpoint reading time.
Instructions	Only cluster scheduling components and MindIO components of 6.0.RC2 and later are supported.
Key Operation	For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization.
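
The read path described in Table 6 amounts to "try memory first, fall back to storage". The sketch below is a generic illustration, not the MindIO implementation: `CheckpointCache` is a hypothetical class that keeps the copy inside the same Python process, whereas a copy that must survive a restart of the training process would need to live outside it.

```python
import copy
import pickle


class CheckpointCache:
    """Hypothetical cache holding the most recent checkpoint in memory."""

    def __init__(self):
        self._latest = None   # (path, state) of the most recent save

    def save(self, state, path):
        # Persist to storage and keep an independent copy in memory.
        with open(path, "wb") as f:
            pickle.dump(state, f)
        self._latest = (path, copy.deepcopy(state))

    def load(self, path):
        """Return (state, source), where source is 'memory' or 'storage'."""
        if self._latest is not None and self._latest[0] == path:
            return copy.deepcopy(self._latest[1]), "memory"
        with open(path, "rb") as f:
            return pickle.load(f), "storage"
```

Loading the latest checkpoint through the same cache object returns the in-memory copy; any other checkpoint, or a cache that has not seen a save, falls back to reading from storage.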

Operator Building Time

Operators that must be re-executed during resumable training have to be built again, which takes a long time. To reduce the building time, you can use either the operator binary or the operator building cache. For details, see Table 7 and Table 8.

The operator binary and the operator building cache are mutually exclusive. Enable only one of them.

**Table 7** Operator binary description
Function	Operator binary
Function Highlights	During operator building, the preset operator binary is loaded in advance so that the operator can be executed without building.
Instructions	Only CANN 8.0.RC2 and later versions are supported.
Key Operation	In the Python startup script, add the operator binary configuration command to enable operator binary.
    torch.npu.set_compile_mode(jit_compile=False)

**Table 8** Operator building cache description
Function	Operator building cache
Function Highlights	During operator building, the operator building cache file saved on the storage device is loaded, which can reduce the building time.
Instructions	Only CANN 8.0.RC2 and later versions are supported.
Key Operation	In the Python startup script, add the following command to enable the operator building cache:
    torch.npu.set_compile_mode(jit_compile=True)
Add the following environment variables to the shell startup script (for example, train_start.sh) for training:
    export ASCEND_CACHE_PATH=xxx    # Add a shared storage path.
    export ASCEND_MAX_OP_CACHE_SIZE=-1    # Set this variable when using shared storage to avoid resource preemption when multiple nodes read the shared storage cache.
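
Since only one of the two options can be enabled, a launcher can derive both the `jit_compile` flag and the environment variables from a single choice. The helper below is a minimal sketch: `compile_settings` and its `mode` values are hypothetical names, and only the flag values and environment variables listed in Table 7 and Table 8 are used.

```python
def compile_settings(mode, cache_path=None):
    """Return (jit_compile, env) for one of the two mutually exclusive modes.

    mode: "binary" for operator binary, "cache" for operator building cache.
    """
    if mode == "binary":
        # Operator binary: jit_compile=False, no extra environment variables.
        return False, {}
    if mode == "cache":
        if not cache_path:
            raise ValueError("cache mode needs a shared storage path")
        env = {
            "ASCEND_CACHE_PATH": cache_path,
            "ASCEND_MAX_OP_CACHE_SIZE": "-1",
        }
        return True, env
    raise ValueError(f"unknown mode: {mode!r}")
```

The returned flag is the value to pass as `torch.npu.set_compile_mode(jit_compile=...)`, and the returned dictionary holds the variables to export before training starts.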
