Recovery Time (PyTorch)
This section describes optimizations that shorten the recovery time of resumable training on PyTorch. They cover four stages: Fault Detection Time, Collective Communication Initialization Time, Training Rollback and Checkpoint Loading Time, and Operator Building Time.
Fault Detection Time
A parameter plane network fault in a cluster does not necessarily affect a training job, so the cluster scheduling components do not forcibly interrupt the job. When such a fault does affect a training job, the network timeout mechanism of collective communication is triggered, and the cluster scheduling components detect the fault and trigger resumable training only after the default waiting period of 30 minutes. To shorten this delay, the PyTorch Adapter plugin (torch_npu) provides a watchdog fault detection function that determines whether training jobs are affected and reduces the fault detection time. For details, see Table 1.
Table 1 Watchdog fault detection

| Function | Watchdog fault detection |
|---|---|
| Function Highlights | When training is started, a monitoring thread is started at the same time to continuously obtain communication exceptions and job execution exceptions. After a fault is detected, an exception is quickly thrown, the training process is terminated, and rescheduling is triggered. |
| Instructions | Only PyTorch 1.11.0, 2.1.0, and later versions are supported. The version of torch_npu must be later than 6.0.RC1. |
| Key Operation | In PyTorch 2.1.0 and later versions, watchdog fault detection is enabled by default. You do not need to manually configure environment variables. (Optional) To disable watchdog fault detection, modify the following environment variables in the shell startup script, for example, train_start.sh:<br>`...`<br>`# env for breakpoint ckpt`<br>`export RESUME_MODE_ENABLE=1`<br>`export HCCL_ASYNC_ERROR_HANDLING=0  # For details about this environment variable, see Table 5.` |
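The monitoring-thread idea behind the watchdog can be illustrated with a small, self-contained sketch. This is a conceptual model only and not the torch_npu implementation: the `Watchdog` class, its `report` and `check` methods, and the polling interval are all hypothetical names chosen for illustration.

```python
import queue
import threading
import time


class Watchdog:
    """Conceptual monitor thread (hypothetical; not torch_npu code)."""

    def __init__(self, poll_interval=0.01):
        self._errors = queue.Queue()       # faults reported by worker code
        self._stop = threading.Event()
        self._poll_interval = poll_interval
        self.fault = None                  # first fault observed, if any
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        """Start monitoring alongside training."""
        self._thread.start()

    def report(self, error):
        """Called by communication or execution code when something fails."""
        self._errors.put(error)

    def _run(self):
        # Poll for reported faults until stopped or until one is seen.
        while not self._stop.is_set():
            try:
                self.fault = self._errors.get(timeout=self._poll_interval)
            except queue.Empty:
                continue
            self._stop.set()               # stop monitoring after the first fault

    def check(self):
        """Raise promptly in the training loop if a fault was detected."""
        if self.fault is not None:
            raise RuntimeError(f"fault detected: {self.fault}")

    def stop(self):
        self._stop.set()
        self._thread.join()
```

In this sketch the training loop would call `check()` once per step, so a reported fault surfaces within roughly one polling interval plus one step instead of after a long communication timeout.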
Collective Communication Initialization Time
- Parallel Store multi-thread link setup optimization: When PyTorch creates communication groups, it uses TCP Store to exchange information. As the job scale increases, the information processing performance of the native TCP Store degrades, and creating communication groups takes longer. To solve this problem, torch_npu provides Parallel Store, an optimized store built on the native TCP Store. For details, see Table 2.
Table 2 Parallel Store

| Function | Parallel Store |
|---|---|
| Function Highlights | Link setup requests are processed by multiple threads, which reduces both the waiting time in the link setup request queue and the overall link setup time. |
| Instructions | For PyTorch 1.11.0, the version of torch_npu must be later than 6.0.RC1. For PyTorch 2.1.0 or later, the version of torch_npu must be later than 6.0.RC3. |
| Key Operation | In the shell script for starting training (for example, train_start.sh), change the torchrun command to torch_npu_run. For example, change `torchrun train.py --train_parameter=xxx ...` to `torch_npu_run train.py --train_parameter=xxx ...` |
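Why multi-threaded processing shortens the request queue can be seen in a toy queueing model. The sketch below is an illustration under simplifying assumptions (all requests arrive at the same time and take the same service time); the function name and parameters are hypothetical and do not describe Parallel Store internals.

```python
import heapq


def link_setup_times(num_requests, service_time, num_threads):
    """Return (mean queue wait, total time) for a toy link setup queue.

    Assumes num_requests >= 1 identical requests that all arrive at t=0
    and are served by num_threads workers.
    """
    if num_requests < 1 or num_threads < 1:
        raise ValueError("num_requests and num_threads must be at least 1")
    free_at = [0.0] * num_threads          # min-heap: when each worker is next free
    heapq.heapify(free_at)
    total_wait = 0.0
    finish = 0.0
    for _ in range(num_requests):
        start = heapq.heappop(free_at)     # earliest available worker
        total_wait += start                # the request waited from t=0 until start
        end = start + service_time
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return total_wait / num_requests, finish


print(link_setup_times(8, 1.0, 1))  # (3.5, 8.0)
print(link_setup_times(8, 1.0, 4))  # (0.5, 2.0)
```

In this model, moving from one thread to four cuts both the mean queue wait and the total setup time. Real gains depend on network and server behavior that the model ignores.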
- Performance optimization of native HCCL link setup: PyTorch sets up a link between NPUs after the collective communication information is exchanged on the NPU. As the job scale increases, the link setup time increases significantly. To solve this problem, CANN optimizes the performance of the native HCCL link setup. For details, see Table 3.
Table 3 Description of native HCCL link setup performance optimization

| Function | Native HCCL link setup performance optimization |
|---|---|
| Function Highlights | By asynchronously completing collective communication information negotiation, multiple threads reduce both the negotiation time and the overall link setup time. |
| Instructions | Only CANN 8.0.RC2 and later versions are supported. |
| Key Operation | None |
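Several optimizations in this section depend on a minimum component version, such as CANN 8.0.RC2. A startup script could check such a requirement with a helper like the sketch below. It rests on two assumptions that the text does not confirm: version strings have the form `X.Y.RCn` or `X.Y.Z`, and release candidates precede the final releases of the same `X.Y` line. The function names are hypothetical.

```python
import re

# Assumed formats: "8.0.RC2" (release candidate) or "8.0.0" (final release).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(?:RC(\d+)|(\d+))$")


def parse_version(text):
    """Convert a version string into a tuple that sorts in release order."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, rc, patch = match.groups()
    if rc is not None:
        # Assumption: release candidates sort before final releases.
        return (int(major), int(minor), 0, int(rc))
    return (int(major), int(minor), 1, int(patch))


def is_at_least(installed, minimum):
    """Return True if `installed` is the same as or newer than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


print(is_at_least("8.0.RC1", "8.0.RC2"))  # False
print(is_at_least("8.0.0", "8.0.RC2"))    # True
```

Where the requirement is stated as "later than" a version, as for torch_npu, a strict comparison (`>` instead of `>=`) would be needed.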
- Link setup optimization in RankTable mode: Ascend Operator provides the function of generating a collective communication configuration file (RankTable file, also called hccl.json) for PyTorch. Links can be set up in RankTable mode to shorten cluster communication link setup time. For details, see Table 4.
Table 4 Link setup for collective communication in RankTable mode

| Function | Link setup in RankTable mode |
|---|---|
| Function Highlights | Ascend Operator is used to generate a collective communication configuration file for PyTorch tasks to shorten cluster communication link setup time. |
| Instructions | The version of torch_npu must be later than 6.0.RC3. |
Key Operation
- By default, the startup YAML file mounts the parent directory of the hccl.json file. You can change the directory as required.

  ```yaml
  volumes:
  - name: ranktable-dir
    hostPath:
      path: /user/mindx-dl/ranktable   # The host directory must be in the shared directory.
      type: DirectoryOrCreate
  ```

  Run the following commands to create a mount path for the hccl.json file in the host directory and change the owner:

  ```shell
  mkdir -m 777 /user/mindx-dl/ranktable/Namespace_where_the_job_is_running.Job_name
  chown 9000:9000 /user/mindx-dl/ranktable/Namespace_where_the_job_is_running.Job_name
  ```

  For example:

  ```shell
  mkdir -m 777 /user/mindx-dl/ranktable/default.pytorch-test
  chown 9000:9000 /user/mindx-dl/ranktable/default.pytorch-test
  ```

- Add the following environment variable to the training script:

  ```shell
  export RANK_TABLE_FILE=/user/mindx-dl/ranktable/hccl.json
  ```

- Modify the training YAML file and add the following settings:

  ```yaml
  volumeMounts:
  - name: ranktable
    mountPath: /user/mindx-dl/ranktable
  volumes:
  - name: ranktable
    hostPath:
      path: /user/mindx-dl/ranktable/Namespace_where_the_job_is_running.Job_name   # Actual path of the hccl.json file in the host directory
  ```
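The path convention and environment variable used in these steps can be checked from Python before training starts. The following is a minimal sketch: the helper names are hypothetical, and it only verifies that the file named by `RANK_TABLE_FILE` exists and parses as JSON, without assuming anything about the fields inside hccl.json.

```python
import json
import os
from pathlib import PurePosixPath

# Host directory used in the steps above.
RANKTABLE_ROOT = PurePosixPath("/user/mindx-dl/ranktable")


def job_ranktable_dir(namespace, job_name, root=RANKTABLE_ROOT):
    """Per-job host directory, named <namespace>.<job name>."""
    return PurePosixPath(root) / f"{namespace}.{job_name}"


def load_rank_table(env=None):
    """Read the file named by RANK_TABLE_FILE and parse it as JSON."""
    env = os.environ if env is None else env
    path = env.get("RANK_TABLE_FILE")
    if not path:
        raise RuntimeError("RANK_TABLE_FILE is not set")
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


print(job_ranktable_dir("default", "pytorch-test"))
# /user/mindx-dl/ranktable/default.pytorch-test
```

Calling `load_rank_table()` at the top of the training script would fail fast with a clear message if the variable is unset or the mount is missing.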
Training Rollback and Checkpoint Loading Time
- Asynchronous checkpoint saving: A training job periodically saves checkpoint files to persist parameter information. After a fault is rectified, training rolls back to the most recently saved checkpoint file and resumes from there. Each checkpoint save takes time away from training, so the saving interval is usually long to keep training efficient. However, a longer saving interval means that more training progress is lost to rollback after each fault. To solve this problem, MindIO ACP is introduced to save checkpoints asynchronously. For details, see Table 5.
Table 5 Asynchronous checkpoint saving

| Function | Asynchronous checkpoint saving |
|---|---|
| Function Highlights | After checkpoints are obtained from the NPU, they are written to storage asynchronously. This minimizes the training time lost to each checkpoint save and shortens the checkpoint saving interval, thereby reducing the training rollback time. |
| Instructions | Only cluster scheduling components and MindIO components of 6.0.RC2 and later are supported. |
| Key Operation | For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization. |
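The general pattern of asynchronous checkpoint saving can be sketched with the standard library alone. This is not the MindIO ACP API. The `AsyncCheckpointer` class is a hypothetical illustration in which a plain Python object stands in for model state: the training thread blocks only for an in-memory copy, and a background thread does the slow write.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Snapshot state synchronously, persist it on a background thread."""

    def __init__(self):
        self._writer = None

    def save(self, state, path):
        self.wait()                          # allow at most one write in flight
        snapshot = copy.deepcopy(state)      # short blocking step: copy in memory
        self._writer = threading.Thread(target=self._write, args=(snapshot, path))
        self._writer.start()                 # training can continue while this runs

    @staticmethod
    def _write(snapshot, path):
        tmp_path = f"{path}.tmp"
        with open(tmp_path, "wb") as handle:
            pickle.dump(snapshot, handle)
        os.replace(tmp_path, path)           # rename so readers never see a partial file

    def wait(self):
        """Block until the pending write, if any, has finished."""
        if self._writer is not None:
            self._writer.join()
            self._writer = None
```

Because the snapshot is copied before `save` returns, later updates to the live state do not affect the file being written. Call `wait()` before the process exits so that the last checkpoint is complete.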
- Efficient checkpoint recovery: During training rollback and recovery, checkpoints must be loaded from storage. Due to the large volume of checkpoint data, directly reading and loading checkpoints from storage takes considerable time. To solve this problem, MindIO ACP is introduced for efficient checkpoint recovery. For details, see Table 6.
Table 6 Efficient checkpoint recovery

| Function | Efficient checkpoint recovery |
|---|---|
| Function Highlights | MindIO stores the latest checkpoint in memory. During fault recovery, the checkpoint can be read directly from memory, reducing checkpoint reading time. |
| Instructions | Only cluster scheduling components and MindIO components of 6.0.RC2 and later are supported. |
| Key Operation | For details about how to install and use MindIO, see Checkpoint Saving, Loading, and Optimization. |
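The idea of serving the latest checkpoint from memory and falling back to storage can be sketched as follows. This is a simplified illustration rather than the MindIO interface. The `CachedCheckpointStore` class is hypothetical, and it keeps the cache inside the same process, whereas a real service would have to keep it somewhere that survives a training process restart.

```python
import copy
import pickle


class CachedCheckpointStore:
    """Keep the most recent checkpoint in memory in addition to storage."""

    def __init__(self):
        self._cached_path = None
        self._cached_state = None

    def save(self, state, path):
        with open(path, "wb") as handle:
            pickle.dump(state, handle)       # durable copy on storage
        self._cached_path = path
        self._cached_state = copy.deepcopy(state)  # fast copy in memory

    def load(self, path):
        """Return (state, source), where source is "memory" or "storage"."""
        if path == self._cached_path:
            return copy.deepcopy(self._cached_state), "memory"
        with open(path, "rb") as handle:
            return pickle.load(handle), "storage"
```

A load that hits the cache skips the storage read entirely, which is where the recovery time saving comes from. Any other path is read from storage as usual.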
Operator Building Time
The operator binary and the operator building cache cannot be used together. Enable only one of them.
| Function | Operator binary |
|---|---|
| Function Highlights | During operator building, the preset operator binary is loaded in advance so that the operator can be executed without building. |
| Instructions | Only CANN 8.0.RC2 and later versions are supported. |
| Key Operation | In the Python startup script, add the following operator binary configuration command to enable operator binary: `torch.npu.set_compile_mode(jit_compile=False)` |
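One way to place this call in a startup script is to wrap it so that the script still runs on machines without an NPU stack. The sketch below is an assumption about script layout, not an official pattern. The `enable_operator_binary` helper is hypothetical, and only the `torch.npu.set_compile_mode(jit_compile=False)` call comes from the table above.

```python
def enable_operator_binary():
    """Enable operator binary if torch and torch_npu are importable.

    Returns True when the setting was applied, False when the NPU stack
    is not installed in the current environment.
    """
    try:
        import torch
        import torch_npu  # noqa: F401  (importing it makes torch.npu available)
    except ImportError:
        return False
    # Command from the Key Operation row: run preset operator binaries
    # instead of building operators just in time.
    torch.npu.set_compile_mode(jit_compile=False)
    return True


if __name__ == "__main__":
    print("operator binary enabled:", enable_operator_binary())
```

Call the helper once near the top of the startup script, before the model is built, so that the setting is in place when operators are first executed.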
| Function | Operator building cache |
|---|---|
| Function Highlights | During operator building, the operator building cache file saved on the storage device is loaded, which can reduce the building time. |
| Instructions | Only CANN 8.0.RC2 and later versions are supported. |
| Key Operation | |