Adjusting Gradient Splitting Strategy

Background

In the distributed training scenario, gradient aggregation is performed after gradients between devices are calculated. Gradient data is generated in order and does not change after being generated. To improve training performance, the gradient data may be split. Gradient aggregation may be immediately started after gradient data of a segment is generated, so that the aggregation of a part of gradient data and forward propagation and backward propagation (FPBP) can be performed in parallel.

The default splitting strategy is to split data into two segments, with the first segment taking up 96.54% of the data size and the second segment taking up 3.46% of the data size. (In some cases, the data may not be split.) This splitting strategy may not be applicable to other networks due to the data size and calculation time differences of different network gradients. You can adjust the distributed gradient splitting strategy by referring to this section to improve the training performance in distributed scenarios.

Determining Gradient Splitting Strategy

You need to use the Profiling tool to analyze the iteration traces of the training process to determine the gradient splitting strategy and improve the training performance in distributed scenarios.

For details, see Profiling Instructions.

Iteration tracing is to trace the software status of a training job and the Ascend AI Software Stack, which can be used to analyze the performance of a training job. If the default two-segment gradient segmentation policy is applied, the following iteration traces of a training job are printed to describe the job execution status in an iteration: fp_start, bp_end, allreduce1_start, allreduce1_end, allreduce2_start, allreduce2_end, and Iteration_end in the training job.

An optimal gradient data splitting strategy meets the following rules:

Make AR1 hidden within the FPBP period. By default, AllReduce and forward and backward propagation proceed in serial. You can enable the parallel execution of AllReduce and forward and backward propagation by setting hcom_parallel to True. For details about the configuration method, see Adjusting Gradient Splitting Strategy.
Keep AR2 as short as possible to reduce the collective communication hangover after computation.

Based on the preceding splitting strategies, you can adjust the gradient splitting strategy to improve the training performance in distributed scenarios. The following uses two-segment gradient splitting as an example to describe how to determine a gradient splitting strategy with three optimization scenarios.

[Scenario 1] When AR1 starts early and AR2 is long, move backward the splitting point to shorten AR2.

Assume the first and second gradient segments have the data size of 50% each.

If the data size of the first gradient segment is decreased to 80%, then that of the second gradient segment is 20%.

[Scenario 2] When AR1 starts late and ends later than FPBP, move forward the splitting point to hide AR1 within FPBP.

Assume the first gradient segment has the data size of 90%, and the second gradient segment has the data size of 10%.

If the data size of the first gradient segment is decreased to 80%, then that of the second gradient segment is 20%.

[Scenario 3] You may get a long hangover when AR1 has most gradient data, especially when FPBP is time-consuming and data-hungry in two-segment gradient splitting. In this case, refer to scenario 2. If AR2 is long because it takes most gradient data, see scenario 1. You may choose to add more segments within FPBP, which is quite long, for better parallelism.

Adjusting Gradient Splitting Strategy

You can call the gradient splitting API in the training script to set the AllReduce splitting and fusion strategy in the backward propagation phase. Select either of the following APIs:

set_split_strategy_by_idx: sets the gradient splitting strategy in the collective communication group based on the gradient index.

from hccl.split.api import set_split_strategy_by_idx  
set_split_strategy_by_idx([20, 100, 159])

set_split_strategy_by_size: sets the gradient splitting strategy in the collective communication group by percent.

from hccl.split.api import set_split_strategy_by_size  
set_split_strategy_by_size([60, 20, 20])

Parent topic: Additional Features