Adjusting Gradient Segmentation Policy
Background
In distributed training, gradient aggregation is performed after the gradients are computed on each device. Gradient data is generated in order and does not change once generated. To improve training performance, the gradient data can be segmented: aggregation of a segment starts as soon as its gradient data is generated, so that gradient aggregation overlaps with the forward and backward computation.
The default segmentation policy uses two segments: the first takes 96.54% of the data volume and the second takes 3.46% (in some cases, the data is not segmented). This default may not suit other networks, because the gradient data volume and computation time differ from network to network. You can adjust the distributed gradient segmentation policy by referring to this section to improve training performance in distributed scenarios.
Determining Gradient Segmentation Policy
You need to use the Profiling tool to analyze the iteration traces of the training process to determine the gradient segmentation policy and improve the training performance in distributed scenarios.
Iteration tracing records the software status of a training job and the Ascend AI Software Stack, and can be used to analyze the performance of a training job. With the default two-segment gradient segmentation policy, the following iteration trace points describe the job execution status within one iteration: fp_start, bp_end, allreduce1_start, allreduce1_end, allreduce2_start, allreduce2_end, and Iteration_end.
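As a minimal sketch, the trace points can be mapped to the quantities used in the rest of this section: the forward/backward (FPBP) period, the two AllReduce intervals (AR1, AR2), and the post-computation hangover. The timestamps below are hypothetical; real values come from the Profiling output.

```python
# Hypothetical trace timestamps (ms) for one iteration; in practice these
# come from the Profiling tool's iteration trace.
trace = {
    "fp_start": 0.0,
    "bp_end": 90.0,
    "allreduce1_start": 60.0,
    "allreduce1_end": 85.0,
    "allreduce2_start": 90.0,
    "allreduce2_end": 96.0,
    "Iteration_end": 97.0,
}

fpbp = trace["bp_end"] - trace["fp_start"]                  # forward + backward time
ar1 = trace["allreduce1_end"] - trace["allreduce1_start"]   # first AllReduce
ar2 = trace["allreduce2_end"] - trace["allreduce2_start"]   # second AllReduce
# AR1 is "hidden" if it finishes before backward propagation ends.
ar1_hidden = trace["allreduce1_end"] <= trace["bp_end"]
# Hangover: communication time left over after computation finishes.
hangover = trace["Iteration_end"] - trace["bp_end"]

print(fpbp, ar1, ar2, ar1_hidden, hangover)
```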

An optimal gradient data segmentation policy meets the following rules:
- Make AR1 hidden within the FPBP period. By default, AllReduce and forward and backward propagation run serially. You can enable their parallel execution by setting hcom_parallel to True. For details, see Adjusting Gradient Segmentation Policy.
- Keep AR2 as short as possible to reduce the collective communication hangover after computation.
Based on the preceding segmentation rules, you can adjust the gradient segmentation policy to improve the training performance in distributed scenarios. The following uses two-segment gradient segmentation as an example to describe how to determine a gradient segmentation policy with three optimization scenarios.
[Scenario 1] When AR1 starts early and AR2 is long, move the segmentation point backward to shorten AR2.
For example, if the two gradient segments have the same data volume, the segmentation diagram is as follows.

If the data volume of the first gradient segment is increased to 80%, the segmentation diagram is as follows.

[Scenario 2] When AR1 starts late and ends later than FPBP, move the segmentation point forward to hide AR1 within FPBP.
If the data volume of the first gradient segment is 90%, the segmentation diagram is as follows.

If the data volume of the first gradient segment is decreased to 80%, the segmentation diagram is as follows.

[Scenario 3] In two-segment gradient segmentation, a long hangover can occur when AR1 carries most of the gradient data, especially when FPBP is time-consuming and the data volume is large. In this case, refer to Scenario 2. If AR2 is long because it carries most of the gradient data, refer to Scenario 1. When FPBP is long, you can also add more segments within it for better parallelism.
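The effect of adding segments can be illustrated with a toy model (not the real profiler), assuming each segment's AllReduce can start only after its share of backward propagation has produced the gradients, and that communication time is proportional to segment size:

```python
def _cumsum(xs):
    total, out = 0, []
    for x in xs:
        total += x
        out.append(total)
    return out

# Toy model: estimate the post-computation hangover for a given split.
def hangover(fpbp_ms, comm_ms, split_percent):
    t_comm_free = 0.0  # time at which the communication link becomes free
    for done, share in zip(_cumsum(split_percent), split_percent):
        ready = fpbp_ms * done / 100.0   # gradients of this segment are ready
        start = max(ready, t_comm_free)  # wait for the link if it is busy
        t_comm_free = start + comm_ms * share / 100.0
    return max(0.0, t_comm_free - fpbp_ms)  # tail after computation ends

# With 100 ms of computation and 30 ms of total communication,
# more segments shorten the tail:
print(hangover(100, 30, [100]))         # one segment: the full 30 ms tail
print(hangover(100, 30, [80, 20]))      # two segments
print(hangover(100, 30, [60, 20, 20]))  # three segments
```

This is only a first-order sketch; real AllReduce time is not strictly proportional to data volume, which is why the Profiling iteration traces should drive the final choice.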

Adjusting Gradient Segmentation Policy
You can use the gradient segmentation API in the training script to set the AllReduce segmentation and fusion policy in the backward propagation phase.
set_split_strategy_by_idx: sets the gradient segmentation policy in the collective communication group based on the gradient index ID.
```python
from hccl.split.api import set_split_strategy_by_idx
set_split_strategy_by_idx([20, 100, 159])
```
set_split_strategy_by_size: sets the gradient segmentation policy in the collective communication group based on the data volume percentage.
```python
from hccl.split.api import set_split_strategy_by_size
set_split_strategy_by_size([60, 20, 20])
```
For the detailed API description of set_split_strategy_by_idx and set_split_strategy_by_size, see HCCL API (Python).
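If you have tuned a size-based split but want an index-based policy, the two views can be bridged. The helper below is purely illustrative (the gradient element counts are hypothetical): it finds, for each segment, the index of the last gradient whose cumulative size reaches the segment boundary, which matches the shape of the [20, 100, 159] example above.

```python
# Sketch: translate a by-size split (percentages) into by-idx split points,
# given per-gradient element counts (hypothetical values).
def split_indices(grad_sizes, percents):
    total = sum(grad_sizes)
    # Cumulative size target for the end of each segment.
    targets, acc = [], 0
    for p in percents:
        acc += p
        targets.append(total * acc / 100.0)
    indices, running, t = [], 0, 0
    for i, size in enumerate(grad_sizes):
        running += size
        if t < len(targets) and running >= targets[t]:
            indices.append(i)  # last gradient index of segment t
            t += 1
    return indices

# 10 gradients of equal size; an [80, 20] split ends segments at index 7 and 9.
print(split_indices([100] * 10, [80, 20]))  # [7, 9]
```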
```python
import tensorflow as tf
from npu_bridge.npu_init import *

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# Enable iteration tracing.
custom_op.parameter_map["profiling_mode"].b = True
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes('{"output":"/home/HwHiAiUser/output","task_trace":"on","training_trace":"on","fp_point":"","bp_point":""}')
# Enable the parallel execution of AllReduce and forward and backward propagation.
custom_op.parameter_map["hcom_parallel"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    # Initialize collective communication.
    sess.run(npu_init)
    # Set the gradient segmentation policy.
    set_split_strategy_by_size([80, 20])
    # Perform AllReduce...
    # Perform training...
    sess.run(npu_shutdown)
```