set_split_strategy_by_idx

Description

Sets a backward gradient splitting strategy for a collective communication group based on gradient index IDs to implement AllReduce fusion and optimize collective communication performance.

Prototype

def set_split_strategy_by_idx(idxList, group="hccl_world_group")

Parameters

idxList (Input)

A list. Index ID list of gradients.

  • The index IDs in the list must be non-negative integers in ascending order.
  • The gradient index IDs must be set based on the total number of gradient parameters of the model. Index IDs start from 0. The maximum value can be obtained as follows:
    • Run training without calling a gradient splitting API to set a gradient splitting strategy. In this case, the script trains with the default gradient splitting mode described in set_split_strategy_by_size.
    • After training, search for the keyword "segment result" in the training information log on the host to obtain the gradient splitting details, for example, segment index list: [0, 107] [108, 159]. The largest number in the segment sequence (159 in this example) is the maximum value of the gradient parameter index.
      NOTE:

      During training, logs may be overwritten. If this happens, you can increase LogAgentMaxFileNum in /var/log/npu/conf/slog/slog.conf to raise the number of log files that can be stored on the host. Alternatively, you can run only one iteration.

  • A maximum of eight gradient segments is supported.
  • For example, if a model has 160 gradient-generating parameters that need to be divided into three segments [0, 20], [21, 100], and [101, 159], set idxList to [20, 100, 159], that is, the last index of each segment.

group (Input)

A string. Group name, which can be a user-defined value or hccl_world_group. Defaults to hccl_world_group.
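The mapping from segment ranges to idxList can be made explicit with a small sketch. segments_to_idx_list below is a hypothetical helper, not part of the HCCL API; it converts segment ranges such as those in the idxList example into the index list this API expects and enforces the documented constraints:

```python
def segments_to_idx_list(segments):
    """Convert gradient segment ranges, e.g. [(0, 20), (21, 100), (101, 159)],
    into the idxList expected by set_split_strategy_by_idx: the last index
    of each segment, in ascending order."""
    if not 1 <= len(segments) <= 8:  # at most eight gradient segments are supported
        raise ValueError("between 1 and 8 gradient segments are supported")
    idx_list = [end for _, end in segments]
    if idx_list != sorted(idx_list) or idx_list[0] < 0:
        raise ValueError("index IDs must be non-negative and ascending")
    return idx_list

# The three segments from the example yield the idxList used below.
print(segments_to_idx_list([(0, 20), (21, 100), (101, 159)]))  # → [20, 100, 159]
```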

Returns

None

Constraints

  • The caller rank must be in the group specified by the group argument; otherwise, the API call fails.
  • If you do not call a gradient splitting API to set a splitting strategy, the default backward gradient splitting strategy is used.

    Default splitting strategy: two segments, with the first taking up 96.54% of the gradient data size and the second taking up 3.46% (in some cases, there is only one segment).
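The default percentage-based split can be illustrated with a hedged sketch. split_by_percentage is a hypothetical helper, and the real segmentation happens inside the HCCL runtime; the sketch simply walks the per-gradient data sizes and cuts the first segment once 96.54% of the total size has accumulated:

```python
def split_by_percentage(grad_sizes, first_pct=96.54):
    """Return the index at which the cumulative gradient data size first
    reaches first_pct percent of the total. Gradients up to and including
    that index form segment one; the remainder forms segment two."""
    total = sum(grad_sizes)
    threshold = total * first_pct / 100.0
    running = 0.0
    for i, size in enumerate(grad_sizes):
        running += size
        if running >= threshold:
            return i
    return len(grad_sizes) - 1

# With ten equal-size gradients the cut lands on the last gradient,
# i.e. a single segment, matching the "only one segment" case above.
print(split_by_percentage([1.0] * 10))  # → 9
```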

Applicability

Atlas Training Series Product

Example

The following is only a code snippet and cannot be executed. For details about how to call the HCCL Python APIs to perform collective communication, see Sample Code.

from npu_bridge.npu_init import *
set_split_strategy_by_idx([20, 100, 159], "group")
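The maximum gradient index described under idxList can be recovered from the host training log. A minimal sketch, assuming the log line format shown in the Parameters section (parse_max_gradient_index is a hypothetical helper, not part of the HCCL API):

```python
import re

def parse_max_gradient_index(log_line):
    """Extract the largest index from a 'segment index list' log line,
    e.g. 'segment index list: [0, 107] [108, 159]'."""
    pairs = re.findall(r"\[(\d+),\s*(\d+)\]", log_line)
    if not pairs:
        raise ValueError("no segment ranges found in log line")
    return max(int(end) for _, end in pairs)

print(parse_max_gradient_index("segment index list: [0, 107] [108, 159]"))  # → 159
```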