set_split_strategy_by_size

Description

Sets the backward gradient splitting strategy for a collective communication group based on each segment's percentage of the total gradient data size. The strategy controls AllReduce fusion and is used to optimize collective communication performance.

Prototype

def set_split_strategy_by_size(dataSizeList, group="hccl_world_group")

Parameters

Parameter: dataSizeList
Input/Output: Input
Description: A list of gradient data size percentages, one entry per segment.

  • Each percentage must be non-negative, and the percentages must sum to 100.
  • A maximum of eight gradient segments is supported.
  • For example, if the model has 150 MB of gradient data that needs to be divided into three segments of 90 MB, 30 MB, and 30 MB, set dataSizeList to [60, 20, 20].

Parameter: group
Input/Output: Input
Description: A string of at most 128 bytes, including the end character. Group name, which can be a user-defined value or hccl_world_group. Defaults to hccl_world_group.
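The parameter constraints above can be checked before the API is called. The following is a minimal sketch in plain Python and is not part of HCCL: the helper names (sizes_to_percentages, validate_data_size_list, validate_group_name) are hypothetical, integer percentages are chosen only because the example above uses them, and the group length check assumes UTF-8 encoding with a one-byte end character.

```python
import math
from typing import List, Sequence

MAX_SEGMENTS = 8        # documented limit on gradient segments
MAX_GROUP_BYTES = 128   # documented limit, including the end character


def sizes_to_percentages(segment_sizes: Sequence[float]) -> List[int]:
    """Convert per-segment gradient sizes into integer percentages that sum to 100.

    Hypothetical helper. Largest-remainder rounding keeps the total at exactly 100.
    """
    if not segment_sizes:
        raise ValueError("at least one segment is required")
    if any(size < 0 for size in segment_sizes):
        raise ValueError("segment sizes must be non-negative")
    total = sum(segment_sizes)
    if total <= 0:
        raise ValueError("total gradient size must be positive")

    exact = [size * 100 / total for size in segment_sizes]
    percentages = [int(value) for value in exact]
    shortfall = 100 - sum(percentages)
    # Give the leftover points to the segments with the largest fractional parts.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - percentages[i], reverse=True
    )
    for i in by_remainder[:shortfall]:
        percentages[i] += 1
    return percentages


def validate_data_size_list(data_size_list: Sequence[float]) -> None:
    """Raise ValueError if the list breaks one of the documented constraints."""
    if not 1 <= len(data_size_list) <= MAX_SEGMENTS:
        raise ValueError(f"expected 1 to {MAX_SEGMENTS} segments")
    if any(p < 0 for p in data_size_list):
        raise ValueError("percentages must be non-negative")
    if not math.isclose(sum(data_size_list), 100, abs_tol=1e-9):
        raise ValueError("percentages must sum to 100")


def validate_group_name(group: str) -> None:
    """Raise ValueError if the name exceeds the documented byte limit.

    Assumption: UTF-8 encoding plus a one-byte end character.
    """
    if len(group.encode("utf-8")) + 1 > MAX_GROUP_BYTES:
        raise ValueError(f"group name must fit in {MAX_GROUP_BYTES} bytes")


# The 150 MB example from the parameter description: 90 MB, 30 MB, and 30 MB.
data_size_list = sizes_to_percentages([90, 30, 30])
validate_data_size_list(data_size_list)
validate_group_name("hccl_world_group")
print(data_size_list)
```

The resulting list can then be passed as dataSizeList. Checking locally only reports a malformed list earlier; it does not replace the checks performed by the API itself.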

Returns

None

Constraints

  • The calling rank must belong to the group specified by the group argument. Otherwise, the API call fails.
  • If the backward gradient splitting strategy is set by both gradient data size percentage and gradient index ID, the percentage-based setting takes precedence.
  • If you do not call the gradient splitting API to set the splitting strategy, the default backward gradient splitting strategy is used.

    Default splitting strategy: the optimal split for ResNet-50, which divides the gradients into two segments by gradient data size. The first segment holds 96.54% of the data and the second segment holds 3.46%.
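To see what a percentage list means in terms of data volume, the arithmetic can be sketched as follows. This is an illustration only: segment_sizes is a hypothetical helper, the 100 MB total is an assumed figure rather than a measured ResNet-50 value, and the code does not describe how HCCL computes segment boundaries internally.

```python
from typing import List, Sequence

# Default strategy quoted above for ResNet-50: two segments by gradient data size.
DEFAULT_RESNET50_SPLIT = [96.54, 3.46]


def segment_sizes(total_size: float, percentages: Sequence[float]) -> List[float]:
    """Return the data size of each segment, in the same unit as total_size."""
    return [total_size * p / 100 for p in percentages]


# total_mb is a hypothetical figure chosen only for illustration.
total_mb = 100.0
first, second = segment_sizes(total_mb, DEFAULT_RESNET50_SPLIT)
print(f"first segment: {first:.2f} MB, second segment: {second:.2f} MB")
```

The same helper reproduces the dataSizeList example: a 150 MB total with [60, 20, 20] yields segments of 90 MB, 30 MB, and 30 MB.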

Applicability

Atlas Training Series Product

Example

The following is only a code snippet and cannot be executed on its own. For details about how to call the HCCL Python APIs to perform collective communication, see Sample Code.

from npu_bridge.npu_init import *

# Split the gradients of the group named "group" into three segments: 60%, 20%, and 20%.
set_split_strategy_by_size([60, 20, 20], "group")