reduce_scatter

Description

Performs a reduction operation (sum by default, or another supported reduction) across the input tensors of all ranks, and then scatters the reduced result evenly to the output buffers of the ranks in rank-ID order. Each rank receives a 1/rank_size portion of the reduced data.
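The semantics above can be illustrated with a minimal pure-Python sketch (a simulation of the reduce-scatter data flow, not the HCCL implementation): each of two simulated ranks contributes a 4 x 3 tensor, the element-wise sum is computed, and the result is split along the first dimension so every rank keeps one chunk.

```python
rank_size = 2
# Rank r holds a 4 x 3 "tensor" filled with the value r + 1.
inputs = [[[float(r + 1)] * 3 for _ in range(4)] for r in range(rank_size)]

# Step 1: element-wise sum across all ranks (the "reduce" phase).
reduced = [[sum(inp[i][j] for inp in inputs) for j in range(3)]
           for i in range(4)]

# Step 2: split evenly along the first dimension (the "scatter" phase);
# rank r receives rows [r * rows_per_rank, (r + 1) * rows_per_rank).
rows_per_rank = len(reduced) // rank_size
outputs = [reduced[r * rows_per_rank:(r + 1) * rows_per_rank]
           for r in range(rank_size)]

for r, chunk in enumerate(outputs):
    print(f"rank {r} receives {len(chunk)} rows: {chunk}")
```

Note that the first dimension (4) is an integer multiple of rank_size (2), matching the constraint on the tensor parameter below.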

Prototype

def reduce_scatter(tensor, reduction, rank_size, group="hccl_world_group", fusion=0, fusion_id=-1)

Parameters

Parameter

Input/Output

Description

tensor

Input

TensorFlow tensor type.

Atlas Training Series Product: The supported data types are int8, int32, int64, float16, and float32.

Note that the size of the first dimension of a tensor must be an integer multiple of the rank size.

reduction

Input

A string.

Reduction operation types, which can be max, min, prod, and sum.

rank_size

Input

An int.

Number of devices in a group.

Maximum value: 32768.

group

Input

A string containing a maximum of 128 bytes, including the terminating character.

Group name, which can be a user-defined value or hccl_world_group.

fusion

Input

An int.

ReduceScatter operator fusion flag. The values are as follows:

  • 0: The ReduceScatter operator is not fused with other ReduceScatter operators during network compilation.
  • 2: ReduceScatter operators with the same fusion_id are fused during network compilation.

fusion_id

Input

An int.

ReduceScatter operator fusion ID.

If fusion is set to 2, ReduceScatter operators with the same fusion_id are fused during network compilation.
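The parameter rules in the table above can be captured as pre-flight checks. The following sketch is illustrative only; the helper name check_reduce_scatter_args and the validation logic are assumptions for this example, not part of the HCCL API.

```python
# Hypothetical argument validation mirroring the documented parameter rules.
VALID_REDUCTIONS = {"max", "min", "prod", "sum"}
MAX_RANK_SIZE = 32768      # documented maximum number of devices in a group
MAX_GROUP_BYTES = 128      # group name limit, including the terminating character

def check_reduce_scatter_args(first_dim, reduction, rank_size,
                              group="hccl_world_group"):
    if reduction not in VALID_REDUCTIONS:
        raise ValueError(f"reduction must be one of {sorted(VALID_REDUCTIONS)}")
    if not 0 < rank_size <= MAX_RANK_SIZE:
        raise ValueError(f"rank_size must be in (0, {MAX_RANK_SIZE}]")
    if first_dim % rank_size != 0:
        raise ValueError("first dimension must be an integer multiple of rank_size")
    if len(group.encode()) + 1 > MAX_GROUP_BYTES:  # +1 for the terminator
        raise ValueError("group name exceeds 128 bytes including the terminator")

check_reduce_scatter_args(first_dim=4, reduction="sum", rank_size=2)  # passes
```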

Returns

The result tensor. For best performance, the result tensor size should be 32-byte aligned; otherwise, performance deteriorates.
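A quick way to check this recommendation: compute the per-rank chunk size in bytes and test it against the 32-byte boundary. This helper is illustrative (not an HCCL API) and assumes float32 elements of 4 bytes each.

```python
def chunk_bytes(total_elements, rank_size, dtype_bytes=4):
    """Size in bytes of the chunk each rank receives (float32 assumed)."""
    return total_elements // rank_size * dtype_bytes

def is_32_byte_aligned(total_elements, rank_size, dtype_bytes=4):
    """True if each rank's output chunk size is a multiple of 32 bytes."""
    return chunk_bytes(total_elements, rank_size, dtype_bytes) % 32 == 0

# A (16, 8) float32 tensor split over 2 ranks:
# 64 elements per rank * 4 bytes = 256 bytes, which is 32-byte aligned.
print(is_32_byte_aligned(16 * 8, 2))  # True
```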

Constraints

  • The calling rank must be a member of the group specified by the group parameter. Otherwise, the API call fails.
  • The input tensor size must be less than or equal to 8 GB.

Applicability

Atlas Training Series Product

Example

The following is only a code snippet and cannot be executed. For details about how to call the HCCL Python APIs to perform collective communication, see Sample Code.

from npu_bridge.npu_init import *
# The first dimension of the input (2) is an integer multiple of rank_size (2).
tensor = tf.random_uniform((2, 3), minval=1, maxval=10, dtype=tf.float32)
rank_size = 2
# Sum-reduce across both ranks, then scatter; each rank receives a (1, 3) chunk.
result = hccl_ops.reduce_scatter(tensor, "sum", rank_size)