alltoallvc

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	☓
Atlas inference products	☓
Atlas training products	√

Function Description

Sends data (with the customized data size) to all ranks in the collective communicator and receives data from all ranks.

alltoallvc passes the RX and TX parameters of all ranks through the argument send_count_matrix, which outperforms alltoallv.

Function Prototype

      
           def all_to_all_v_c(send_data, send_count_matrix, rank, fusion=0, fusion_id=-1, group="hccl_world_group")

Parameters

Option	Input/Output	Description
send_data	Input	Data to be sent. TensorFlow tensor type. For the Atlas A3 training products / Atlas A3 inference products , the supported data types are int8, uint8, int16, uint16, int32, uint32, int64, uint64, float16, float32, float64, and bfp16. For the Atlas A2 training products / Atlas A2 inference products , the supported data types are int8, uint8, int16, uint16, int32, uint32, int64, uint64, float16, float32, float64, and bfp16. For the Atlas training products , the supported data types are int8, uint8, int16, uint16, int32, uint32, int64, uint64, float16, float32, and float64.
send_count_matrix	Input	TX and RX parameters of all ranks. send_count_matrix[i][j] indicates the amount of data sent from rank i to rank j. The basic unit is the number of bytes of send_data_type. For example, if send_data_type is set to int32 and send_count_matrix[0][1] is set to 1, then rank 0 sends one int32 to rank 1. TensorFlow tensor type, with the data type of int64.
rank	Input	Int type. Rank ID in the group.
fusion	Input	Int type. alltoallvc operator fusion flag. The values are as follows: 0: The alltoallvc operator is not fused with other alltoallvc operators during network compilation. 2: alltoallvc operators with the same fusion_id are fused during network compilation. Note: The prerequisite for alltoallvc operator fusion is that the alltoallvc operators with the same fusion_id must be in the same communicator and the types of their sent data must be the same.
fusion_id	Input	alltoallvc operator fusion ID. Int type. This parameter needs to be configured when alltoallvc operator fusion is enabled. The value range is [0, 0x7fffffff].
group	Input	Group name, which can be a user-defined value or hccl_world_group. A string containing a maximum of 128 bytes, including the end character.

Returns

The result tensor

Constraints

The rank that calls this API must be within the range defined by the argument group of the current API. The entered rank ID must be valid and unique. Otherwise, the API call fails.
For the Atlas training products , the alltoallvc communicators must meet the following requirement:
The communicators of 1p and 2p in a single server must be in the same cluster (with devices 0–3 and devices 4–7 each belonging to a separate cluster). In the communicators of 4p and 8p in a single server and multiple servers, the ranks must be based on the clusters, and the selected clusters in servers must be consistent.
The value of send_count_matrix on each node must be the same.
The performance of the alltoallvc operation is related to the size of the buffer for storing shared data between NPUs. When the communication data size exceeds the buffer size, the performance deteriorates significantly. If the amount of alltoallvc communication data is large, you are advised to increase the buffer size by setting the environment variable HCCL_BUFFSIZE to improve the communication performance. For details about HCCL_BUFFSIZE, see Environment Variables.
For the Atlas training products , if a single server is used, the NIC must be in the up state. Otherwise, this API fails to be executed.

Example

      
           from npu_bridge.hccl import hccl_ops
send_data_tensor = tf.random_uniform((1, 3), minval=1, maxval=10, dtype=tf.float32)
send_counts_matrix_tensor = tf.Variable( [[3,3],[3,3]], dtype=tf.int64)
all_to_all_v_c = hccl_ops.all_to_all_v_c(send_data_tensor, send_counts_matrix_tensor, 0)

Parent topic: npu_bridge.hccl.hccl_ops