alltoallvc
Description
Sends data (with the customized data size) to all ranks in the collective communicator and receives data from all ranks.
alltoallvc passes the TX and RX parameters of all ranks through the send_count_matrix argument, which gives it better performance than alltoallv.
Prototype
def all_to_all_v_c(send_data, send_count_matrix, rank, fusion=0, fusion_id=-1, group="hccl_world_group")
Parameters
| Parameter | Input/Output | Description |
|---|---|---|
| send_data | Input | Data to be sent. TensorFlow tensor type. |
| send_count_matrix | Input | TX and RX parameters of all ranks. send_count_matrix[i][j] indicates the amount of data sent from rank i to rank j, measured in units of send_data_type. For example, if send_data_type is int32 and send_count_matrix[0][1] is 1, rank 0 sends one int32 to rank 1. TensorFlow tensor type; must be of type int64. See the sketch following this table. |
| rank | Input | An int. Rank ID of the node, that is, the rank ID within the group. |
| fusion | Input | An int. alltoallvc operator fusion flag. The values are as follows: |
| fusion_id | Input | An int. alltoallvc operator fusion ID. This parameter must be configured when alltoallvc operator fusion is enabled. The value range is [0, 0x7fffffff]. |
| group | Input | Group name, which can be a user-defined value or hccl_world_group. A string of up to 128 bytes, including the end character. |
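As a concrete illustration of the send_count_matrix semantics above, the following pure-Python sketch derives, for each rank, how many elements it sends and receives. The helper names (count_elements_sent, count_elements_received) are hypothetical and exist only for this illustration; the actual data exchange is performed by the operator itself.

```python
# Hypothetical helpers illustrating the documented semantics:
# send_count_matrix[i][j] = number of send_data_type elements rank i sends to rank j.

send_count_matrix = [
    [3, 3],  # rank 0 sends 3 elements to rank 0 and 3 elements to rank 1
    [3, 3],  # rank 1 sends 3 elements to rank 0 and 3 elements to rank 1
]

def count_elements_sent(matrix, rank):
    # Total elements rank `rank` contributes: the sum of its row.
    return sum(matrix[rank])

def count_elements_received(matrix, rank):
    # Total elements rank `rank` receives: the sum of its column.
    return sum(row[rank] for row in matrix)

for r in range(len(send_count_matrix)):
    print("rank %d sends %d elements, receives %d elements"
          % (r, count_elements_sent(send_count_matrix, r),
             count_elements_received(send_count_matrix, r)))
```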
Returns
recv_data: The result tensor after the all_to_all_v_c operation is performed on the input tensor.
Constraints
- The rank that calls this API must belong to the group specified by the group argument, and the rank ID passed in must be valid and unique. Otherwise, the API call fails.
- For the Atlas Training Series products, alltoallvc communicators must meet the following requirement: 1p and 2p communicators within a single server must reside in the same cluster (devices 0–3 form one cluster and devices 4–7 form another). For 4p and 8p communicators, whether within a single server or across servers, the ranks must be selected on a per-cluster basis, and the clusters selected on each server must be consistent.
- The value of send_count_matrix on each node must be the same.
- The performance of the alltoallvc operation depends on the size of the buffer used to exchange data between NPUs. When the communication data size exceeds the buffer size, performance deteriorates significantly. If the alltoallvc communication data size in your workload is large, you are advised to increase the buffer size appropriately through the HCCL_BUFFSIZE environment variable (for example, export HCCL_BUFFSIZE=2048) to improve communication performance. A rough sizing sketch follows this list.
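As a rough aid for the buffer-sizing advice above, the sketch below estimates the worst-case per-rank traffic implied by a send_count_matrix and compares it with the configured buffer size. It assumes HCCL_BUFFSIZE is expressed in MB (as in export HCCL_BUFFSIZE=2048) and that the default is 200; treat both the unit and the threshold check as assumptions, not an exact rule.

```python
import os

# Sketch only: compare the largest per-rank traffic implied by
# send_count_matrix with the HCCL buffer size.
ELEMENT_BYTES = 4  # bytes per element, e.g. int32/float32

send_count_matrix = [[3, 3], [3, 3]]

buffer_mb = int(os.environ.get("HCCL_BUFFSIZE", "200"))  # assumed MB unit and default
buffer_bytes = buffer_mb * 1024 * 1024

worst_case = max(
    max(sum(row) for row in send_count_matrix),            # largest TX volume
    max(sum(row[j] for row in send_count_matrix)           # largest RX volume
        for j in range(len(send_count_matrix))),
) * ELEMENT_BYTES

if worst_case > buffer_bytes:
    print("Communication volume (%d bytes) exceeds HCCL_BUFFSIZE (%d bytes); "
          "consider raising HCCL_BUFFSIZE." % (worst_case, buffer_bytes))
```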
Applicability
Example
The following is only a code snippet and cannot be executed. For details about how to call the HCCL Python APIs to perform collective communication, see Sample Code.
```python
from npu_bridge.npu_init import *
import tensorflow as tf

send_data_tensor = tf.random_uniform((1, 3), minval=1, maxval=10, dtype=tf.float32)
send_counts_matrix_tensor = tf.Variable([[3, 3], [3, 3]], dtype=tf.int64)
all_to_all_v_c = hccl_ops.all_to_all_v_c(send_data_tensor, send_counts_matrix_tensor, 0)
```
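To actually execute the graph built above, the op would typically be run in a session configured for the NPU. The following is a minimal sketch assuming the common NpuOptimizer custom-optimizer configuration used with npu_bridge; see Sample Code for the authoritative end-to-end flow.

```python
import tensorflow as tf
from npu_bridge.npu_init import *
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

# Sketch only: NpuOptimizer session configuration following the common
# npu_bridge pattern; consult Sample Code for the supported flow.
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # offload the graph to the NPU
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())  # send_counts_matrix_tensor is a Variable
    recv_data = sess.run(all_to_all_v_c)         # op built in the snippet above
```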