AllGatherVOperation

Product Support

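Hardware model	Supported
Atlas A3 inference series products/Atlas A3 training series products	x
Atlas A2 training series products/Atlas 800I A2 inference products	√
Atlas training series products	x
Atlas inference series products	√
Atlas 200I/500 A2 inference products	x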

Function Description

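Aggregates the data held on multiple communication cards along the first dimension, in order of communication rank, and then sends the result to every card. Each card may send a different amount of data.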

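In inference scenarios the batch size may not be divisible by the DP degree (the number of data-parallel ranks). The compute operators that follow reducescatter must process data along the batch dimension, and the processed data is then collected with allgather, as shown in Figure 1.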

Figure 1 Operator context diagram
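To make the uneven case concrete, the sketch below splits a batch across DP ranks when the batch size is not divisible by the DP degree. The helper name `split_batch` and the policy of giving the remainder rows to the lowest ranks are illustrative assumptions, not something the library prescribes.

```python
def split_batch(batch_size, dp_size):
    """Return the number of batch rows each DP rank handles.

    Assumption: the first (batch_size % dp_size) ranks take one extra row.
    """
    if dp_size <= 0:
        raise ValueError("dp_size must be positive")
    base, extra = divmod(batch_size, dp_size)
    return [base + 1 if r < extra else base for r in range(dp_size)]


rows_per_rank = split_batch(batch_size=5, dp_size=2)
print(rows_per_rank)  # [3, 2]
```

Because the per-rank row counts differ, a fixed-size all-gather cannot reassemble the batch directly, which is the case AllGatherV addresses.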

Example:

Figure 2 Computation process diagram

Computation process sketch (Python):

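import numpy as np
import torch

# Compute the golden tensor. tensorafters (presumably per-rank flattened data as lists), sendcount, dim,
# inTensorDtype and sum (the total first-dimension size; it shadows the builtin) come from the caller.
gold_outtensor = []
for i in range(len(sendcount)):
    gold_outtensor = gold_outtensor + tensorafters[i][0:sendcount[i]]
padded = gold_outtensor + [0] * (sum * dim[1] - len(gold_outtensor))  # zero-pad the tail
GoldenTensors = torch.tensor(np.array(padded).reshape(sum, dim[1]), dtype=inTensorDtype)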

Usage Scenarios

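Model parallelism: the forward pass requires fully synchronized parameters, so AllGatherV is used to gather the parameters that were split across different XPUs onto one XPU before the forward computation can proceed.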

>>> rank0 input
tensor([[0,1,2,3],
        [4,5,6,7]], device='npu:0')  shape[2,4]
>>> rank0 sendcount
tensor([4], device='npu:0')  shape[1]
>>> rank1 input
tensor([[3,2,1,0],
        [7,6,5,4],
        [7,6,5,4]], device='npu:1')  shape[3,4]
>>> rank1 sendcount
tensor([2], device='npu:1')  shape[1]
>>> recvout=tensor([4,2])
>>> recvdis=tensor([0,4])
>>> y=tensor([0,1,2,3,4], device='npu:0')
>>> rank0 output
tensor([[0,1,2,3],
        [3,2,0,0],
        [0,0,0,0],
        [0,0,0,0],
        [0,0,0,0]], device='npu:0')  shape[5,4]
>>> rank1 output
tensor([[0,1,2,3],
        [3,2,0,0],
        [0,0,0,0],
        [0,0,0,0],
        [0,0,0,0]], device='npu:1')  shape[5,4]
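The example above can be reproduced on the host with NumPy. This is a minimal sketch of the data movement only. It assumes, as the example output suggests, that counts and displacements are measured in elements of the flattened tensor and that unfilled output positions are zero. `all_gather_v_reference` is a hypothetical helper, not an ATB API.

```python
import numpy as np


def all_gather_v_reference(inputs, recv_counts, rdispls, out_rows):
    """Host-side reference for the example: every rank gets the same output."""
    cols = inputs[0].shape[1]
    flat_out = np.zeros(out_rows * cols, dtype=inputs[0].dtype)
    for rank, x in enumerate(inputs):
        count, offset = recv_counts[rank], rdispls[rank]
        # Copy the first `count` elements of this rank's flattened input.
        flat_out[offset:offset + count] = x.reshape(-1)[:count]
    return flat_out.reshape(out_rows, cols)


rank0 = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])
rank1 = np.array([[3, 2, 1, 0], [7, 6, 5, 4], [7, 6, 5, 4]])
out = all_gather_v_reference([rank0, rank1], recv_counts=[4, 2],
                             rdispls=[0, 4], out_rows=5)
print(out)
```

Here `out_rows=5` plays the role of y: it is the sum of the first-dimension sizes of the two inputs (2 + 3).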

Definition

struct AllGatherVParam {
    int rank = -1;
    int rankSize = 0;
    int rankRoot = 0;
    std::string backend = "hccl";
    HcclComm hcclComm = nullptr;
    CommMode commMode = COMM_MULTI_PROCESS;
    std::string rankTableFile;
    std::string commDomain;
    uint8_t rsv[64] = {0};
};

Parameter List

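Member	Type	Default	Description
rank	int	-1	Communication rank of the current card. -1 means not set.
rankSize	int	0	Number of cards participating in communication. Must not be 0.
rankRoot	int	0	Root communication rank.
backend	std::string	"hccl"	Communication backend type. Only "hccl" is supported.
hcclComm	HcclComm	nullptr	Address pointer obtained from the HCCL communication-domain interface. Defaults to null, in which case the acceleration library creates the communication domain for the user. To manage the communication domain yourself, pass its pointer here and the acceleration library uses it to run communication operators.
commMode	CommMode	COMM_MULTI_PROCESS	Communication mode, a CommMode enum value. HCCL multi-threading only supports an externally supplied communication domain. COMM_UNDEFINED: undefined. COMM_MULTI_PROCESS: multi-process communication. COMM_MULTI_THREAD: multi-thread communication.
rankTableFile	std::string	-	Path to the cluster information configuration file.
commDomain	std::string	-	Communication domain name that identifies the communication device group. Used when there are multiple communication domains. Currently only hccl is supported.
rsv[64]	uint8_t	{0}	Reserved.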

Inputs

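Parameter	Dimensions	Data type	Format	Required	Description
x	[dim_0, dim_1, ..., dim_n]	"hccl": float16/int8/bfloat16	ND	Yes	Input tensor.
sendCount	1-D, [1]	int64	ND	Yes	Input tensor. Amount of data sent by this card. The value may differ from card to card, which is how unequal data lengths are supported.
recvCounts	1-D, [rankSize]	int64	ND	Yes	Input tensor. recvCounts[i] is the amount of data received from rank i. Identical on every card.
rdispls	1-D, [rankSize]	int64	ND	Yes	Offsets of the data received from each rank. Identical on every card. rdispls[i] = n means this rank starts receiving the data of rank i at offset n relative to the starting position of the input.
y	1-D, [sum of the first-dimension sizes of all cards]	float16	ND	Yes	Its shape is the sum of the first-dimension sizes of the tensors to be merged from all cards. Used for shape inference (infer shape).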
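The relationship between the auxiliary inputs can be sketched as follows. The helper `build_inputs` is hypothetical. It assumes rdispls is the exclusive prefix sum of recvCounts (consistent with the example values recvCounts [4, 2] and rdispls [0, 4], although the operator may not require tightly packed offsets) and that only the shape of y matters, not its contents.

```python
import numpy as np


def build_inputs(rank, send_counts, first_dims):
    """Build sendCount, recvCounts, rdispls and y for one rank.

    send_counts[i]: amount of data rank i sends.
    first_dims[i]:  size of dim 0 of rank i's input tensor x.
    """
    recv_counts = np.asarray(send_counts, dtype=np.int64)
    # Assumed layout: each rank's data starts where the previous rank's ends.
    rdispls = np.concatenate(([0], np.cumsum(recv_counts)[:-1])).astype(np.int64)
    send_count = recv_counts[rank:rank + 1]  # shape [1]
    # y only carries a shape: one entry per output row, used for shape inference.
    y = np.zeros(int(sum(first_dims)), dtype=np.float16)
    return send_count, recv_counts, rdispls, y


send_count, recv_counts, rdispls, y = build_inputs(
    rank=1, send_counts=[4, 2], first_dims=[2, 3])
print(send_count, recv_counts, rdispls, y.shape)
```

recvCounts and rdispls come out identical on every rank, while sendCount is the one entry of recvCounts that belongs to the calling rank.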

Outputs

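Parameter	Dimensions	Data type	Format	Required	Description
output	[n, dim_1, ..., dim_n]	"hccl": float16/int8/bfloat16	ND	No	Output tensor. Its address differs from that of the input tensor (the write is not in place). n is the sum of the first-dimension sizes of the tensors to be merged from all cards, that is, the shape of y. The data type matches the input.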

Constraints

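backend must be hccl.
The array parameters recvCounts and rdispls must have a length equal to rankSize.
On the card with rank index rank, sendCount must equal recvCounts[rank].
The first dimension of the output tensor must equal the shape of y.
Supported only on Atlas inference series products and Atlas 800I A2 inference products.
The size of dimension 0 of the input tensor may differ across communication cards; all other dimensions must be identical.
The sum of the recvCounts elements must not overflow int64 and must be greater than 0.
sendCount must not exceed the product of all dimensions of the input tensor x.
rank, rankSize, and rankRoot must satisfy the following conditions:
- 0 ≤ rank < rankSize
- 0 ≤ rankRoot < rankSize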
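
The constraints above can be checked on the host before the operator is created. The function below is an illustrative sketch; `check_all_gather_v_args` and its argument names are hypothetical and not part of the ATB API.

```python
import math


def check_all_gather_v_args(rank, rank_size, rank_root, backend,
                            x_shape, send_count, recv_counts, rdispls):
    """Raise ValueError if any documented constraint is violated."""
    if backend != "hccl":
        raise ValueError("backend must be 'hccl'")
    if not 0 <= rank < rank_size:
        raise ValueError("rank must satisfy 0 <= rank < rankSize")
    if not 0 <= rank_root < rank_size:
        raise ValueError("rankRoot must satisfy 0 <= rankRoot < rankSize")
    if len(recv_counts) != rank_size or len(rdispls) != rank_size:
        raise ValueError("recvCounts and rdispls must have rankSize entries")
    if send_count != recv_counts[rank]:
        raise ValueError("sendCount must equal recvCounts[rank]")
    total = sum(recv_counts)
    if not 0 < total <= 2**63 - 1:
        raise ValueError("sum(recvCounts) must be positive and fit in int64")
    if send_count > math.prod(x_shape):
        raise ValueError("sendCount exceeds the number of elements in x")


# Values taken from rank 1 of the example: passes silently.
check_all_gather_v_args(rank=1, rank_size=2, rank_root=0, backend="hccl",
                        x_shape=(3, 4), send_count=2,
                        recv_counts=[4, 2], rdispls=[0, 4])
```

Failing early on the host gives a readable message for each violated rule instead of an error raised during operator setup or execution.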

Parent topic: atb/infer_op_params.h