aclnnMoeTokenPermuteWithRoutingMap

Product Support

Product	Supported
Atlas A3 training series products/Atlas A3 inference series products	√
Atlas A2 training series products/Atlas 800I A2 inference products/A200I A2 Box heterogeneous components	√
Atlas 200I/500 A2 inference products	×
Atlas inference series products	×
Atlas training series products	×

Function Description

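Operator function: performs the MoE permute computation. The token-to-expert assignment is passed in as routingMap. Based on routingMap, tokens and the optional probsOptional are broadcast (one copy per selected expert) and then sorted so that entries for the same expert are grouped together.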
Formulas: tokens_num is the size of dimension 0 of routingMap, and expert_num is the size of dimension 1.

When dropAndPad is false:
$expertIndex = arange(tokens\_num).expand(expert\_num, -1)$
$sortedIndicesFirst = expertIndex.masked\_select(routingMap.T)$
$sortedIndicesOut = argsort(sortedIndicesFirst)$
$topK = numOutTokens // tokens\_num$
$outToken = topK * tokens\_num$
$permutedTokensOut[sortedIndicesOut[i]] = tokens[i // topK]$
$permuteProbsOutOptional = probsOptional.T.masked\_select(routingMap.T)$

When dropAndPad is true:
$capacity = numOutTokens // expert\_num$
$outToken = capacity * expert\_num$
$sortedIndicesOut = argsort(routingMap.T, dim=-1, descending=True, stable=True)[:, :capacity].view(-1)$
$permutedTokensOut = tokens.index\_select(0, sortedIndicesOut)$
If probsOptional is not None:
$probs\_T\_1D = probsOptional.T.view(-1)$
$indices\_dim0 = arange(expert\_num).unsqueeze(-1)$
$indices\_dim1 = sortedIndicesOut.view(expert\_num, capacity)$
$indices\_1D = (indices\_dim0 * tokens\_num + indices\_dim1).view(-1)$
$permuteProbsOutOptional = probs\_T\_1D.index\_select(0, indices\_1D)$

Sorting each row of routingMap.T in descending, stable order puts the tokens routed to that expert first, in ascending token order. The unrouted tokens follow and act as padding. Each expert then keeps its first capacity entries, so extra routed tokens are dropped and short rows are padded.
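The index bookkeeping in both modes is easier to follow with a concrete implementation. Below is a minimal CPU reference sketch in C. It is not the NPU kernel. It assumes contiguous row-major host buffers, FLOAT data, and routingMap stored as 0/1 bytes (uint8_t). It also assumes a stable argsort in the dropAndPad=false branch and the descending, stable sort described above for the drop-and-pad branch. The function names permute_no_drop and permute_drop_pad are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* CPU reference sketch (not the NPU kernel). Assumed row-major layouts:
 *   tokens [T][H] float, routing [T][E] uint8_t (0/1), probs [T][E] float or NULL.
 * Outputs: permTokens [outToken][H], permProbs [outToken] (may be NULL),
 * sortedIdx [outToken]. Both functions return outToken, or -1 on bad input. */

/* dropAndPad == false */
static int64_t permute_no_drop(const float *tokens, const uint8_t *routing,
                               const float *probs, int64_t T, int64_t H,
                               int64_t E, int64_t numOutTokens,
                               float *permTokens, float *permProbs,
                               int32_t *sortedIdx)
{
    if (T <= 0 || H <= 0 || E <= 0) return -1;
    int64_t topK = numOutTokens / T;
    int64_t *seen = calloc((size_t)T, sizeof *seen); /* copies placed per token */
    if (seen == NULL) return -1;
    int64_t j = 0; /* output slot, expert-major as in masked_select(routingMap.T) */
    for (int64_t e = 0; e < E; ++e) {
        for (int64_t t = 0; t < T; ++t) {
            if (!routing[t * E + e]) continue;
            if (seen[t] >= topK) { free(seen); return -1; } /* row has more than topK ones */
            /* Stable argsort of the slot->token list: the k-th slot holding
               token t ends up at sorted position t * topK + k. */
            sortedIdx[t * topK + seen[t]++] = (int32_t)j;
            memcpy(permTokens + j * H, tokens + t * H, (size_t)H * sizeof(float));
            if (probs != NULL && permProbs != NULL) permProbs[j] = probs[t * E + e];
            ++j;
        }
    }
    free(seen);
    return j == topK * T ? j : -1; /* every row must have exactly topK ones */
}

/* dropAndPad == true */
static int64_t permute_drop_pad(const float *tokens, const uint8_t *routing,
                                const float *probs, int64_t T, int64_t H,
                                int64_t E, int64_t numOutTokens,
                                float *permTokens, float *permProbs,
                                int32_t *sortedIdx)
{
    if (T <= 0 || H <= 0 || E <= 0) return -1;
    int64_t capacity = numOutTokens / E;
    if (capacity > T) return -1; /* each argsort row only has T entries */
    for (int64_t e = 0; e < E; ++e) {
        int64_t c = 0;
        /* Two passes emulate argsort(descending=True, stable=True) on 0/1 values:
           routed tokens (pass 1) first, then unrouted padding tokens (pass 0). */
        for (int pass = 1; pass >= 0 && c < capacity; --pass) {
            for (int64_t t = 0; t < T && c < capacity; ++t) {
                if ((routing[t * E + e] != 0) != (pass == 1)) continue;
                int64_t j = e * capacity + c++;
                sortedIdx[j] = (int32_t)t;
                memcpy(permTokens + j * H, tokens + t * H, (size_t)H * sizeof(float));
                /* probs_T_1D[e * T + t] is probs[t][e] */
                if (probs != NULL && permProbs != NULL) permProbs[j] = probs[t * E + e];
            }
        }
    }
    return capacity * E;
}
```

In the no-drop branch, the sketch fills sortedIdx directly instead of materializing sortedIndicesFirst and sorting it. This works because each token appears exactly topK times, so the stable argsort position of each copy is known in advance. A reference like this is mainly useful for checking device outputs on small inputs.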

Function Prototypes

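Each operator uses a two-phase interface. First call aclnnMoeTokenPermuteWithRoutingMapGetWorkspaceSize to obtain the workspace size required for the computation and an executor that contains the operator's computation flow. Then call aclnnMoeTokenPermuteWithRoutingMap to perform the computation.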

aclnnStatus aclnnMoeTokenPermuteWithRoutingMapGetWorkspaceSize(const aclTensor *tokens, const aclTensor *routingMap, const aclTensor *probsOptional, int64_t numOutTokens, bool dropAndPad, aclTensor *permuteTokensOut, aclTensor *permuteProbsOutOptional, aclTensor *sortedIndicesOut, uint64_t *workspaceSize, aclOpExecutor **executor)
aclnnStatus aclnnMoeTokenPermuteWithRoutingMap(void *workspace, uint64_t workspaceSize, aclOpExecutor *executor, aclrtStream stream)

aclnnMoeTokenPermuteWithRoutingMapGetWorkspaceSize

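Parameters:
- tokens (aclTensor *, compute input): Device-side aclTensor holding the input tokens (tokens in the formula). Must be a 2D tensor of shape (tokens_num, hidden_size). Supported data types: BFLOAT16, FLOAT16, FLOAT. Data format must be ND. Non-contiguous tensors are supported.
- routingMap (aclTensor *, compute input): Device-side aclTensor (routingMap in the formula) that gives the token-to-expert mapping. Must be a 2D tensor of shape (tokens_num, expert_num). Supported data types: INT8, BOOL. For INT8 the values must be 0 or 1; for BOOL they must be true or false. Data format must be ND. Non-contiguous tensors are supported. When dropAndPad is false, every row must contain exactly topK values that are true (or 1).
- probsOptional (aclTensor *, compute input): Optional device-side aclTensor (probsOptional in the formula). Must have the same number of elements as routingMap, and its data type must match tokens. If probsOptional is null, the optional output permuteProbsOutOptional is also null. Data format must be ND. Non-contiguous tensors are supported.
- numOutTokens (int64_t, compute input): numOutTokens in the formula. This is the number of valid output tokens, from which topK (when dropAndPad is false) or capacity (when dropAndPad is true) is computed.
- dropAndPad (bool, compute input): dropAndPad in the formula. Specifies whether drop-and-pad mode is enabled.
- permutedTokensOut (aclTensor *, compute output): Device-side aclTensor (permutedTokensOut in the formula) holding the tokens after they are expanded, sorted, and filtered according to the indices. Must be a 2D tensor of shape (outToken, hidden_size), where outToken is as defined in the formula. Data type must match tokens. Data format must be ND. Non-contiguous tensors are supported.
- permuteProbsOutOptional (aclTensor *, compute output): Device-side aclTensor (permuteProbsOutOptional in the formula) holding probsOptional after it is sorted and filtered according to the indices. Shape is (outToken), where outToken is as defined in the formula. Data type must match probsOptional. Data format must be ND. Non-contiguous tensors are supported.
- sortedIndicesOut (aclTensor *, compute output): Device-side aclTensor (sortedIndicesOut in the formula) holding the index mapping between permutedTokensOut and tokens. Must be a 1D tensor of shape (outToken), where outToken is as defined in the formula. Supported data type: INT32. Data format must be ND. Non-contiguous tensors are supported.
- workspaceSize (uint64_t *, output parameter): Returns the size of the workspace that the caller must allocate on the device.
- executor (aclOpExecutor **, output parameter): Returns the op executor, which contains the operator's computation flow.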
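topK and capacity come from floor division, so the output shapes depend on numOutTokens, tokens_num, expert_num, and dropAndPad together. The helper below computes the shapes a caller would use when creating the three output tensors. It is a sketch derived from the formula, not part of the aclnn API. The struct PermuteOutShapes and the function permute_out_shapes are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the output shapes implied by the formula. */
typedef struct {
    int64_t permutedTokens[2]; /* (outToken, hidden_size) */
    int64_t permuteProbs[1];   /* (outToken); only used when probsOptional is given */
    int64_t sortedIndices[1];  /* (outToken); INT32 elements */
} PermuteOutShapes;

/* Fills *out and returns outToken, or returns -1 for invalid sizes. */
static int64_t permute_out_shapes(int64_t tokensNum, int64_t hiddenSize,
                                  int64_t expertNum, int64_t numOutTokens,
                                  bool dropAndPad, PermuteOutShapes *out)
{
    if (tokensNum <= 0 || hiddenSize <= 0 || expertNum <= 0 ||
        numOutTokens < 0 || out == NULL)
        return -1;
    int64_t outToken;
    if (dropAndPad) {
        int64_t capacity = numOutTokens / expertNum; /* slots per expert */
        outToken = capacity * expertNum;
    } else {
        int64_t topK = numOutTokens / tokensNum;     /* copies per token */
        outToken = topK * tokensNum;
    }
    out->permutedTokens[0] = outToken;
    out->permutedTokens[1] = hiddenSize;
    out->permuteProbs[0] = outToken;
    out->sortedIndices[0] = outToken;
    return outToken;
}
```

For example, with tokens_num = 4, expert_num = 3, and numOutTokens = 8, outToken is 8 when dropAndPad is false (topK = 2) but 6 when it is true (capacity = 2).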
Return value:

aclnnStatus: the status code. For details, see the aclnn return code reference.


aclnnMoeTokenPermuteWithRoutingMap

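Parameters:
- workspace (void *, input parameter): Address of the workspace memory allocated on the device.
- workspaceSize (uint64_t, input parameter): Size of the workspace allocated on the device, obtained from the first-phase call aclnnMoeTokenPermuteWithRoutingMapGetWorkspaceSize.
- executor (aclOpExecutor *, input parameter): The op executor, which contains the operator's computation flow.
- stream (aclrtStream, input parameter): The stream on which the task is executed.
Return value:

Returns an aclnnStatus status code. For details, see the aclnn return code reference.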

Constraints

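tokens_num and expert_num must each be less than 16777215. When dropAndPad is false, every row of routingMap must contain the same number of true (or 1) values, and that number must be less than 512.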
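A caller may want to validate routingMap on the host before launching the operator. The sketch below checks the constraints listed above. It assumes a contiguous row-major copy of routingMap stored as 0/1 bytes. The function name check_routing_constraints and the choice to report the per-row count through topKOut are hypothetical. When dropAndPad is false, the reported count is the topK that numOutTokens // tokens_num is expected to produce.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PERMUTE_DIM_LIMIT 16777215 /* tokens_num and expert_num must be below this */
#define PERMUTE_TOPK_LIMIT 512     /* per-row count limit when dropAndPad is false */

/* Hypothetical host-side check. routing is row-major [tokensNum][expertNum], 0/1.
 * Returns true if the constraints hold; when dropAndPad is false and topKOut is
 * not NULL, also stores the common per-row count in *topKOut. */
static bool check_routing_constraints(const uint8_t *routing, int64_t tokensNum,
                                      int64_t expertNum, bool dropAndPad,
                                      int64_t *topKOut)
{
    if (tokensNum <= 0 || expertNum <= 0) return false;
    if (tokensNum >= PERMUTE_DIM_LIMIT || expertNum >= PERMUTE_DIM_LIMIT) return false;
    if (dropAndPad) return true; /* fixed per-row count only required without drop-and-pad */
    int64_t topK = -1;
    for (int64_t t = 0; t < tokensNum; ++t) {
        int64_t count = 0;
        for (int64_t e = 0; e < expertNum; ++e)
            count += routing[t * expertNum + e] != 0;
        if (topK < 0)
            topK = count;     /* first row fixes the expected count */
        else if (count != topK)
            return false;     /* all rows must have the same count */
    }
    if (topK >= PERMUTE_TOPK_LIMIT) return false;
    if (topKOut != NULL) *topKOut = topK;
    return true;
}
```

This check reads the whole routingMap on the host, so it is likely practical mainly for debugging and tests.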

Example

The sample code is for reference only. For the build and run procedure, see the guide on compiling and running samples.
