LayerNorm

函数功能

在深层神经网络训练过程中，前面层训练参数的更新，会引起后面层输入数据分布的变化，导致权重更新不均衡及学习效率变慢。通过采用归一化策略，将网络层输入数据收敛到[0, 1]之间，可以规范网络层输入输出数据分布，加速训练参数收敛过程，使学习效率提升更加稳定。LayerNorm是许多归一化方法中的一种。

本接口实现了对shape大小为[B，S，H]输入数据的LayerNorm归一化，其计算公式如下，其中γ为缩放系数，β为平移系数，ε为防除零的权重系数：

$\text{[math]}$

其中，如下两个参数分别代表输入在H轴的均值和方差。

$\text{[math]}$

实现原理

以float类型，ND格式，输入为inputX[B, S, H]，gamma[H]和beta[H]为例，描述LayerNorm高阶API内部算法框图，如下图所示。

图1 LayerNorm算法框图

计算过程分为如下几步，均在Vector上进行：

计算均值：Muls计算x*1/m的值，再计算累加值ReduceSum，得到均值outputMean；
计算方差：Sub计算出输入x与均值的差值，再用Mul进行平方计算，最后用Muls乘上1/m并计算累加值，得到方差outputVariance；
处理gamma和beta：通过broadcast得到BSH维度的gamma和beta；
计算输出：方差通过broadcast得到BSH维度的tensor，再依次经过Adds(outputVariance, eps), Ln, Muls, Exp，最后与(x-均值）相乘，得到的结果乘上gamma，加上beta，得到输出结果。

函数原型

由于该接口的内部实现中涉及复杂的计算，需要额外的临时空间来存储计算过程中的中间变量。临时空间大小BufferSize的获取方法：通过LayerNorm Tiling中提供的GetLayerNormMaxMinTmpSize接口获取所需最大和最小临时空间大小，最小空间可以保证功能正确，最大空间用于提升性能。

临时空间支持接口框架申请和开发者通过sharedTmpBuffer入参传入两种方式，因此LayerNorm接口的函数原型有两种：

通过sharedTmpBuffer入参传入临时空间
 template <typename T, bool isReuseSource = false>

__aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<T>& outputMean, const LocalTensor<T>& outputVariance, const LocalTensor<T>& inputX, const LocalTensor<T>& gamma, const LocalTensor<T>& beta, const LocalTensor<uint8_t>& sharedTmpBuffer, const T epsilon, LayerNormTiling& tiling)

该方式下开发者需自行申请并管理临时内存空间并管理，并在接口调用完成后，复用该部分内存，内存不会反复申请释放，灵活性较高，内存利用率也较高。

接口框架申请临时空间
 template <typename T, bool isReuseSource = false>

__aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<T>& outputMean, const LocalTensor<T>& outputVariance, const LocalTensor<T>& inputX, const LocalTensor<T>& gamma, const LocalTensor<T>& beta, const T epsilon, LayerNormTiling& tiling)

该方式下开发者无需申请，但是需要预留临时空间的大小。

参数说明

表1 模板参数说明
参数名	描述
T	操作数的数据类型。
isReuseSource	是否允许修改源操作数，默认值为false。如果开发者允许源操作数被改写，可以使能该参数，使能后能够节省部分内存空间。设置为true，则本接口内部计算时复用inputX的内存空间，节省内存空间；设置为false，则本接口内部计算时不复用inputX的内存空间，本接口内部计算时会额外申请临时内存，接口调用完成后，自动释放临时内存。对于float数据类型输入支持开启该参数，half数据类型输入不支持开启该参数。

表2 接口参数说明
参数名称	输入/输出	含义
output	输出	目的操作数，类型为LocalTensor，shape为[B, S, H]，LocalTensor数据结构的定义请参考LocalTensor。 Atlas A2训练系列产品，支持的数据类型为：half/float Atlas推理系列产品AI Core,支持的数据类型为：half/float
outputMean	输出	均值，类型为LocalTensor，shape为[B, S]，LocalTensor数据结构的定义请参考LocalTensor。 Atlas A2训练系列产品，支持的数据类型为：half/float Atlas推理系列产品AI Core,支持的数据类型为：half/float
outputVariance	输出	方差，类型为LocalTensor，shape为[B, S]，LocalTensor数据结构的定义请参考LocalTensor。 Atlas A2训练系列产品，支持的数据类型为：half/float Atlas推理系列产品AI Core,支持的数据类型为：half/float
inputX	输入	源操作数，类型为LocalTensor，shape为[B, S, H]，LocalTensor数据结构的定义请参考LocalTensor。inputX的数据类型需要与目的操作数保持一致，尾轴长度需要32B对齐。 Atlas A2训练系列产品，支持的数据类型为：half/float Atlas推理系列产品AI Core,支持的数据类型为：half/float
gamma	输入	缩放系数，类型为LocalTensor，shape为[H]，LocalTensor数据结构的定义请参考LocalTensor。gamm的数据类型需要与目的操作数保持一致，尾轴长度需要32B对齐。 Atlas A2训练系列产品，支持的数据类型为：half/float Atlas推理系列产品AI Core,支持的数据类型为：half/float
beta	输入	平移系数，类型为LocalTensor，shape为[H]，LocalTensor数据结构的定义请参考LocalTensor。beta的数据类型需要与目的操作数保持一致，尾轴长度需要32B对齐。 Atlas A2训练系列产品，支持的数据类型为：half/float Atlas推理系列产品AI Core,支持的数据类型为：half/float
sharedTmpBuffer	输入	共享缓冲区，用于存放API内部计算产生的临时数据。该方式开发者可以自行管理sharedTmpBuffer内存空间，并在接口调用完成后，复用该部分内存，内存不会反复申请释放，灵活性较高，内存利用率也较高。共享缓冲区大小的获取方式请参考LayerNorm Tiling。类型为LocalTensor，支持的TPosition为VECIN/VECCALC/VECOUT。 Atlas A2训练系列产品，支持的数据类型为：uint8_t Atlas推理系列产品AI Core，支持的数据类型为：uint8_t
epsilon	输入	防除零的权重系数。
tiling	输入	LayerNorm计算所需Tiling信息，Tiling信息的获取请参考LayerNorm Tiling。

返回值

无

支持的型号

Atlas A2训练系列产品

Atlas推理系列产品AI Core

约束说明

操作数地址偏移对齐要求请参见通用约束。

src和dst的Tensor空间可以复用。
仅支持输入shape为ND格式。
输入数据不满足对齐要求时，开发者需要进行补齐，补齐的数据应设置为0，防止出现异常值从而影响网络计算。
不支持对尾轴H轴的切分。

调用示例

#include "kernel_operator.h"

namespace AscendC {
template <typename dataType, bool isReuseSource = false> class KernelLayernorm {
public:
    __aicore__ inline KernelLayernorm() {}
    __aicore__ inline void Init(GM_ADDR inputX_gm, GM_ADDR gamm_gm, GM_ADDR beta_gm, GM_ADDR output_gm,
        GM_ADDR outputMean_gm, GM_ADDR outputVariance_gm, const LayerNormTiling& tiling)
    {
        this->bLength = tiling.bLength;
        this->sLength = tiling.sLength;
        this->hLength = tiling.hLength;
        this->tiling = tiling;

        bshLength = bLength * sLength * hLength;
        bsLength = bLength * sLength;

        inputX_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataType*>(inputX_gm), bshLength);
        gamm_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataType*>(gamm_gm), hLength);
        beta_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataType*>(beta_gm), hLength);

        output_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataType*>(output_gm), bshLength);
        outputMean_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataType*>(outputMean_gm), bsLength);
        outputVariance_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataType*>(outputVariance_gm), bsLength);

        pipe.InitBuffer(inQueueX, 1, sizeof(dataType) * bshLength);
        pipe.InitBuffer(inQueueGamma, 1, sizeof(dataType) * hLength);
        pipe.InitBuffer(inQueueBeta, 1, sizeof(dataType) * hLength);
        pipe.InitBuffer(outQueue, 1, sizeof(dataType) * bshLength);
        pipe.InitBuffer(outQueueMean, 1, sizeof(dataType) * bsLength);
        pipe.InitBuffer(outQueueVariance, 1, sizeof(dataType) * bsLength);
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        LocalTensor<dataType> inputXLocal = inQueueX.AllocTensor<dataType>();
        LocalTensor<dataType> gammaLocal = inQueueGamma.AllocTensor<dataType>();
        LocalTensor<dataType> betaLocal = inQueueBeta.AllocTensor<dataType>();

        DataCopy(inputXLocal, inputX_global, bshLength);
        DataCopy(gammaLocal, gamm_global, hLength);
        DataCopy(betaLocal, beta_global, hLength);

        inQueueX.EnQue(inputXLocal);
        inQueueGamma.EnQue(gammaLocal);
        inQueueBeta.EnQue(betaLocal);
    }
    __aicore__ inline void Compute()
    {
        LocalTensor<dataType> inputXLocal = inQueueX.DeQue<dataType>();
        LocalTensor<dataType> gammaLocal = inQueueGamma.DeQue<dataType>();
        LocalTensor<dataType> betaLocal = inQueueBeta.DeQue<dataType>();

        LocalTensor<dataType> outputLocal = outQueue.AllocTensor<dataType>();
        LocalTensor<dataType> meanLocal = outQueueMean.AllocTensor<dataType>();
        LocalTensor<dataType> varianceLocal = outQueueVariance.AllocTensor<dataType>();

        LayerNorm<dataType, isReuseSource>(outputLocal, meanLocal, varianceLocal, inputXLocal, gammaLocal, betaLocal,(dataType)epsilon, tiling);

        outQueue.EnQue<dataType>(outputLocal);
        outQueueMean.EnQue<dataType>(meanLocal);
        outQueueVariance.EnQue<dataType>(varianceLocal);

        inQueueX.FreeTensor(inputXLocal);
        inQueueGamma.FreeTensor(gammaLocal);
        inQueueBeta.FreeTensor(betaLocal);
    }
    __aicore__ inline void CopyOut()
    {
        LocalTensor<dataType> outputLocal = outQueue.DeQue<dataType>();
        LocalTensor<dataType> meanLocal = outQueueMean.DeQue<dataType>();
        LocalTensor<dataType> varianceLocal = outQueueVariance.DeQue<dataType>();

        DataCopy(output_global, outputLocal, bshLength);
        DataCopy(outputMean_global, meanLocal, bsLength);
        DataCopy(outputVariance_global, varianceLocal, bsLength);

        outQueue.FreeTensor(outputLocal);
        outQueueMean.FreeTensor(meanLocal);
        outQueueVariance.FreeTensor(varianceLocal);
    }

private:
    GlobalTensor<dataType> inputX_global;
    GlobalTensor<dataType> gamm_global;
    GlobalTensor<dataType> beta_global;
    GlobalTensor<dataType> output_global;
    GlobalTensor<dataType> outputMean_global;
    GlobalTensor<dataType> outputVariance_global;

    TPipe pipe;
    TQue<QuePosition::VECIN, 1> inQueueX;
    TQue<QuePosition::VECIN, 1> inQueueGamma;
    TQue<QuePosition::VECIN, 1> inQueueBeta;
    TQue<QuePosition::VECOUT, 1> outQueue;
    TQue<QuePosition::VECOUT, 1> outQueueMean;
    TQue<QuePosition::VECOUT, 1> outQueueVariance;

    uint32_t bLength;
    uint32_t sLength;
    uint32_t hLength;
    dataType epsilon = 0.001;

    uint32_t bshLength;
    uint32_t bsLength;
	
    LayerNormTiling tiling;
};

extern "C" __global__ __aicore__ void kernel_layernorm_operator(GM_ADDR inputX_gm, GM_ADDR gamm_gm, GM_ADDR beta_gm, GM_ADDR output_gm, GM_ADDR outputMean_gm, GM_ADDR outputVariance_gm, GM_ADDR tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelLayernorm<half, false> op;
    op.Init(inputX_gm, gamm_gm, beta_gm, output_gm, outputMean_gm, outputVariance_gm, tilingData.layernormTilingData);
    op.Process();
}
}

父主题： LayerNorm