LayerNorm

Function Usage

This section describes the following two LayerNorm APIs based on the API output:

  • Output the normalization result, mean, and variance.

    During the training process of a deep neural network, updating the training parameters in the earlier layers can cause changes in the input data distribution for subsequent layers, resulting in unbalanced weight updates and reduced learning efficiency. Implementing the normalization policy to scale the input data of network layers to the [0, 1] range standardizes the distributions of both input and output data across network layers. This expedites the convergence of training parameters and ensures more stable improvements in learning efficiency. LayerNorm is one of many normalization methods.

    This API implements LayerNorm normalization for input data with a shape of [B, S, H]. The calculation formula is as follows, where γ is the scaling coefficient, β is the translation coefficient, and ε is the weight coefficient for preventing division by zero:

    The following two parameters respectively represent the mean and variance of the input on the H axis:

  • Output the normalization result, mean, and reciprocal of the standard deviation.

    This API implements LayerNorm normalization for input data with a shape of [A, R]. The calculation formula is as follows, where γ is the scaling coefficient, β is the translation coefficient, and ε is the weight coefficient for preventing division by zero:

    The following three parameters respectively represent the mean, variance, and reciprocal of the standard deviation of the input on the R axis:

Principles

  • Output the normalization result, mean, and variance.

    The figure below illustrates the internal algorithm block diagram of LayerNorm high-level APIs, taking the float type, ND format, and inputs inputX[B, S, H], gamma[H], and beta[H] as examples.

    Figure 1 LayerNorm algorithm block diagram

    The computation process is divided into the following steps, all of which are performed on vectors:

    1. Calculate the mean: Muls calculates the value of x*1/m, and then calculates the accumulated value ReduceSum to obtain the mean outputMean.
    2. Calculate the variance: Sub calculates the difference between input x and the mean, uses Mul to square the difference, multiplies Muls by 1/m, and calculates the accumulated value to obtain the variance outputVariance.
    3. Process gamma and beta: Obtain gamma and beta in the BSH dimension by broadcasting.
    4. Calculate the output: Broadcast the variance to obtain the BSH-dimension tensor, which passes through Adds(outputVariance, eps), Ln, Muls, and Exp in sequence and is then multiplied by (x – mean). The obtained result is multiplied by gamma and added with beta to obtain the output result.
  • Output the normalization result, mean, and reciprocal of the standard deviation.

    The figure below illustrates the internal algorithm block diagram of LayerNorm high-level APIs, taking the float type, ND format, and inputs inputX[A, R], gamma[R], and beta[R] as examples.

    Figure 2 LayerNorm-Rstd algorithm block diagram

    The computation process is divided into the following steps, all of which are performed on vectors with the A axis being considered as the outermost loop.

    1. Calculate the mean: Use the dichotomy accumulation method to multiply each element of x by 1/(2k + m) to prevent overflow of subsequent accumulations. Then, split the data into an entire block and a tail block, where the entire block includes 2k elements, and the tail block includes m elements. Add the data of the tail block to the data of the entire block. For ease of description, VL is defined as the number of elements participating in a single computation. ReduceSum calculates per VL of data in the entire block, Vadd calculates odd-numbered and even-numbered bits per VL to obtain a result per VL, ReduceSum calculates based on the result, and Vmuls multiplies the result by (2k + m)/2k to obtain the output mean.
    2. Calculate rstd: Sub calculates the difference between input x and the mean, and Mul squares the difference. To prevent overflow, use the same dichotomy accumulation method to calculate the variance Variance of the squared result. Add the variance to the coefficient for preventing division by zero ε, and calculate the output rstd by using Rsqrt.
    3. Calculate the output: Sub calculates the difference between input x and the mean. Multiply the difference by rstd and gamma, and add the obtained result with beta to obtain the output result.

Prototype

Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required maximum and minimum temporary space sizes using the GetLayerNormMaxMinTmpSize API provided in LayerNorm Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.

The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the LayerNorm API.

  • Output the normalization result, mean, and variance.
    • Pass the temporary space through the sharedTmpBuffer input parameter.
      1
      2
      template <typename T, bool isReuseSource = false>
      __aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<T>& outputMean, const LocalTensor<T>& outputVariance, const LocalTensor<T>& inputX, const LocalTensor<T>& gamma, const LocalTensor<T>& beta, const LocalTensor<uint8_t>& sharedTmpBuffer, const T epsilon, LayerNormTiling& tiling)
      

      This method enables developers to allocate and manage the temporary buffer space on their own, and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

    • Allocate the temporary space through the API framework.
      1
      2
      template <typename T, bool isReuseSource = false>
      __aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<T>& outputMean, const LocalTensor<T>& outputVariance, const LocalTensor<T>& inputX, const LocalTensor<T>& gamma, const LocalTensor<T>& beta, const T epsilon, LayerNormTiling& tiling)
      

      When using this method, developers do not need to allocate the space, but must reserve the required size for the space.

Parameters

  • API for outputting the normalization result, mean, and variance
    Table 1 Parameters in the template

    Parameter

    Description

    T

    Data type of the operand.

    isReuseSource

    Whether the source operand can be modified. The default value is false. If developers permit the modification of the source operand, enable this parameter, and this can save certain buffer space.

    If this parameter is set to true, the buffer space of inputX is reused during internal computation of this API to save the buffer space. If this parameter is set to false, the buffer space of inputX is not reused during internal computation of this API.

    This parameter can be enabled for float data inputs but cannot be enabled for half data inputs.

    For details about how to use isReuseSource, see More Examples.

    Table 2 API parameters

    Parameter

    Input/Output

    Meaning

    output

    Output

    Destination operand, with a type of LocalTensor and a shape of [B, S, H]. For details about the definition of the LocalTensor data structure, see LocalTensor.

    outputMean

    Output

    Mean, with a type of LocalTensor and a shape of [B, S]. For details about the definition of the LocalTensor data structure, see LocalTensor.

    outputVariance

    Output

    Variance, with a type of LocalTensor and a shape of [B, S]. For details about the definition of the LocalTensor data structure, see LocalTensor.

    inputX

    Input

    Source operand, with a type of LocalTensor and a shape of [B, S, H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of inputX must be the same as that of the destination operand, and the last axis length must be 32-byte aligned.

    gamma

    Input

    Scaling coefficient, with a type of LocalTensor and a shape of [H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of gamma must be the same as that of the destination operand, and the last axis length must be 32-byte aligned.

    beta

    Input

    Translation coefficient, with a type of LocalTensor and a shape of [H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of beta must be the same as that of the destination operand, and the last axis length must be 32-byte aligned.

    sharedTmpBuffer

    Input

    Shared buffer, which is used to store temporary data generated during internal API computation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see LayerNorm Tiling.

    The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

    epsilon

    Input

    Weight coefficient for preventing division by zero.

    tiling

    Input

    Tiling information required for LayerNorm computation. For details about how to obtain the tiling information, see LayerNorm Tiling.

Returns

None

Availability

Constraints

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • API for outputting the normalization result, mean, and variance:
    • The tensor space of src and dst can be reused.
    • The input must be in ND format.
    • If the input data does not meet the alignment requirements, developers need to pad the data. The padded data should be set to 0 to prevent abnormal values from affecting network computation.
    • The last axis (H axis) cannot be split.

Example

  • Example of calling the API for outputting the normalization result, mean, and variance
      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
     11
     12
     13
     14
     15
     16
     17
     18
     19
     20
     21
     22
     23
     24
     25
     26
     27
     28
     29
     30
     31
     32
     33
     34
     35
     36
     37
     38
     39
     40
     41
     42
     43
     44
     45
     46
     47
     48
     49
     50
     51
     52
     53
     54
     55
     56
     57
     58
     59
     60
     61
     62
     63
     64
     65
     66
     67
     68
     69
     70
     71
     72
     73
     74
     75
     76
     77
     78
     79
     80
     81
     82
     83
     84
     85
     86
     87
     88
     89
     90
     91
     92
     93
     94
     95
     96
     97
     98
     99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125
    126
    #include "kernel_operator.h"
    
    template <typename dataType, bool isReuseSource = false>
    class KernelLayernorm {
    public:
        __aicore__ inline KernelLayernorm()
        {}
        __aicore__ inline void Init(GM_ADDR inputXGm, GM_ADDR gammGm, GM_ADDR betaGm, GM_ADDR outputGm,
            GM_ADDR outputMeanGm, GM_ADDR outputVarianceGm, const LayerNormTiling &tiling)
        {
            this->bLength = tiling.bLength;
            this->sLength = tiling.sLength;
            this->hLength = tiling.hLength;
            this->tiling = tiling;
    
            bshLength = bLength * sLength * hLength;
            bsLength = bLength * sLength;
    
            inputXGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(inputXGm), bshLength);
            gammGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(gammGm), hLength);
            betaGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(betaGm), hLength);
    
            outputGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(outputGm), bshLength);
            outputMeanGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(outputMeanGm), bsLength);
            outputVarianceGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(outputVarianceGm), bsLength);
    
            pipe.InitBuffer(inQueueX, 1, sizeof(dataType) * bshLength);
            pipe.InitBuffer(inQueueGamma, 1, sizeof(dataType) * hLength);
            pipe.InitBuffer(inQueueBeta, 1, sizeof(dataType) * hLength);
            pipe.InitBuffer(outQueue, 1, sizeof(dataType) * bshLength);
            pipe.InitBuffer(outQueueMean, 1, sizeof(dataType) * bsLength);
            pipe.InitBuffer(outQueueVariance, 1, sizeof(dataType) * bsLength);
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<dataType> inputXLocal = inQueueX.AllocTensor<dataType>();
            AscendC::LocalTensor<dataType> gammaLocal = inQueueGamma.AllocTensor<dataType>();
            AscendC::LocalTensor<dataType> betaLocal = inQueueBeta.AllocTensor<dataType>();
    
            AscendC::DataCopy(inputXLocal, inputXGlobal, bshLength);
            AscendC::DataCopy(gammaLocal, gammGlobal, hLength);
            AscendC::DataCopy(betaLocal, betaGlobal, hLength);
    
            inQueueX.EnQue(inputXLocal);
            inQueueGamma.EnQue(gammaLocal);
            inQueueBeta.EnQue(betaLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<dataType> inputXLocal = inQueueX.DeQue<dataType>();
            AscendC::LocalTensor<dataType> gammaLocal = inQueueGamma.DeQue<dataType>();
            AscendC::LocalTensor<dataType> betaLocal = inQueueBeta.DeQue<dataType>();
    
            AscendC::LocalTensor<dataType> outputLocal = outQueue.AllocTensor<dataType>();
            AscendC::LocalTensor<dataType> meanLocal = outQueueMean.AllocTensor<dataType>();
            AscendC::LocalTensor<dataType> varianceLocal = outQueueVariance.AllocTensor<dataType>();
    
            AscendC::LayerNorm<dataType, isReuseSource>(
                outputLocal, meanLocal, varianceLocal, inputXLocal, gammaLocal, betaLocal, (dataType)epsilon, tiling);
    
            outQueue.EnQue<dataType>(outputLocal);
            outQueueMean.EnQue<dataType>(meanLocal);
            outQueueVariance.EnQue<dataType>(varianceLocal);
    
            inQueueX.FreeTensor(inputXLocal);
            inQueueGamma.FreeTensor(gammaLocal);
            inQueueBeta.FreeTensor(betaLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<dataType> outputLocal = outQueue.DeQue<dataType>();
            AscendC::LocalTensor<dataType> meanLocal = outQueueMean.DeQue<dataType>();
            AscendC::LocalTensor<dataType> varianceLocal = outQueueVariance.DeQue<dataType>();
    
            AscendC::DataCopy(outputGlobal, outputLocal, bshLength);
            AscendC::DataCopy(outputMeanGlobal, meanLocal, bsLength);
            AscendC::DataCopy(outputVarianceGlobal, varianceLocal, bsLength);
    
            outQueue.FreeTensor(outputLocal);
            outQueueMean.FreeTensor(meanLocal);
            outQueueVariance.FreeTensor(varianceLocal);
        }
    
    private:
        AscendC::GlobalTensor<dataType> inputXGlobal;
        AscendC::GlobalTensor<dataType> gammGlobal;
        AscendC::GlobalTensor<dataType> betaGlobal;
        AscendC::GlobalTensor<dataType> outputGlobal;
        AscendC::GlobalTensor<dataType> outputMeanGlobal;
        AscendC::GlobalTensor<dataType> outputVarianceGlobal;
    
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueGamma;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueBeta;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueue;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueMean;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueVariance;
    
        uint32_t bLength;
        uint32_t sLength;
        uint32_t hLength;
        dataType epsilon = 0.001;
    
        uint32_t bshLength;
        uint32_t bsLength;
    
        LayerNormTiling tiling;
    };
    
    extern "C" __global__ __aicore__ void kernel_layernorm_operator(GM_ADDR inputXGm, GM_ADDR gammGm, GM_ADDR betaGm,
        GM_ADDR outputGm, GM_ADDR outputMeanGm, GM_ADDR outputVarianceGm, GM_ADDR tiling)
    {
        GET_TILING_DATA(tilingData, tiling);
        KernelLayernorm<half, false> op;
        op.Init(inputXGm, gammGm, betaGm, outputGm, outputMeanGm, outputVarianceGm, tilingData.layernormTilingData);
        op.Process();
    }