LayerNorm
Function Usage
This section describes the following two LayerNorm APIs based on the API output:
- Output the normalization result, mean, and variance.
During the training process of a deep neural network, updating the training parameters in the earlier layers can cause changes in the input data distribution for subsequent layers, resulting in unbalanced weight updates and reduced learning efficiency. Implementing the normalization policy to scale the input data of network layers to the [0, 1] range standardizes the distributions of both input and output data across network layers. This expedites the convergence of training parameters and ensures more stable improvements in learning efficiency. LayerNorm is one of many normalization methods.
This API implements LayerNorm normalization for input data with a shape of [B, S, H]. The calculation formula is as follows, where γ is the scaling coefficient, β is the translation coefficient, and ε is the weight coefficient for preventing division by zero:

The following two parameters respectively represent the mean and variance of the input on the H axis:

- Output the normalization result, mean, and reciprocal of the standard deviation.
This API implements LayerNorm normalization for input data with a shape of [A, R]. The calculation formula is as follows, where γ is the scaling coefficient, β is the translation coefficient, and ε is the weight coefficient for preventing division by zero:

The following three parameters respectively represent the mean, variance, and reciprocal of the standard deviation of the input on the R axis:

Principles
- Output the normalization result, mean, and variance.
The figure below illustrates the internal algorithm block diagram of LayerNorm high-level APIs, taking the float type, ND format, and inputs inputX[B, S, H], gamma[H], and beta[H] as examples.
Figure 1 LayerNorm algorithm block diagram
The computation process is divided into the following steps, all of which are performed on vectors:
- Calculate the mean: Muls calculates the value of x*1/m, and then calculates the accumulated value ReduceSum to obtain the mean outputMean.
- Calculate the variance: Sub calculates the difference between input x and the mean, uses Mul to square the difference, multiplies Muls by 1/m, and calculates the accumulated value to obtain the variance outputVariance.
- Process gamma and beta: Obtain gamma and beta in the BSH dimension by broadcasting.
- Calculate the output: Broadcast the variance to obtain the BSH-dimension tensor, which passes through Adds(outputVariance, eps), Ln, Muls, and Exp in sequence and is then multiplied by (x – mean). The obtained result is multiplied by gamma and added with beta to obtain the output result.
- Output the normalization result, mean, and reciprocal of the standard deviation.
The figure below illustrates the internal algorithm block diagram of LayerNorm high-level APIs, taking the float type, ND format, and inputs inputX[A, R], gamma[R], and beta[R] as examples.
Figure 2 LayerNorm-Rstd algorithm block diagram
The computation process is divided into the following steps, all of which are performed on vectors with the A axis being considered as the outermost loop.
- Calculate the mean: Use the dichotomy accumulation method to multiply each element of x by 1/(2k + m) to prevent overflow of subsequent accumulations. Then, split the data into an entire block and a tail block, where the entire block includes 2k elements, and the tail block includes m elements. Add the data of the tail block to the data of the entire block. For ease of description, VL is defined as the number of elements participating in a single computation. ReduceSum calculates per VL of data in the entire block, Vadd calculates odd-numbered and even-numbered bits per VL to obtain a result per VL, ReduceSum calculates based on the result, and Vmuls multiplies the result by (2k + m)/2k to obtain the output mean.
- Calculate rstd: Sub calculates the difference between input x and the mean, and Mul squares the difference. To prevent overflow, use the same dichotomy accumulation method to calculate the variance Variance of the squared result. Add the variance to the coefficient for preventing division by zero ε, and calculate the output rstd by using Rsqrt.
- Calculate the output: Sub calculates the difference between input x and the mean. Multiply the difference by rstd and gamma, and add the obtained result with beta to obtain the output result.
Prototype
Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required maximum and minimum temporary space sizes using the GetLayerNormMaxMinTmpSize API provided in LayerNorm Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.
The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the LayerNorm API.
- Output the normalization result, mean, and variance.
- Pass the temporary space through the sharedTmpBuffer input parameter.
1 2
template <typename T, bool isReuseSource = false> __aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<T>& outputMean, const LocalTensor<T>& outputVariance, const LocalTensor<T>& inputX, const LocalTensor<T>& gamma, const LocalTensor<T>& beta, const LocalTensor<uint8_t>& sharedTmpBuffer, const T epsilon, LayerNormTiling& tiling)
This method enables developers to allocate and manage the temporary buffer space on their own, and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.
- Allocate the temporary space through the API framework.
1 2
template <typename T, bool isReuseSource = false> __aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<T>& outputMean, const LocalTensor<T>& outputVariance, const LocalTensor<T>& inputX, const LocalTensor<T>& gamma, const LocalTensor<T>& beta, const T epsilon, LayerNormTiling& tiling)
When using this method, developers do not need to allocate the space, but must reserve the required size for the space.
- Pass the temporary space through the sharedTmpBuffer input parameter.
Parameters
- API for outputting the normalization result, mean, and variance
Table 1 Parameters in the template Parameter
Description
T
Data type of the operand.
isReuseSource
Whether the source operand can be modified. The default value is false. If developers permit the modification of the source operand, enable this parameter, and this can save certain buffer space.
If this parameter is set to true, the buffer space of inputX is reused during internal computation of this API to save the buffer space. If this parameter is set to false, the buffer space of inputX is not reused during internal computation of this API.
This parameter can be enabled for float data inputs but cannot be enabled for half data inputs.
For details about how to use isReuseSource, see More Examples.
Table 2 API parameters Parameter
Input/Output
Meaning
output
Output
Destination operand, with a type of LocalTensor and a shape of [B, S, H]. For details about the definition of the LocalTensor data structure, see LocalTensor.
outputMean
Output
Mean, with a type of LocalTensor and a shape of [B, S]. For details about the definition of the LocalTensor data structure, see LocalTensor.
outputVariance
Output
Variance, with a type of LocalTensor and a shape of [B, S]. For details about the definition of the LocalTensor data structure, see LocalTensor.
inputX
Input
Source operand, with a type of LocalTensor and a shape of [B, S, H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of inputX must be the same as that of the destination operand, and the last axis length must be 32-byte aligned.
gamma
Input
Scaling coefficient, with a type of LocalTensor and a shape of [H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of gamma must be the same as that of the destination operand, and the last axis length must be 32-byte aligned.
beta
Input
Translation coefficient, with a type of LocalTensor and a shape of [H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of beta must be the same as that of the destination operand, and the last axis length must be 32-byte aligned.
sharedTmpBuffer
Input
Shared buffer, which is used to store temporary data generated during internal API computation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see LayerNorm Tiling.
The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
epsilon
Input
Weight coefficient for preventing division by zero.
tiling
Input
Tiling information required for LayerNorm computation. For details about how to obtain the tiling information, see LayerNorm Tiling.
Returns
None
Availability
Constraints
- For details about the alignment requirements of the operand address offset, see General Restrictions.
- API for outputting the normalization result, mean, and variance:
- The tensor space of src and dst can be reused.
- The input must be in ND format.
- If the input data does not meet the alignment requirements, developers need to pad the data. The padded data should be set to 0 to prevent abnormal values from affecting network computation.
- The last axis (H axis) cannot be split.
Example
- Example of calling the API for outputting the normalization result, mean, and variance
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
#include "kernel_operator.h" template <typename dataType, bool isReuseSource = false> class KernelLayernorm { public: __aicore__ inline KernelLayernorm() {} __aicore__ inline void Init(GM_ADDR inputXGm, GM_ADDR gammGm, GM_ADDR betaGm, GM_ADDR outputGm, GM_ADDR outputMeanGm, GM_ADDR outputVarianceGm, const LayerNormTiling &tiling) { this->bLength = tiling.bLength; this->sLength = tiling.sLength; this->hLength = tiling.hLength; this->tiling = tiling; bshLength = bLength * sLength * hLength; bsLength = bLength * sLength; inputXGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(inputXGm), bshLength); gammGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(gammGm), hLength); betaGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(betaGm), hLength); outputGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(outputGm), bshLength); outputMeanGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(outputMeanGm), bsLength); outputVarianceGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(outputVarianceGm), bsLength); pipe.InitBuffer(inQueueX, 1, sizeof(dataType) * bshLength); pipe.InitBuffer(inQueueGamma, 1, sizeof(dataType) * hLength); pipe.InitBuffer(inQueueBeta, 1, sizeof(dataType) * hLength); pipe.InitBuffer(outQueue, 1, sizeof(dataType) * bshLength); pipe.InitBuffer(outQueueMean, 1, sizeof(dataType) * bsLength); pipe.InitBuffer(outQueueVariance, 1, sizeof(dataType) * bsLength); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<dataType> inputXLocal = inQueueX.AllocTensor<dataType>(); AscendC::LocalTensor<dataType> gammaLocal = inQueueGamma.AllocTensor<dataType>(); AscendC::LocalTensor<dataType> betaLocal = inQueueBeta.AllocTensor<dataType>(); AscendC::DataCopy(inputXLocal, inputXGlobal, bshLength); AscendC::DataCopy(gammaLocal, gammGlobal, hLength); AscendC::DataCopy(betaLocal, betaGlobal, hLength); inQueueX.EnQue(inputXLocal); inQueueGamma.EnQue(gammaLocal); inQueueBeta.EnQue(betaLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<dataType> inputXLocal = inQueueX.DeQue<dataType>(); AscendC::LocalTensor<dataType> gammaLocal = inQueueGamma.DeQue<dataType>(); AscendC::LocalTensor<dataType> betaLocal = inQueueBeta.DeQue<dataType>(); AscendC::LocalTensor<dataType> outputLocal = outQueue.AllocTensor<dataType>(); AscendC::LocalTensor<dataType> meanLocal = outQueueMean.AllocTensor<dataType>(); AscendC::LocalTensor<dataType> varianceLocal = outQueueVariance.AllocTensor<dataType>(); AscendC::LayerNorm<dataType, isReuseSource>( outputLocal, meanLocal, varianceLocal, inputXLocal, gammaLocal, betaLocal, (dataType)epsilon, tiling); outQueue.EnQue<dataType>(outputLocal); outQueueMean.EnQue<dataType>(meanLocal); outQueueVariance.EnQue<dataType>(varianceLocal); inQueueX.FreeTensor(inputXLocal); inQueueGamma.FreeTensor(gammaLocal); inQueueBeta.FreeTensor(betaLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<dataType> outputLocal = outQueue.DeQue<dataType>(); AscendC::LocalTensor<dataType> meanLocal = outQueueMean.DeQue<dataType>(); AscendC::LocalTensor<dataType> varianceLocal = outQueueVariance.DeQue<dataType>(); AscendC::DataCopy(outputGlobal, outputLocal, bshLength); AscendC::DataCopy(outputMeanGlobal, meanLocal, bsLength); AscendC::DataCopy(outputVarianceGlobal, varianceLocal, bsLength); outQueue.FreeTensor(outputLocal); outQueueMean.FreeTensor(meanLocal); outQueueVariance.FreeTensor(varianceLocal); } private: AscendC::GlobalTensor<dataType> inputXGlobal; AscendC::GlobalTensor<dataType> gammGlobal; AscendC::GlobalTensor<dataType> betaGlobal; AscendC::GlobalTensor<dataType> outputGlobal; AscendC::GlobalTensor<dataType> outputMeanGlobal; AscendC::GlobalTensor<dataType> outputVarianceGlobal; AscendC::TPipe pipe; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueGamma; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueBeta; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueue; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueMean; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueVariance; uint32_t bLength; uint32_t sLength; uint32_t hLength; dataType epsilon = 0.001; uint32_t bshLength; uint32_t bsLength; LayerNormTiling tiling; }; extern "C" __global__ __aicore__ void kernel_layernorm_operator(GM_ADDR inputXGm, GM_ADDR gammGm, GM_ADDR betaGm, GM_ADDR outputGm, GM_ADDR outputMeanGm, GM_ADDR outputVarianceGm, GM_ADDR tiling) { GET_TILING_DATA(tilingData, tiling); KernelLayernorm<half, false> op; op.Init(inputXGm, gammGm, betaGm, outputGm, outputMeanGm, outputVarianceGm, tilingData.layernormTilingData); op.Process(); }