LayerNorm

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

This section describes two LayerNorm APIs, which differ in what they output.

For the input data with the shape of [B, S, H], the normalized result, mean value, and variance are output.
During the training of a deep neural network, parameter updates in earlier layers change the distribution of the inputs to subsequent layers, which leads to unbalanced weight updates and slower learning. Normalization rescales the input of each network layer to a standard distribution (for LayerNorm, zero mean and unit variance along the normalized axis, before γ and β are applied), so that the input and output distributions stay consistent across layers. This speeds up the convergence of the training parameters and makes training more stable. LayerNorm is one of several normalization methods.

This API implements LayerNorm normalization for input data with a shape of [B, S, H]. The calculation formula is as follows, where γ is the scale coefficient, β is the translation coefficient, and ε is a small constant that prevents division by zero:

$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \varepsilon}} \cdot \gamma + \beta$

The following two parameters respectively represent the mean and variance of the input on the H axis:

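$\mathrm{E}[x] = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad \mathrm{Var}[x] = \frac{1}{H}\sum_{i=1}^{H}\left(x_i - \mathrm{E}[x]\right)^2$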
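
As a cross-check for kernel results, the [B, S, H] computation can be sketched as a plain host-side C++ reference. This is only an illustrative sketch, not the Ascend C API: the names LayerNormResult and LayerNormReference are hypothetical, the B and S axes are flattened into a single row count, and all data is assumed to be float.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference; not part of the Ascend C API.
struct LayerNormResult {
    std::vector<float> output;    // rows * h elements, same layout as the input
    std::vector<float> mean;      // one value per row (shape [B, S] flattened)
    std::vector<float> variance;  // one value per row (shape [B, S] flattened)
};

// rows = B * S, h = H. gamma and beta each hold h elements.
inline LayerNormResult LayerNormReference(const std::vector<float>& x,
                                          const std::vector<float>& gamma,
                                          const std::vector<float>& beta,
                                          std::size_t rows, std::size_t h,
                                          float epsilon)
{
    assert(h > 0 && x.size() == rows * h);
    assert(gamma.size() == h && beta.size() == h);

    LayerNormResult result{std::vector<float>(rows * h),
                           std::vector<float>(rows),
                           std::vector<float>(rows)};
    const float invH = 1.0f / static_cast<float>(h);

    for (std::size_t row = 0; row < rows; ++row) {
        const float* in = x.data() + row * h;
        float* out = result.output.data() + row * h;

        // Mean over the H axis.
        float mean = 0.0f;
        for (std::size_t i = 0; i < h; ++i) {
            mean += in[i] * invH;
        }

        // Variance over the H axis (divides by H, not H - 1).
        float variance = 0.0f;
        for (std::size_t i = 0; i < h; ++i) {
            const float diff = in[i] - mean;
            variance += diff * diff * invH;
        }

        // Normalize, then scale by gamma and shift by beta.
        const float invStd = 1.0f / std::sqrt(variance + epsilon);
        for (std::size_t i = 0; i < h; ++i) {
            out[i] = (in[i] - mean) * invStd * gamma[i] + beta[i];
        }

        result.mean[row] = mean;
        result.variance[row] = variance;
    }
    return result;
}
```

For a single row {1, 2, 3, 4} with gamma set to 1, beta set to 0, and epsilon set to 0, this reference yields a mean of 2.5 and a variance of 1.25.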

For the input data with the shape of [A, R], the normalized result, mean value, and reciprocal of the standard deviation are output.
This API implements LayerNorm normalization for input data with a shape of [A, R]. The calculation formula is as follows, where γ is the scale coefficient, β is the translation coefficient, and ε is a small constant that prevents division by zero:

$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \varepsilon}} \cdot \gamma + \beta$

The following three parameters respectively represent the mean, variance, and reciprocal of the standard deviation of the input on the R axis:

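$\mathrm{E}[x] = \frac{1}{R}\sum_{i=1}^{R} x_i, \qquad \mathrm{Var}[x] = \frac{1}{R}\sum_{i=1}^{R}\left(x_i - \mathrm{E}[x]\right)^2, \qquad \mathrm{rstd} = \frac{1}{\sqrt{\mathrm{Var}[x] + \varepsilon}}$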
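
The difference from the [B, S, H] variant is that the second statistic returned is rstd rather than the variance, so the normalization becomes a multiplication. A minimal host-side sketch for a single row of length R follows. The names RowStats and NormalizeRowRstd are hypothetical, and float data is assumed throughout.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper; not part of the Ascend C API.
struct RowStats {
    float mean;
    float rstd;  // 1 / sqrt(variance + epsilon)
};

// Normalizes one row of length R and returns its mean and rstd.
inline RowStats NormalizeRowRstd(const std::vector<float>& x,
                                 const std::vector<float>& gamma,
                                 const std::vector<float>& beta,
                                 float epsilon,
                                 std::vector<float>& out)
{
    const std::size_t r = x.size();
    assert(r > 0 && gamma.size() == r && beta.size() == r);
    const float invR = 1.0f / static_cast<float>(r);

    float mean = 0.0f;
    for (float v : x) {
        mean += v * invR;
    }

    float variance = 0.0f;
    for (float v : x) {
        variance += (v - mean) * (v - mean) * invR;
    }

    const float rstd = 1.0f / std::sqrt(variance + epsilon);

    // y = (x - mean) * rstd * gamma + beta
    out.resize(r);
    for (std::size_t i = 0; i < r; ++i) {
        out[i] = (x[i] - mean) * rstd * gamma[i] + beta[i];
    }
    return RowStats{mean, rstd};
}
```

If the variance is needed afterwards, it can be recovered from rstd as 1 / rstd^2 - ε.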

Principles

For the input data with the shape of [B, S, H], the normalized result, mean value, and variance are output.
The figure below illustrates the internal algorithm block diagram of LayerNorm high-level APIs, taking the float type, ND format, and inputs inputX[B, S, H], gamma[H], and beta[H] as examples.

Figure 1 LayerNorm algorithm block diagram

The computation process is divided into the following steps, all of which are performed on vectors (m indicates the length of the last axis H):
1. Calculate the mean: Muls multiplies x by 1/m, and ReduceSum then accumulates the scaled values to obtain the mean outputMean.
2. Calculate the variance: Sub computes the difference between the input x and the mean, Mul squares the difference, Muls multiplies the squares by 1/m, and the accumulated sum is the variance outputVariance.
3. Process gamma and beta: Broadcast gamma and beta to the BSH dimension.
4. Compute the output: Broadcast the variance to a BSH-dimension tensor, pass it through Adds(outputVariance, eps), Ln, Muls, and Exp in sequence, and multiply the result by (x – mean). Then multiply by gamma and add beta to obtain the output.
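
The Adds, Ln, Muls, and Exp chain in step 4 can be read as computing (variance + ε)^(-1/2) through the identity exp(-0.5 · ln(v)) = 1/√v. The scalar sketch below illustrates that reading on the host. The Muls factor of -0.5 is an assumption inferred from the formula, since the text does not state the scalar, and RsqrtViaLnExp is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar illustration of the Adds -> Ln -> Muls -> Exp chain.
// Assumption: the Muls step multiplies by -0.5, so the chain yields
// 1 / sqrt(variance + epsilon).
inline float RsqrtViaLnExp(float variance, float epsilon)
{
    assert(variance + epsilon > 0.0f);            // Ln needs a positive argument
    const float shifted = variance + epsilon;     // Adds(outputVariance, eps)
    const float logged = std::log(shifted);       // Ln
    const float scaled = logged * -0.5f;          // Muls (assumed factor)
    return std::exp(scaled);                      // Exp
}
```

Multiplying the returned value by (x – mean), then by gamma, and adding beta gives the same result as dividing by the standard deviation directly.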

For the input data with the shape of [A, R], the normalized result, mean value, and reciprocal of the standard deviation are output.
The figure below illustrates the internal algorithm block diagram of LayerNorm high-level APIs, taking the float type, ND format, and inputs inputX[A, R], gamma[R], and beta[R] as examples.

Figure 2 LayerNorm-Rstd algorithm block diagram

The computation process is divided into the following steps, all of which are performed on vectors with the A axis being considered as the outermost loop.
1. Compute the mean: Use binary (pairwise) accumulation. First multiply each element of x by 1/(2^k + m) so that the subsequent accumulation cannot overflow. Then split the data into a whole block of 2^k elements and a tail block of m elements, and add the tail block to the whole block. For ease of description, let Vnum be the number of elements that participate in a single computation. Apply Vadd to the odd and even positions of the whole block, in units of Vnum, until a result of Vnum length is obtained, and apply WholeReduceSum to that result to obtain the output mean.
2. Calculate rstd: Sub computes the difference between the input x and the mean, and Mul squares the difference. To prevent overflow, the variance is computed from the squared values with the same binary accumulation method. Add ε, the coefficient that prevents division by zero, to the variance, and apply Rsqrt to obtain the output rstd.
3. Calculate the output: Sub computes the difference between the input x and the mean. Multiply the difference by rstd and gamma, and then add beta to obtain the output.

Prototype

Due to the complex computation involved in the internal implementation of this API, extra temporary space is required to store intermediate variables generated during computation. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required maximum and minimum temporary space sizes using the GetLayerNormMaxMinTmpSize API provided in LayerNorm Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.

The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the LayerNorm API.

For the input data with the shape of [B, S, H], the normalized result, mean value, and variance are output.

Pass the temporary space through the sharedTmpBuffer input parameter.

template <typename T, bool isReuseSource = false>
__aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<T>& outputMean, const LocalTensor<T>& outputVariance, const LocalTensor<T>& inputX, const LocalTensor<T>& gamma, const LocalTensor<T>& beta, const LocalTensor<uint8_t>& sharedTmpBuffer, const T epsilon, LayerNormTiling& tiling)

This method enables developers to allocate and manage the temporary memory space on their own, and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated or deallocated, improving the flexibility and buffer utilization.

Allocate the temporary space through the API framework.

template <typename T, bool isReuseSource = false>
__aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<T>& outputMean, const LocalTensor<T>& outputVariance, const LocalTensor<T>& inputX, const LocalTensor<T>& gamma, const LocalTensor<T>& beta, const T epsilon, LayerNormTiling& tiling)

When using this method, you do not need to allocate the space, but must reserve the required temporary space size.

For the input data with the shape of [A, R], the normalized result, mean value, and reciprocal of the standard deviation are output.

Pass the temporary space through the sharedTmpBuffer input parameter.

template <typename U, typename T, bool isReuseSource = false, const LayerNormConfig& config = LNCFG_NORM>
__aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<float>& outputMean, const LocalTensor<float>& outputRstd, const LocalTensor<T>& inputX, const LocalTensor<U>& gamma, const LocalTensor<U>& beta, const float epsilon, const LocalTensor<uint8_t>& sharedTmpBuffer, const LayerNormPara& para, const LayerNormSeparateTiling& tiling)

Allocate the temporary space through the API framework.

template <typename U, typename T, bool isReuseSource = false, const LayerNormConfig& config = LNCFG_NORM>
__aicore__ inline void LayerNorm(const LocalTensor<T>& output, const LocalTensor<float>& outputMean, const LocalTensor<float>& outputRstd, const LocalTensor<T>& inputX, const LocalTensor<U>& gamma, const LocalTensor<U>& beta, const float epsilon, const LayerNormPara& para, const LayerNormSeparateTiling& tiling)

When using this method, you do not need to allocate the space, but must reserve the required temporary space size.

Parameters

API for outputting the normalized result, mean value, and variance of the input data with the shape of [B, S, H]

**Table 1** Template parameters
Parameter	Description
T	Data type of the operand. The supported data types are half and float for the Atlas A3 training products / Atlas A3 inference products, the Atlas A2 training products / Atlas A2 inference products, and the Atlas inference product's AI Core.
isReuseSource	Whether the source operand can be modified. The default value is false. If this parameter is set to true, the API reuses the inputX memory space during internal computation, which reduces memory usage but overwrites inputX. If this parameter is set to false, the inputX memory space is not reused. This parameter can be enabled for float data inputs but cannot be enabled for half data inputs. For details about how to use isReuseSource, see Example 4.

**Table 2** API parameters
Parameter	Input/Output	Description
output	Output	Destination operand, with a shape of [B, S, H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
outputMean	Output	Mean, with a shape of [B, S]. For details about the definition of the LocalTensor data structure, see LocalTensor. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
outputVariance	Output	Variance, with a shape of [B, S]. For details about the definition of the LocalTensor data structure, see LocalTensor. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
inputX	Input	Source operand, with a shape of [B, S, H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of inputX must be the same as that of the destination operand, and the last axis length must be 32-byte aligned. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
gamma	Input	Scaling coefficient, with a shape of [H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of gamma must be the same as that of the destination operand, and the last axis length must be 32-byte aligned. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
beta	Input	Translation coefficient, with a shape of [H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of beta must be the same as that of the destination operand, and the last axis length must be 32-byte aligned. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
sharedTmpBuffer	Input	Shared buffer, which is used to store temporary data generated during internal API computation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see LayerNorm Tiling. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
epsilon	Input	Weight coefficient for preventing division by zero.
tiling	Input	Tiling information required for LayerNorm computation. For details about how to obtain the tiling information, see LayerNorm Tiling.

API for outputting the normalized result, mean value, and reciprocal of the standard deviation for the input data with the shape of [A, R]

**Table 3** Template parameters
Parameter	Description
U	Data type of the beta and gamma operands. The supported data types are half and float for the Atlas A3 training products / Atlas A3 inference products, the Atlas A2 training products / Atlas A2 inference products, and the Atlas inference product's AI Core.
T	Data type of the output and inputX operands. The supported data types are half and float for the Atlas A3 training products / Atlas A3 inference products, the Atlas A2 training products / Atlas A2 inference products, and the Atlas inference product's AI Core.
isReuseSource	This parameter is reserved. Pass the default value false.
config	A parameter used to configure the input and output information of the LayerNorm API. Its type is LayerNormConfig.

The LayerNormConfig type is defined as follows:

struct LayerNormConfig {
    bool isNoBeta = false;
    bool isNoGamma = false;
    bool isOnlyOutput = false;
};

isNoBeta: Whether beta is excluded from the computation.
- false: Default value. The input beta is used in LayerNorm computation.
- true: The input beta is not used in LayerNorm computation. In this case, computation related to beta in the formula is omitted.
isNoGamma: Whether gamma is excluded from the computation.
- false: Default value. The input gamma is used in LayerNorm computation.
- true: The input gamma is not used in LayerNorm computation. In this case, computation related to gamma in the formula is omitted.
isOnlyOutput: Whether only y is output, without the mean and the reciprocal of the standard deviation (rstd). Currently, this parameter can only be set to false, which means that y, mean, and rstd are all output.
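
Because config is passed as a template argument by const reference, it has to refer to a compile-time constant object. The host-side sketch below illustrates that pattern with stand-in types. DemoLayerNormConfig, DEMO_CFG_NORM, DEMO_CFG_NO_BETA, and ApplyAffine are hypothetical names that only mirror the fields listed above; they are not the Ascend C definitions.

```cpp
#include <cassert>

// Hypothetical stand-in that mirrors the documented LayerNormConfig fields.
struct DemoLayerNormConfig {
    bool isNoBeta = false;
    bool isNoGamma = false;
    bool isOnlyOutput = false;
};

// Namespace-scope constexpr objects can be bound to a reference template parameter.
constexpr DemoLayerNormConfig DEMO_CFG_NORM{};                      // gamma and beta both used
constexpr DemoLayerNormConfig DEMO_CFG_NO_BETA{true, false, false}; // beta skipped

// Applies the gamma/beta part of the formula to an already normalized value.
template <const DemoLayerNormConfig& config = DEMO_CFG_NORM>
inline float ApplyAffine(float normalized, float gamma, float beta)
{
    static_assert(!config.isOnlyOutput, "isOnlyOutput must currently be false");

    // Both flags are known at compile time, so the unused branch can be folded away.
    constexpr bool useGamma = !config.isNoGamma;
    constexpr bool useBeta = !config.isNoBeta;

    float y = normalized;
    if (useGamma) {
        y *= gamma;
    }
    if (useBeta) {
        y += beta;
    }
    return y;
}
```

With these stand-ins, ApplyAffine(2.0f, 3.0f, 1.0f) evaluates 2 · 3 + 1, while ApplyAffine<DEMO_CFG_NO_BETA>(2.0f, 3.0f, 1.0f) skips the addition of beta.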

**Table 4** API parameters
Parameter	Input/Output	Description
output	Output	Destination operand, with a shape of [A, R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
outputMean	Output	Mean, with a shape of [A]. For details about the definition of the LocalTensor data structure, see LocalTensor. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
outputRstd	Output	Reciprocal of the standard deviation, with a shape of [A]. For details about the definition of the LocalTensor data structure, see LocalTensor. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
inputX	Input	Source operand, with a shape of [A, R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of inputX must be the same as that of the destination operand, and the last axis length must be 32-byte aligned. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
gamma	Input	Scaling coefficient, with a shape of [R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type precision of gamma must be greater than or equal to that of the source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
beta	Input	Translation coefficient, with a shape of [R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type precision of beta must be greater than or equal to that of the source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
epsilon	Input	Weight coefficient for preventing division by zero.
sharedTmpBuffer	Input	Shared buffer, which is used to store temporary data generated during internal API computation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see LayerNorm Tiling. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
para	Input	Parameter information required for LayerNorm computation. Its type is LayerNormPara.
tiling	Input	Tiling information required for LayerNorm computation. For details about how to obtain the tiling information, see LayerNorm Tiling.

The LayerNormPara type is defined as follows:

struct LayerNormPara {
    uint32_t aLength;
    uint32_t rLength;
    uint32_t rLengthWithPadding;
};

aLength: Length of inputX along the A axis.
rLength: Length of the data to be processed in inputX along the R axis.
rLengthWithPadding: Length of inputX along the R axis after 32-byte alignment.
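
The relationship between rLength and rLengthWithPadding in LayerNormPara can be illustrated with a small host-side helper that rounds the R axis up to a whole number of 32-byte blocks. This is a sketch under assumptions: DemoLayerNormPara and MakePara are hypothetical names, and typeSize is the element size in bytes (4 for float, 2 for half).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in that mirrors the documented LayerNormPara fields.
struct DemoLayerNormPara {
    std::uint32_t aLength;
    std::uint32_t rLength;
    std::uint32_t rLengthWithPadding;
};

constexpr std::uint32_t kAlignBytes = 32;

// Rounds rLength up so that rLengthWithPadding * typeSize is a multiple of 32 bytes.
inline DemoLayerNormPara MakePara(std::uint32_t aLength, std::uint32_t rLength,
                                  std::uint32_t typeSize)
{
    assert(typeSize > 0 && kAlignBytes % typeSize == 0);
    const std::uint32_t elemsPerBlock = kAlignBytes / typeSize;  // 8 for float, 16 for half
    const std::uint32_t padded =
        (rLength + elemsPerBlock - 1) / elemsPerBlock * elemsPerBlock;
    return DemoLayerNormPara{aLength, rLength, padded};
}
```

For example, 10 float elements on the R axis occupy 40 bytes, so the padded length becomes 16 elements (64 bytes).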

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
Constraints on the API that outputs the normalization result, mean value, and variance for the input data with the shape of [B, S, H]:
- The space of output and inputX can be reused. The space of other outputs and inputs cannot be reused.
- If the tail axis H in the input data does not meet the alignment requirements, you need to pad the data. The padded data should be set to 0 to prevent abnormal values from affecting network computation.
- The last axis (H axis) cannot be split.
- The H axis lengths of inputX, output, gamma, and beta must be the same.
- The B axis lengths and S axis lengths of inputX, output, outputMean, and outputVariance must be the same.

Constraints on the API that outputs the normalized result, mean value, and reciprocal of the standard deviation for the input data with the shape of [A, R]:
- The data type precision of gamma and beta must be greater than or equal to that of the source operand.
- The tensor space of the source operand (inputX) and the destination operand (output) cannot be reused.
- The R axis cannot be split.

Example

Example of calling the API to output the normalized result, mean value, and variance when the shape of the input data is [B, S, H]

For the complete call example, see the sample of the LayerNorm operator that outputs the variance.

// The tiling data is obtained from the host. bshLength, hLength, bsLength, and epsilon are all obtained from the tiling data.
AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueGamma;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueBeta;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueMean;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueVariance;

// Reserve one buffer per queue, sized for float data.
pipe.InitBuffer(inQueueX, 1, sizeof(float) * bshLength);
pipe.InitBuffer(inQueueGamma, 1, sizeof(float) * hLength);
pipe.InitBuffer(inQueueBeta, 1, sizeof(float) * hLength);
pipe.InitBuffer(outQueue, 1, sizeof(float) * bshLength);
pipe.InitBuffer(outQueueMean, 1, sizeof(float) * bsLength);
pipe.InitBuffer(outQueueVariance, 1, sizeof(float) * bsLength);

AscendC::LocalTensor<float> inputX = inQueueX.AllocTensor<float>();
AscendC::LocalTensor<float> gamma = inQueueGamma.AllocTensor<float>();
AscendC::LocalTensor<float> beta = inQueueBeta.AllocTensor<float>();
AscendC::LocalTensor<float> output = outQueue.AllocTensor<float>();
AscendC::LocalTensor<float> mean = outQueueMean.AllocTensor<float>();
AscendC::LocalTensor<float> variance = outQueueVariance.AllocTensor<float>();

// No sharedTmpBuffer is passed, so the temporary space is allocated through the API framework.
AscendC::LayerNorm<float, false>(output, mean, variance, inputX, gamma, beta, (float)epsilon, tiling);

Example of calling the API to output the normalized result, mean, and reciprocal of the standard deviation of the input data with shape [A, R]

For the complete call example, see the sample of the LayerNorm operator that outputs the reciprocal of the standard deviation.

AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueGamma;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueBeta;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueMean;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueRstd;

// arLength, rLengthWithPadding, aLength, rLength, and epsilon are all obtained from the tiling data.
pipe.InitBuffer(inQueueX, 1, sizeof(float) * arLength);
pipe.InitBuffer(inQueueGamma, 1, sizeof(float) * rLengthWithPadding);
pipe.InitBuffer(inQueueBeta, 1, sizeof(float) * rLengthWithPadding);
pipe.InitBuffer(outQueue, 1, sizeof(float) * arLength);
pipe.InitBuffer(outQueueMean, 1, sizeof(float) * aLength);
pipe.InitBuffer(outQueueRstd, 1, sizeof(float) * aLength);

AscendC::LocalTensor<float> inputX = inQueueX.AllocTensor<float>();
AscendC::LocalTensor<float> gamma = inQueueGamma.AllocTensor<float>();
AscendC::LocalTensor<float> beta = inQueueBeta.AllocTensor<float>();
AscendC::LocalTensor<float> output = outQueue.AllocTensor<float>();
AscendC::LocalTensor<float> mean = outQueueMean.AllocTensor<float>();
AscendC::LocalTensor<float> rstd = outQueueRstd.AllocTensor<float>();

// config is a compile-time constant. Its type and value are AscendC::LayerNormConfig{false, false, false}.
// Type and value of para: AscendC::LayerNormPara{aLength, rLength, rLengthWithPadding}
AscendC::LayerNorm<float, float, false, config>(output, mean, rstd, inputX, gamma, beta, (float)epsilon, para, tiling);
