WelfordFinalize

Function Usage

Welford is an online method for calculating the mean and variance. It updates the mean and variance incrementally as samples arrive, so the samples do not need to be stored, which makes it well suited to large-scale data. It also requires only a single pass over the data, which reduces memory access and improves computational performance. This API performs the post-processing (finalization) step of the Welford algorithm.
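As a point of reference, the single-pass update described above can be sketched in plain host-side C++. This is a minimal illustration of the textbook Welford recurrence, not the implementation of this API; the names WelfordState, WelfordAdd, WelfordVariance, and WelfordScan are hypothetical and do not exist in the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Running statistics for one stream of samples (illustrative names only).
struct WelfordState {
    std::size_t count = 0;  // number of samples seen so far
    double mean = 0.0;      // running mean
    double m2 = 0.0;        // running sum of squared deviations from the mean
};

// Fold one sample into the state without storing earlier samples.
inline void WelfordAdd(WelfordState& s, double x) {
    s.count += 1;
    const double delta = x - s.mean;  // deviation from the old mean
    s.mean += delta / static_cast<double>(s.count);
    s.m2 += delta * (x - s.mean);     // combines old-mean and new-mean deviations
}

// Population variance (divide by count); requires at least one sample.
inline double WelfordVariance(const WelfordState& s) {
    assert(s.count > 0);
    return s.m2 / static_cast<double>(s.count);
}

// Convenience wrapper: one pass over a vector of samples.
inline WelfordState WelfordScan(const std::vector<double>& xs) {
    WelfordState s;
    for (double x : xs) {
        WelfordAdd(s, x);
    }
    return s;
}
```

Only three scalars are kept per stream, which is why the method needs no sample storage and only one traversal.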

If the reduced axis in the LayerNorm algorithm is large, you can split the reduced axis and use this API and WelfordUpdate together to implement equivalent calculation of LayerNorm. The calculation formula of this API is as follows:

  • Scenarios without the tail block or the counts parameter:

    The outputs are the mean (outputMean) and the variance (outputVariance).

    The inputs are the i-th mean value (inputMean) and the i-th variance (inputVariance). abLength is the size of each reduced-axis split, and rnLength is the number of times that the reduced axis is split by abLength. This formula is used only when the reduced axis length is exactly divisible by abLength.

  • Scenarios with the tail block or the counts parameter:

    In addition to the quantities above, counts holds the coefficient corresponding to each input element, and the formula also uses the length of the original, unsplit reduced axis.
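To make the two scenarios concrete, the following host-side C++ sketch combines per-position partial results into a single mean and variance using the standard parallel Welford merge. It is an illustration under stated assumptions, not the formula of this API: it assumes each input "variance" entry is the unnormalized sum of squared deviations accumulated for that position, and that the coefficient for a position is the number of samples folded into it. The names MeanVar, CombinePartials, and CombineUniform are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct MeanVar {
    double mean;
    double variance;  // population variance over all samples
};

// Combine per-position partial results.
// Assumption: m2[i] is the unnormalized sum of squared deviations of the
// counts[i] samples folded into position i, and means[i] is their mean.
inline MeanVar CombinePartials(const std::vector<double>& means,
                               const std::vector<double>& m2,
                               const std::vector<int>& counts) {
    assert(means.size() == m2.size() && means.size() == counts.size());
    double total = 0.0;
    double weightedSum = 0.0;
    for (std::size_t i = 0; i < means.size(); ++i) {
        total += counts[i];
        weightedSum += counts[i] * means[i];
    }
    assert(total > 0.0);
    const double mean = weightedSum / total;

    double sumSq = 0.0;
    for (std::size_t i = 0; i < means.size(); ++i) {
        const double d = means[i] - mean;
        // Within-position spread plus the spread of position means.
        sumSq += m2[i] + counts[i] * d * d;
    }
    return {mean, sumSq / total};
}

// Exact-division case: every position saw the same number of samples.
inline MeanVar CombineUniform(const std::vector<double>& means,
                              const std::vector<double>& m2,
                              int rnLength) {
    return CombinePartials(means, m2, std::vector<int>(means.size(), rnLength));
}
```

With uniform coefficients the weighted mean reduces to a plain average of the position means, which is why the exact-division scenario needs no counts input.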

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    • Scenarios without the counts parameter
      template <bool isReuseSource = false>
      __aicore__ inline void WelfordFinalize(const LocalTensor<float>& outputMean, const LocalTensor<float>& outputVariance, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<uint8_t>& sharedTmpBuffer, WelfordFinalizePara& para)
      
    • Scenarios with the counts parameter
      template <bool isReuseSource = false>
      __aicore__ inline void WelfordFinalize(const LocalTensor<float>& outputMean, const LocalTensor<float>& outputVariance, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<int32_t>& counts, const LocalTensor<uint8_t>& sharedTmpBuffer, WelfordFinalizePara& para)
      
  • Allocate the temporary space through the API framework.
    • Scenarios without the counts parameter
      template <bool isReuseSource = false>
      __aicore__ inline void WelfordFinalize(const LocalTensor<float>& outputMean, const LocalTensor<float>& outputVariance, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, WelfordFinalizePara& para)
      
    • Scenarios with the counts parameter
      template <bool isReuseSource = false>
      __aicore__ inline void WelfordFinalize(const LocalTensor<float>& outputMean, const LocalTensor<float>& outputVariance, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<int32_t>& counts, WelfordFinalizePara& para)
      

Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by you through the sharedTmpBuffer input parameter.

  • When the API framework allocates the temporary space, you do not need to allocate it yourself, but you must reserve the required amount of space.
  • When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not allocate any. You manage the sharedTmpBuffer space yourself and can reuse the buffer after the API call, which avoids repeated allocation and deallocation and improves flexibility and buffer utilization.

In either case you need the temporary space size (BufferSize): reserve that much space if the API framework is used, or allocate the tensor with that size if sharedTmpBuffer is used. Obtain the required maximum and minimum temporary space sizes using the GetWelfordFinalizeMaxMinTmpSize API provided in WelfordFinalize Tiling. The minimum size guarantees correct functionality, while the maximum size is intended to improve performance.
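One way to act on the minimum and maximum sizes is sketched below in host-side C++. The helper ChooseTmpBufferSize is hypothetical and is not part of the library; its minSize and maxSize arguments stand in for the values reported by GetWelfordFinalizeMaxMinTmpSize, whose exact signature is not shown here, and available is an assumed budget that the caller can spare.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: decide how many bytes to give the temporary space.
// Returns 0 when the budget cannot cover the minimum needed for correctness;
// otherwise returns as much as is available, capped at the maximum.
inline std::uint32_t ChooseTmpBufferSize(std::uint32_t minSize,
                                         std::uint32_t maxSize,
                                         std::uint32_t available) {
    assert(minSize <= maxSize);
    if (available < minSize) {
        return 0;  // caller must free up space or change the split
    }
    return std::min(available, maxSize);
}
```

Returning 0 for an insufficient budget is a design choice of this sketch; a real tiling function might report an error instead.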

Parameters

Table 1 Parameters in the template

Parameter | Description
isReuseSource | This parameter is reserved. Pass the default value false.

Table 2 API parameters

Parameter | Input/Output | Description
outputMean | Output | Mean, destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
outputVariance | Output | Variance, destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
inputMean | Input | Mean, source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
inputVariance | Input | Variance, source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
counts | Input | Source operand. The shape is [abLength]. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
sharedTmpBuffer | Input | Temporary space. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. This parameter is used to store intermediate variables during complex internal API computation and is provided by developers. For details about how to obtain the temporary space size (BufferSize), see WelfordFinalize Tiling.
para | Input | Parameter information required for calculation. The WelfordFinalizePara type is defined as follows:

struct WelfordFinalizePara {
    uint32_t rnLength; // Number of times that the reduced axis is split by abLength (rounded down)
    uint32_t abLength; // Size of the reduced axis split
    uint32_t headCount; // Value of headCount in the API without counts
    uint32_t headCountLength; // Length corresponding to the value of headCount in the API without counts
    uint32_t tailCount; // Value of tailCount in the API without counts
    uint32_t tailCountLength; // Length corresponding to the value of tailCount in the API without counts
    float abRec; // Value of 1/abLength
    float rRec; // 1/(rnLength*abLength) if there is no tail block; otherwise 1/abLength
};
  • rnLength indicates the number of times that the input reduced axis is split by abLength. If there is a tail block after splitting, the number of times is rounded down.
  • abLength indicates the size of the reduced axis split. For the API without the counts parameter, abLength = headCountLength + tailCountLength.
  • headCount indicates the value of headCount. It takes effect only in the API without the counts parameter, where it is used as the counts coefficient of the non-tail block in the formula.
  • headCountLength indicates the length corresponding to the value of headCount. It takes effect only in the API without the counts parameter.
  • tailCount indicates the value of tailCount. It takes effect only in the API without the counts parameter, where it is used as the counts coefficient of the tail block in the formula.
  • tailCountLength indicates the length corresponding to the value of tailCount. It takes effect only in the API without the counts parameter.
  • abRec indicates the reciprocal of abLength, that is, 1/abLength.
  • rRec depends on whether a tail block remains after the input reduced axis is split: it is 1/(rnLength x abLength) if there is no tail block, or 1/abLength if there is a tail block.
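The field relationships above can be sketched on the host side as follows. This is an illustration only: WelfordFinalizeParaSketch is a hypothetical mirror of the struct, and MakeParaSketch and ExpandCounts are hypothetical helpers. MakeParaSketch fills only the fields whose values are stated above and leaves the head/tail fields at zero for the caller to set. ExpandCounts assumes that the first headCountLength positions use headCount and the remaining tailCountLength positions use tailCount, which is one reading of abLength = headCountLength + tailCountLength.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side mirror of WelfordFinalizePara, for illustration.
struct WelfordFinalizeParaSketch {
    std::uint32_t rnLength;
    std::uint32_t abLength;
    std::uint32_t headCount;
    std::uint32_t headCountLength;
    std::uint32_t tailCount;
    std::uint32_t tailCountLength;
    float abRec;
    float rRec;
};

// Fill rnLength, abLength, abRec, and rRec from the reduced axis length.
// Head/tail fields are left at zero; the caller sets them.
inline WelfordFinalizeParaSketch MakeParaSketch(std::uint32_t rLength,
                                                std::uint32_t abLength) {
    assert(abLength > 0 && rLength >= abLength);
    WelfordFinalizeParaSketch p{};
    p.rnLength = rLength / abLength;  // rounded down when a tail remains
    p.abLength = abLength;
    p.abRec = 1.0f / static_cast<float>(abLength);
    const bool hasTail = (rLength % abLength) != 0;
    // rRec values follow the field description above.
    p.rRec = hasTail ? 1.0f / static_cast<float>(abLength)
                     : 1.0f / static_cast<float>(p.rnLength * abLength);
    return p;
}

// Assumed expansion of the head/tail fields into per-position coefficients.
inline std::vector<std::int32_t> ExpandCounts(const WelfordFinalizeParaSketch& p) {
    assert(p.headCountLength + p.tailCountLength == p.abLength);
    std::vector<std::int32_t> counts(p.abLength,
                                     static_cast<std::int32_t>(p.tailCount));
    for (std::uint32_t i = 0; i < p.headCountLength; ++i) {
        counts[i] = static_cast<std::int32_t>(p.headCount);
    }
    return counts;
}
```

Expanding head/tail values into an array of length abLength shows how the API without counts can describe the same information that the counts tensor carries explicitly.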

Returns

None

Availability

Precautions

Example