WelfordFinalize

Applicability

Product

Supported

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product's AI Core

Atlas inference product's Vector Core

x

Atlas training products

x

Function

Welford's algorithm computes the mean and variance online. It updates both statistics incrementally, so the samples do not need to be stored, which suits large-scale data. It also needs only a single pass over the data, which reduces memory access and improves computational performance. This API performs the post-processing (finalization) step of the Welford algorithm.

If the Reduce axis in the LayerNorm algorithm is large, you can split the Reduce axis and use this API together with WelfordUpdate to compute a result equivalent to LayerNorm. This API implements one of two formulas, depending on whether a tail block remains after the Reduce axis is split:

  • Scenarios without the tail block or the counts parameter:

    Mean indicates the mean output, and Var indicates the variance output.

    Mean_i indicates the i-th mean value of the input, and Var_i indicates the i-th variance of the input. Ab indicates the size of a single computation after the Reduce axis is split, Rn indicates the number of times the Reduce axis is split based on Ab, and 1/(Rn × Ab) indicates the variance coefficient rRec.

  • Scenarios with the tail block or the counts parameter:

    In addition to the preceding parameters, counts_i indicates the coefficient corresponding to Mean_i, R indicates the length of the original Reduce axis before splitting, and 1/R indicates the variance coefficient rRec.
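The formulas for the two scenarios are not reproduced in the text above. The following is a reconstruction from the symbol definitions and from the standard rule for merging Welford partial results, so treat it as a sketch rather than the normative definition. It assumes that Var_i holds the accumulated sum of squared deviations for position i (the quantity a Welford update step typically carries), not an already normalized variance.

```latex
% Without the tail block or the counts parameter (rRec = 1 / (R_n A_b))
\begin{aligned}
\mathrm{Mean} &= \frac{1}{A_b} \sum_{i=1}^{A_b} \mathrm{Mean}_i \\
\mathrm{Var}  &= \frac{1}{R_n A_b} \left( \sum_{i=1}^{A_b} \mathrm{Var}_i
                 + R_n \sum_{i=1}^{A_b} \left( \mathrm{Mean}_i - \mathrm{Mean} \right)^2 \right)
\end{aligned}

% With the tail block or the counts parameter (rRec = 1 / R)
\begin{aligned}
\mathrm{Mean} &= \sum_{i=1}^{A_b} \frac{\mathrm{counts}_i}{R} \, \mathrm{Mean}_i \\
\mathrm{Var}  &= \frac{1}{R} \left( \sum_{i=1}^{A_b} \mathrm{Var}_i
                 + \sum_{i=1}^{A_b} \mathrm{counts}_i \left( \mathrm{Mean}_i - \mathrm{Mean} \right)^2 \right)
\end{aligned}
```

With counts_i = R_n for every i and R = R_n × A_b, the second pair reduces to the first, which is a quick consistency check on the reconstruction.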

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    • Scenarios without the counts parameter
      template <bool isReuseSource = false>
      __aicore__ inline void WelfordFinalize(const LocalTensor<float>& outputMean, const LocalTensor<float>& outputVariance, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<uint8_t>& sharedTmpBuffer, WelfordFinalizePara& para)
      
    • Scenarios with the counts parameter
      template <bool isReuseSource = false>
      __aicore__ inline void WelfordFinalize(const LocalTensor<float>& outputMean, const LocalTensor<float>& outputVariance, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<int32_t>& counts, const LocalTensor<uint8_t>& sharedTmpBuffer, WelfordFinalizePara& para)
      
  • Allocate the temporary space through the API framework.
    • Scenarios without the counts parameter
      template <bool isReuseSource = false>
      __aicore__ inline void WelfordFinalize(const LocalTensor<float>& outputMean, const LocalTensor<float>& outputVariance, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, WelfordFinalizePara& para)
      
    • Scenarios with the counts parameter
      template <bool isReuseSource = false>
      __aicore__ inline void WelfordFinalize(const LocalTensor<float>& outputMean, const LocalTensor<float>& outputVariance, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<int32_t>& counts, WelfordFinalizePara& para)
      

The internal implementation of this API involves complex computation and needs extra temporary space to store intermediate variables. The temporary space can be allocated by the API framework or passed in by the developer through the sharedTmpBuffer input parameter.

  • When the API framework allocates the temporary space, you do not need to allocate it yourself, but you must reserve the required size for it.
  • When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not allocate any. You manage the sharedTmpBuffer space yourself and can reuse the buffer after the call, which avoids repeated allocation and deallocation and improves flexibility and buffer utilization.

In both cases, obtain the required temporary space size (BufferSize) with the GetWelfordFinalizeMaxMinTmpSize API described in WelfordFinalize Tiling, which returns the maximum and minimum temporary space sizes. The minimum size ensures functional correctness, while a size up to the maximum is used to improve performance.

Parameters

Table 1 Template parameters

Parameter: Description

isReuseSource: This parameter is reserved. Pass the default value false.

Table 2 API parameters

Parameter (Input/Output): Description

outputMean (Output): Destination operand of the mean. The data type is float. The output mean is a single value that occupies sizeof(float) bytes. Because storage units have alignment requirements, allocate a 32-byte-aligned memory space for outputMean. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

outputVariance (Output): Destination operand of the variance. The data type is float. The output variance is a single value that occupies sizeof(float) bytes. Because storage units have alignment requirements, allocate a 32-byte-aligned memory space for outputVariance. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

inputMean (Input): Source operand of the mean. The data type is float. The shape is [abLength]. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

inputVariance (Input): Source operand of the variance. The data type is float. The shape is [abLength]. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

counts (Input): Source operand that holds the coefficient corresponding to each element of inputMean. The data type is int32_t. The shape is [abLength]. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

sharedTmpBuffer (Input): Temporary space. The data type is uint8_t. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. This parameter stores intermediate variables during the API's internal computation and is provided by the developer. For details about how to obtain the temporary space size (BufferSize), see WelfordFinalize Tiling.

para (Input): Parameter information required for calculation. The WelfordFinalizePara type is defined as follows:

struct WelfordFinalizePara {
    uint32_t rnLength;
    uint32_t abLength;
    uint32_t headCount;
    uint32_t headCountLength;
    uint32_t tailCount;
    uint32_t tailCountLength;
    float abRec;
    float rRec;
};
  • rnLength indicates the number of times that the input Reduce axis is split by abLength. If a tail block remains after splitting, the number is rounded up.
  • abLength indicates the splitting size on the Reduce axis. For the API without the counts parameter, abLength = headCountLength + tailCountLength.
  • headCount indicates the counts coefficient applied to the non-tail block in the formula. It takes effect only in the API without the counts parameter.
  • headCountLength indicates the length (number of elements) that headCount applies to. It takes effect only in the API without the counts parameter.
  • tailCount indicates the counts coefficient applied to the tail block in the formula. It takes effect only in the API without the counts parameter.
  • tailCountLength indicates the length (number of elements) that tailCount applies to. It takes effect only in the API without the counts parameter.
  • abRec indicates the reciprocal of abLength, that is, 1/abLength.
  • rRec indicates 1/(rnLength × abLength) if no tail block remains after the input Reduce axis is split. If there is a tail block, it indicates 1/R, where R is the length of the original Reduce axis.

Returns

None

Restrictions

  • The value of para.abLength must be an integer multiple of 32/sizeof(float), that is, a multiple of 8 elements.
  • The sum of the values of para.headCountLength and para.tailCountLength must be equal to the value of para.abLength.
  • The API processing logic depends only on the values in para, not on the shape information of the source operands.
  • When para.tailCount is 0, para.tailCountLength cannot be set to a non-zero value.
  • The source operand address must not overlap the destination operand address.
  • The address of sharedTmpBuffer must not overlap that of the source or destination operand.
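The restrictions on para lend themselves to a quick host-side sanity check before launching a kernel. The sketch below is only an illustration: ParaMirror is a hypothetical plain C++ copy of the WelfordFinalizePara fields, CheckParaRestrictions is not an Ascend C function, and rejecting abLength = 0 is an extra assumption. The address-overlap restrictions are not covered.

```cpp
#include <cstdint>

// Hypothetical host-side mirror of the WelfordFinalizePara fields, for illustration only.
struct ParaMirror {
    std::uint32_t rnLength;
    std::uint32_t abLength;
    std::uint32_t headCount;
    std::uint32_t headCountLength;
    std::uint32_t tailCount;
    std::uint32_t tailCountLength;
    float abRec;
    float rRec;
};

// Returns true when the para-related restrictions listed above hold.
inline bool CheckParaRestrictions(const ParaMirror& p) {
    constexpr std::uint32_t kElemsPer32B = 32 / sizeof(float);  // 8 floats per 32 bytes
    // abLength must be a multiple of 32/sizeof(float); 0 is rejected too (assumption).
    if (p.abLength == 0 || p.abLength % kElemsPer32B != 0) {
        return false;
    }
    // headCountLength + tailCountLength must equal abLength.
    if (p.headCountLength + p.tailCountLength != p.abLength) {
        return false;
    }
    // A zero tailCount must not come with a non-zero tailCountLength.
    if (p.tailCount == 0 && p.tailCountLength != 0) {
        return false;
    }
    return true;
}
```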

Example

For a complete operator example, see welford_finalize operator sample.
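// Reserve stackBufferSize bytes of temporary space and view them as a uint8_t tensor.
pipe.InitBuffer(sharedTmpBuffer, stackBufferSize);
AscendC::LocalTensor<uint8_t> tmpLocalTensor = sharedTmpBuffer.Get<uint8_t>();
// Fields in declaration order: rnLength, abLength, headCount, headCountLength, tailCount, tailCountLength, abRec, rRec.
AscendC::WelfordFinalizePara para = {rnLength, abLength, head, headLength, tail, tailLength, abRec, rRec};
// Overload that takes the counts parameter and a caller-provided temporary buffer.
AscendC::WelfordFinalize<false>(meanLocal, varianceLocal, inputMeanLocal, inputVarianceLocal, inputCountsLocal, tmpLocalTensor, para);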