WelfordUpdate

Function Usage

Welford is a method for calculating the mean and variance online. It updates the mean and variance incrementally as samples arrive, without storing the samples, which makes it well suited to large-scale data. It also needs only a single pass over the data, which reduces memory accesses and improves computational performance. This API implements the pre-processing step of the Welford algorithm.
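The online, single-pass behavior described above can be illustrated with a small host-side sketch. It is plain C++ rather than Ascend C, and the `RunningStats` type and its members are hypothetical names introduced only for illustration. It follows the textbook Welford update for a single stream of scalars, which may differ in detail from the element-wise tensor computation this API performs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side illustration of the textbook Welford update.
// Not part of the Ascend C API.
struct RunningStats {
    uint64_t count = 0;  // number of samples seen so far
    double mean = 0.0;   // running mean
    double m2 = 0.0;     // running sum of squared deviations from the mean

    // Fold one sample into the running statistics without storing it.
    void Push(double x) {
        ++count;
        const double delta = x - mean;            // deviation from the old mean
        mean += delta / static_cast<double>(count);
        m2 += delta * (x - mean);                 // uses both old and new mean
    }

    // Population variance of the samples seen so far.
    double Variance() const {
        assert(count > 0);
        return m2 / static_cast<double>(count);
    }
};
```

Each call to `Push` touches only three scalars, so memory use stays constant however many samples are processed.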

If the reduced axis in the LayerNorm algorithm is large, you can split the reduced axis and use this API and WelfordFinalize together to implement equivalent calculation of LayerNorm. The calculation formula of this API is as follows:

In the formula, the two accumulated quantities are the mean and the variance of the first n data points, and the input term is the value of the nth point.
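For orientation, the textbook Welford recurrence can be written as below. The symbols are chosen here for illustration: $M_n$ is the mean of the first $n$ points, $x_n$ is the $n$th point, and $S_n$ is the accumulated sum of squared deviations, so that the variance is $S_n / n$. Whether this API stores $S_n$ or $S_n / n$ in its variance tensors is not stated in this section, so treat this as a sketch under that assumption rather than the API's exact definition.

```latex
\begin{aligned}
M_n &= M_{n-1} + \frac{x_n - M_{n-1}}{n} \\
S_n &= S_{n-1} + (x_n - M_{n-1})(x_n - M_n)
\end{aligned}
```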

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    • The data types of the mean value and variance are not fixed.
      template <typename T, typename U, bool isReuseSource = false, const WelfordUpdateConfig& config = WFUPDATE_DEFAULT_CFG>
      __aicore__ inline void WelfordUpdate(const LocalTensor<U>& outputMean, const LocalTensor<U>& outputVariance, const LocalTensor<U>& inputMean, const LocalTensor<U>& inputVariance, const LocalTensor<T>& inputX, const LocalTensor<uint8_t>& sharedTmpBuffer, const WelfordUpdateParam& para)
      
  • Allocate the temporary space through the API framework.
    • The data types of the mean value and variance are not fixed.
      template <typename T, typename U, bool isReuseSource = false, const WelfordUpdateConfig& config = WFUPDATE_DEFAULT_CFG>
      __aicore__ inline void WelfordUpdate(const LocalTensor<U>& outputMean, const LocalTensor<U>& outputVariance, const LocalTensor<U>& inputMean, const LocalTensor<U>& inputVariance, const LocalTensor<T>& inputX, const WelfordUpdateParam& para)
      

Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by you through the sharedTmpBuffer input parameter.

  • When the API framework is used for temporary space allocation, you do not need to allocate the space, but must reserve the required size for the space.
  • When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

If the API framework is used, reserve the temporary space. If sharedTmpBuffer is used, allocate space for the tensor. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required maximum and minimum temporary space sizes using the GetWelfordUpdateMaxMinTmpSize API provided in WelfordUpdate Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.
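The relationship between the minimum size, the maximum size, and the space actually available can be sketched on the host side as follows. `ChooseTmpBufferSize` is a hypothetical helper, not an Ascend C API, and its inputs are assumed to be the two sizes already obtained from the tiling API plus the number of bytes the caller can spare.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of Ascend C.
// minSize / maxSize: values assumed to come from the tiling API.
// available: bytes the caller can spare for the temporary space.
// Returns 0 when even the minimum cannot be satisfied.
inline uint32_t ChooseTmpBufferSize(uint32_t minSize, uint32_t maxSize, uint32_t available)
{
    assert(minSize <= maxSize);
    if (available < minSize) {
        return 0;  // not enough space for correct functionality
    }
    // Anything above maxSize brings no further benefit, so cap there.
    return std::min(available, maxSize);
}
```

A return value of 0 signals that the tiling has to be revised, for example by splitting the reduced axis into smaller pieces.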

Parameters

Table 1 Parameters in the template

| Parameter | Description |
| --- | --- |
| T | Data type of the inputX operand. |
| U | Data type of the outputMean, outputVariance, inputMean, and inputVariance operands. |
| isReuseSource | Whether the source operand can be modified. The default value is false. If this parameter is set to true, the inputX memory space is reused during internal computation of this API to reduce memory usage. If it is set to false, the inputX memory space is not reused. For details about how to use isReuseSource, see More Examples. |
| config | Reuse relationship between the destination operand and the source operand for data outside the specified computation range. The WelfordUpdateConfig type is defined as follows: |

struct WelfordUpdateConfig {
    bool isInplace = false; // Whether the destination operand reuses the source operand.
};
  • isInplace: The abComputeLength field of para (see the API parameters) specifies the calculation length along the inner axis of the input data. isInplace determines what the output holds beyond that length. If it is true, the source operand at the corresponding position is written to the destination operand. If it is false, this API does not write the destination operand beyond the calculation length.
    • false (default value): The destination operand does not reuse the source operand.
    • true: The destination operand reuses the source operand. outputMean reuses inputMean, and outputVariance reuses inputVariance.
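The effect of isInplace on elements beyond the calculation length can be illustrated with a host-side reference. This is a sketch in plain C++, not the Ascend C implementation: `RefParam` and `WelfordUpdateRef` are hypothetical names, and the per-element update inside the calculation length is assumed to be the textbook Welford step weighted by `nRec`. Only the handling of the tail follows directly from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the parameter struct, for host-side illustration only.
struct RefParam {
    uint32_t rnLength;
    uint32_t abLength;
    uint32_t abComputeLength;
    float nRec;
};

// Hypothetical reference, not the Ascend C API.
void WelfordUpdateRef(std::vector<float>& outMean, std::vector<float>& outVar,
                      const std::vector<float>& inMean, const std::vector<float>& inVar,
                      const std::vector<float>& x, const RefParam& p, bool isInplace)
{
    assert(p.abComputeLength <= p.abLength);
    const std::size_t total = static_cast<std::size_t>(p.rnLength) * p.abLength;
    assert(outMean.size() == total && outVar.size() == total);
    assert(inMean.size() == total && inVar.size() == total && x.size() == total);

    for (uint32_t r = 0; r < p.rnLength; ++r) {
        for (uint32_t c = 0; c < p.abLength; ++c) {
            const std::size_t i = static_cast<std::size_t>(r) * p.abLength + c;
            if (c < p.abComputeLength) {
                // Assumed element-wise update (textbook Welford step).
                const float delta = x[i] - inMean[i];
                outMean[i] = inMean[i] + delta * p.nRec;
                outVar[i] = inVar[i] + delta * (x[i] - outMean[i]);
            } else if (isInplace) {
                // Beyond the calculation length: copy the source through.
                outMean[i] = inMean[i];
                outVar[i] = inVar[i];
            }
            // isInplace == false: the destination tail is left untouched.
        }
    }
}
```

With isInplace set to false, whatever the destination tensors held before the call remains in the tail, so the caller should not rely on those values.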

A configuration example is as follows:

constexpr WelfordUpdateConfig WFUPDATE_DEFAULT_CFG = {false};

This parameter is used together with the tiling computation API on the kernel.

Table 2 API parameters

| Parameter | Input/Output | Description |
| --- | --- | --- |
| outputMean | Output | Mean, destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The shape must be the same as that of the source operand inputMean. |
| outputVariance | Output | Variance, destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The shape must be the same as that of the source operand inputVariance. |
| inputMean | Input | Mean, source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| inputVariance | Input | Variance, source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| inputX | Input | Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| sharedTmpBuffer | Input | Temporary space. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. This parameter stores intermediate variables during complex internal API computation and is provided by the developer. For details about how to obtain the temporary space size (BufferSize), see WelfordUpdate Tiling. |
| para | Input | Parameter information required for calculation. The WelfordUpdateParam type is defined as follows: |

struct WelfordUpdateParam {
    uint32_t rnLength; // Number of times that the reduced axis is split by abLength (rounded down)
    uint32_t abLength; // Size of the reduced axis split
    uint32_t abComputeLength; // Actual calculated length of the reduced axis
    float nRec; // The value is 1/abComputeLength.
};
  • rnLength indicates the number of times that the reduced axis is split by abLength (rounded down).
  • abLength indicates the size of the reduced axis split.
  • abComputeLength indicates the actual length of the reduced axis calculated from the input start address.
  • nRec indicates the reciprocal of abComputeLength, that is, 1/abComputeLength.

The shape of each destination operand and source operand is [rnLength, abLength].
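The relationships among the four fields can be sketched on the host side as follows. `WelfordUpdateParamSketch` and `MakeWelfordUpdateParam` are hypothetical names that mirror the struct above for illustration. The check that abComputeLength does not exceed abLength is an assumption inferred from the description of the calculation length, not a documented constraint.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side mirror of WelfordUpdateParam, for illustration only.
struct WelfordUpdateParamSketch {
    uint32_t rnLength;
    uint32_t abLength;
    uint32_t abComputeLength;
    float nRec;
};

// Hypothetical helper: fills nRec as 1/abComputeLength, as described above.
inline WelfordUpdateParamSketch MakeWelfordUpdateParam(uint32_t rnLength, uint32_t abLength,
                                                       uint32_t abComputeLength)
{
    // Assumed constraints: a non-empty calculation range that fits in one row.
    assert(abComputeLength > 0);
    assert(abComputeLength <= abLength);
    return { rnLength, abLength, abComputeLength,
             1.0f / static_cast<float>(abComputeLength) };
}

// Number of elements in each [rnLength, abLength] operand.
inline uint64_t ElementCount(const WelfordUpdateParamSketch& p)
{
    return static_cast<uint64_t>(p.rnLength) * p.abLength;
}
```

For example, `MakeWelfordUpdateParam(1, 64, 32)` describes one row of 64 elements of which the first 32 are computed, with nRec equal to 1/32.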

Returns

None

Availability

Precautions

Example

#include "kernel_operator.h"

constexpr AscendC::WelfordUpdateConfig WELFORD_UPDATE_ENABLE_INPLACE_CFG = { true };
constexpr AscendC::WelfordUpdateConfig WELFORD_UPDATE_UNENABLE_INPLACE_CFG = { false };

template <typename dataType, typename dataTypeU, bool isInplace = false> class KernelWelfordUpdate {
public:
    __aicore__ inline KernelWelfordUpdate() {}
    __aicore__ inline void Init(GM_ADDR inputX_gm, GM_ADDR inputmean_gm, GM_ADDR inputvar_gm, GM_ADDR outputMean_gm,
        GM_ADDR outputVariance_gm, uint32_t nLength, uint32_t rLength, uint32_t abComputeLength)
    {
        this->nLength = nLength;
        this->rLength = rLength;
        this->abComputeLength = abComputeLength;
        totalLength = nLength * rLength;

        inputX_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(inputX_gm), totalLength);
        inputmean_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataTypeU *>(inputmean_gm), totalLength);
        inputvar_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataTypeU *>(inputvar_gm), totalLength);

        outputMean_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataTypeU *>(outputMean_gm), totalLength);
        outputVariance_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataTypeU *>(outputVariance_gm), totalLength);

        pipe.InitBuffer(inQueueX, 1, sizeof(dataType) * totalLength);
        pipe.InitBuffer(inQueueMean, 1, sizeof(dataTypeU) * totalLength);
        pipe.InitBuffer(inQueueVar, 1, sizeof(dataTypeU) * totalLength);
        pipe.InitBuffer(outQueueMean, 1, sizeof(dataTypeU) * totalLength);
        pipe.InitBuffer(outQueueVariance, 1, sizeof(dataTypeU) * totalLength);
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<dataType> inputXLocal = inQueueX.AllocTensor<dataType>();
        AscendC::LocalTensor<dataTypeU> inmeanLocal = inQueueMean.AllocTensor<dataTypeU>();
        AscendC::LocalTensor<dataTypeU> invarLocal = inQueueVar.AllocTensor<dataTypeU>();

        AscendC::DataCopy(inputXLocal, inputX_global, totalLength);
        AscendC::DataCopy(inmeanLocal, inputmean_global, totalLength);
        AscendC::DataCopy(invarLocal, inputvar_global, totalLength);

        inQueueX.EnQue(inputXLocal);
        inQueueMean.EnQue(inmeanLocal);
        inQueueVar.EnQue(invarLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<dataType> inputXLocal = inQueueX.DeQue<dataType>();
        AscendC::LocalTensor<dataTypeU> inmeanLocal = inQueueMean.DeQue<dataTypeU>();
        AscendC::LocalTensor<dataTypeU> invarLocal = inQueueVar.DeQue<dataTypeU>();

        AscendC::LocalTensor<dataTypeU> meanLocal = outQueueMean.AllocTensor<dataTypeU>();
        AscendC::LocalTensor<dataTypeU> varianceLocal = outQueueVariance.AllocTensor<dataTypeU>();

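        // rnLength = nLength, abLength = rLength; nRec is the reciprocal of abComputeLength.
        AscendC::WelfordUpdateParam para = { nLength, rLength, abComputeLength,
            1.0f / static_cast<float>(abComputeLength) };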
        if constexpr (isInplace) {
            AscendC::WelfordUpdate<dataType, dataTypeU, false, WELFORD_UPDATE_ENABLE_INPLACE_CFG>(meanLocal, varianceLocal,
                inmeanLocal, invarLocal, inputXLocal, para);
        } else {
            AscendC::WelfordUpdate<dataType, dataTypeU, false, WELFORD_UPDATE_UNENABLE_INPLACE_CFG>(meanLocal, varianceLocal,
                inmeanLocal, invarLocal, inputXLocal, para);
        }

        outQueueMean.EnQue<dataTypeU>(meanLocal);
        outQueueVariance.EnQue<dataTypeU>(varianceLocal);

        inQueueX.FreeTensor(inputXLocal);
        inQueueMean.FreeTensor(inmeanLocal);
        inQueueVar.FreeTensor(invarLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<dataTypeU> meanLocal = outQueueMean.DeQue<dataTypeU>();
        AscendC::LocalTensor<dataTypeU> varianceLocal = outQueueVariance.DeQue<dataTypeU>();

        AscendC::DataCopy(outputMean_global, meanLocal, totalLength);
        AscendC::DataCopy(outputVariance_global, varianceLocal, totalLength);

        outQueueMean.FreeTensor(meanLocal);
        outQueueVariance.FreeTensor(varianceLocal);
    }

private:
    AscendC::GlobalTensor<dataType> inputX_global;
    AscendC::GlobalTensor<dataTypeU> inputmean_global;
    AscendC::GlobalTensor<dataTypeU> inputvar_global;
    AscendC::GlobalTensor<dataTypeU> outputMean_global;
    AscendC::GlobalTensor<dataTypeU> outputVariance_global;

    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueMean;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueVar;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueMean;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueVariance;

    uint32_t nLength;
    uint32_t rLength;
    uint32_t abComputeLength;
    uint32_t totalLength;
};