WelfordUpdate
Applicability
| Product | Supported |
|---|---|
| | √ |
| | √ |
| | x |
| | √ |
| | x |
| | x |
Function
Welford's algorithm computes the mean and variance online. It updates the mean and variance incrementally as samples arrive, so the samples do not need to be stored, which makes it well suited to large-scale data. It also needs only a single pass over the data, which reduces memory accesses and improves computational performance. This API implements the pre-processing step of the Welford algorithm.
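As an illustration of the single-pass idea, the following host-side C++ sketch keeps only a count, a running mean, and a running sum of squared deviations. It is plain C++ rather than Ascend C, the names RunningStats and ComputeStats are hypothetical, and it uses the textbook form of the Welford step, which may differ in detail from the internal implementation of this API.

```cpp
#include <cstddef>
#include <vector>

// Running statistics for a stream of samples (hypothetical helper type).
struct RunningStats {
    std::size_t count = 0;  // number of samples seen so far
    double mean = 0.0;      // running mean
    double m2 = 0.0;        // running sum of squared deviations from the mean

    // Fold one sample into the running mean and m2 (one textbook Welford step).
    void Update(double x) {
        ++count;
        const double delta = x - mean;  // deviation from the old mean
        mean += delta / static_cast<double>(count);
        m2 += delta * (x - mean);       // old-mean deviation times new-mean deviation
    }

    // Population variance; returns 0 when no samples have been seen.
    double Variance() const {
        return count == 0 ? 0.0 : m2 / static_cast<double>(count);
    }
};

// One pass over the data; no sample is stored beyond the current one.
inline RunningStats ComputeStats(const std::vector<double>& samples) {
    RunningStats stats;
    for (double x : samples) {
        stats.Update(x);
    }
    return stats;
}
```

Only three scalars are carried between samples, which is why the method suits data that is too large to hold in memory at once.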
If the Reduce axis in the LayerNorm algorithm is large, you can split the axis and use this API and WelfordFinalize together to implement equivalent computation of LayerNorm.
As shown in the following figure, the Reduce axis of the data is split. Assume that the shape of each piece of data after splitting is [1, k], and each piece of data is numbered 1, 2, 3,..., n.

Below is the formula of this API. After data is split, this API is called n times. The following formula is used to compute each piece of split data.


In the formula, xi, Meanti, and Mi all have the shape [1, k]. xi is the ith data block after splitting. Meanti is the mean of the first i data blocks, obtained from the ith call to this API. Mi is the intermediate variance result of the first i data blocks, obtained from the ith call to this API. (The intermediate result is the value saved for computing the variance in the following sections.) For the first call (i = 1), the user must supply Meant0 and M0 in the formula as data of shape [1, k] with all values set to 0.
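For reference, the textbook Welford recurrence expressed with the symbols defined above is shown below, with Meanti written as $\mathrm{Mean}_{t_i}$. This is a reconstruction that assumes the API follows the standard form; it is not copied from the API's own formula, so treat it as a reading aid only. All operations are element-wise over the [1, k] shape.

```latex
\begin{aligned}
\mathrm{Mean}_{t_i} &= \mathrm{Mean}_{t_{i-1}} + \frac{x_i - \mathrm{Mean}_{t_{i-1}}}{i}, \\
M_i &= M_{i-1} + \left(x_i - \mathrm{Mean}_{t_{i-1}}\right)\left(x_i - \mathrm{Mean}_{t_i}\right), \\
\mathrm{Mean}_{t_0} &= 0, \qquad M_0 = 0, \qquad i = 1, 2, \dots, n .
\end{aligned}
```

Under this assumption, after n calls each element of $M_n$ equals the sum of squared deviations of the corresponding n inputs from their mean, so dividing by n would give the population variance.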
The following figure shows the computation process of Meantn. After this API is called n times, Meantn and Mn with the shape of [1, k] are obtained. Meantn and Mn are used for subsequent computation by the WelfordFinalize API.
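The call sequence described above can be emulated on the host to check results. The sketch below is plain C++ rather than Ascend C. ReferenceWelfordUpdate, BlockStats, and AccumulateBlocks are hypothetical names, and the update rule is the textbook Welford step, assumed (not confirmed) to match this API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for one update call on a [1, k] block.
// On entry, mean and m hold the results of the previous call (all zeros before
// the first call); on exit they hold the results of call i (1-based).
inline void ReferenceWelfordUpdate(std::vector<float>& mean, std::vector<float>& m,
                                   const std::vector<float>& x, std::size_t i) {
    const float invCount = 1.0f / static_cast<float>(i);
    for (std::size_t j = 0; j < x.size(); ++j) {
        const float delta = x[j] - mean[j];  // deviation from the previous mean
        mean[j] += delta * invCount;
        m[j] += delta * (x[j] - mean[j]);    // accumulate squared deviations
    }
}

struct BlockStats {
    std::vector<float> mean;  // per-element mean after n calls
    std::vector<float> m;     // per-element intermediate variance result after n calls
};

// Splits a reduce axis of length n * k into n blocks of shape [1, k] and applies
// the update n times. Assumes k > 0 and that data.size() is a multiple of k.
inline BlockStats AccumulateBlocks(const std::vector<float>& data, std::size_t k) {
    BlockStats s{std::vector<float>(k, 0.0f), std::vector<float>(k, 0.0f)};
    const std::size_t n = data.size() / k;
    for (std::size_t i = 1; i <= n; ++i) {
        const auto first = data.begin() + static_cast<std::ptrdiff_t>((i - 1) * k);
        const std::vector<float> block(first, first + static_cast<std::ptrdiff_t>(k));
        ReferenceWelfordUpdate(s.mean, s.m, block, i);
    }
    return s;
}
```

Each of the k positions accumulates its own statistics over the n blocks; combining those k partial results into a single mean and variance is left to the finalize step.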

Prototype
- Pass the temporary space through the sharedTmpBuffer input parameter.

      template <typename T, typename U, bool isReuseSource = false, const WelfordUpdateConfig& config = WFUPDATE_DEFAULT_CFG>
      __aicore__ inline void WelfordUpdate(const LocalTensor<U>& outputMean, const LocalTensor<U>& outputVariance,
          const LocalTensor<U>& inputMean, const LocalTensor<U>& inputVariance, const LocalTensor<T>& inputX,
          const LocalTensor<uint8_t>& sharedTmpBuffer, const WelfordUpdateParam& para)
- Allocate the temporary space through the API framework.

      template <typename T, typename U, bool isReuseSource = false, const WelfordUpdateConfig& config = WFUPDATE_DEFAULT_CFG>
      __aicore__ inline void WelfordUpdate(const LocalTensor<U>& outputMean, const LocalTensor<U>& outputVariance,
          const LocalTensor<U>& inputMean, const LocalTensor<U>& inputVariance, const LocalTensor<T>& inputX,
          const WelfordUpdateParam& para)

The internal implementation of this API involves complex computation and needs extra temporary space to store intermediate variables. The temporary space can either be allocated through the API framework or be passed in by the developer through the sharedTmpBuffer input parameter.
- When the API framework allocates the temporary space, you do not allocate it yourself, but you must reserve enough space for it.
- When the temporary space is passed through sharedTmpBuffer, that tensor serves as the temporary space and the API framework does not allocate any. You manage the sharedTmpBuffer space yourself and can reuse the buffer after the call, which avoids repeated allocation and deallocation and improves flexibility and buffer utilization.
To obtain the temporary space size (BufferSize) to reserve or allocate, use the GetWelfordUpdateMaxMinTmpSize API provided in WelfordUpdate Tiling to get the maximum and minimum temporary space sizes. The minimum size guarantees functional correctness; the maximum size improves performance.
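The trade-off between the minimum and maximum sizes can be sketched as a small host-side helper. ChooseTmpBufferSize is a hypothetical function, not part of the API: minSize and maxSize stand for the two values reported by the tiling API, available is the local memory the operator can spare, and the sketch assumes that any size between the minimum and maximum is acceptable.

```cpp
#include <cstdint>

// Hypothetical helper: picks a temporary-buffer size in bytes.
// Returns false when even the minimum size does not fit, in which case
// the operator cannot run correctly with the given budget.
inline bool ChooseTmpBufferSize(uint32_t minSize, uint32_t maxSize,
                                uint32_t available, uint32_t& chosen) {
    if (available < minSize) {
        return false;  // not enough space for correct results
    }
    // Use as much as is available, but never more than the maximum,
    // because space beyond the maximum brings no further benefit.
    chosen = available < maxSize ? available : maxSize;
    return true;
}
```

With a budget of 2048 bytes and reported sizes of 1024 (minimum) and 4096 (maximum), this helper would select 2048 bytes.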
Parameters
| Parameter | Description |
|---|---|
| T | Data type of the inputX operand. |
| U | Data type of the outputMean, outputVariance, inputMean, and inputVariance operands. |
| isReuseSource | Whether the source operand may be modified. The default value is false. If this parameter is set to true, the API reuses the inputX memory space during internal computation, which reduces memory usage but means the contents of inputX may be modified. If this parameter is set to false, the inputX memory space is not reused. For details about how to use isReuseSource, see Example 4. |
| config | Configuration of type WelfordUpdateConfig that specifies the reuse relationship between the destination operands and the source operands for data outside the specified computation range. For a configuration example, see the WELFORD_UPDATE_ENABLE_INPLACE_CFG and WELFORD_UPDATE_UNENABLE_INPLACE_CFG constants in the Example section. This parameter is used together with the tiling computation API in the kernel. |
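The config parameter is a non-type template parameter passed by const reference, so the choice it encodes is fixed at compile time. The following stand-alone C++17 sketch shows that mechanism with a stand-in type. UpdateConfig, its isInplace field, and PathFor are assumptions made for illustration, not the real definitions.

```cpp
// Stand-in for the configuration type; the field name isInplace is an assumption.
struct UpdateConfig {
    bool isInplace;
};

// Namespace-scope constexpr objects can be bound to reference template parameters.
constexpr UpdateConfig DEFAULT_CFG = { false };
constexpr UpdateConfig INPLACE_CFG = { true };

// The branch is resolved at compile time, so each instantiation contains
// only the code path selected by its configuration.
template <typename T, const UpdateConfig& config = DEFAULT_CFG>
constexpr int PathFor() {
    if constexpr (config.isInplace) {
        return 1;  // in-place path
    } else {
        return 0;  // default path
    }
}
```

Calling `PathFor<float>()` selects the default path, while `PathFor<float, INPLACE_CFG>()` selects the in-place path. The Example section passes its two configuration constants to WelfordUpdate in the same way.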
| Parameter | Input/Output | Description |
|---|---|---|
| outputMean | Output | Destination operand of the mean, corresponding to Meanti in the API formula. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The shape must be the same as that of the source operand inputMean. |
| outputVariance | Output | Destination operand of the intermediate variance result, corresponding to Mi in the API formula. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The shape must be the same as that of the source operand inputVariance. |
| inputMean | Input | Source operand of the mean, corresponding to Meanti-1 (the mean after the previous call) in the API formula. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| inputVariance | Input | Source operand of the intermediate variance result, corresponding to Mi-1 (the intermediate result after the previous call) in the API formula. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| inputX | Input | Source operand, corresponding to xi in the API formula. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| sharedTmpBuffer | Input | Temporary space provided by the developer to store intermediate variables during internal computation. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. For details about how to obtain the temporary space size (BufferSize), see WelfordUpdate Tiling. |
| para | Input | Parameter information required for computation, of type WelfordUpdateParam. Its fields include rnLength, abLength, and abComputeLength (see Restrictions). The shape of each destination operand and source operand is [rnLength, abLength]. |
Returns
None
Restrictions
- Currently, the value of para.rnLength can only be 1.
- The value of para.abLength must be an integer multiple of 32/sizeof(T), that is, each row of inputX must occupy a multiple of 32 bytes.
- The value of para.abComputeLength must be greater than 0.
- The source operand address must not overlap the destination operand address.
- The address of sharedTmpBuffer must not overlap that of the source or destination operand.
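The first three restrictions can be checked on the host before launching the kernel. The sketch below uses a hypothetical struct, UpdateParamSketch, that mirrors only the three fields named above. It is not the real WelfordUpdateParam definition, and the address-overlap restrictions are not covered.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the parameter fields named in the restrictions.
struct UpdateParamSketch {
    uint32_t rnLength;
    uint32_t abLength;
    uint32_t abComputeLength;
};

// Returns true when the length restrictions hold for element type T
// (T plays the role of the inputX data type).
template <typename T>
constexpr bool SatisfiesRestrictions(const UpdateParamSketch& p) {
    static_assert(sizeof(T) <= 32 && 32 % sizeof(T) == 0,
                  "element size must divide 32 bytes");
    constexpr uint32_t elemsPer32Bytes = 32 / sizeof(T);
    return p.rnLength == 1                        // only rnLength == 1 is supported
        && p.abLength % elemsPer32Bytes == 0      // abLength is a multiple of 32/sizeof(T)
        && p.abComputeLength > 0;                 // some data must be computed
}
```

For a 4-byte element type, abLength must therefore be a multiple of 8; for a 2-byte type, a multiple of 16.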
Example
For a complete operator example, see welford_update operator sample.

    #include "kernel_operator.h"

    // Configurations passed as the config template argument of WelfordUpdate.
    constexpr AscendC::WelfordUpdateConfig WELFORD_UPDATE_ENABLE_INPLACE_CFG = { true };
    constexpr AscendC::WelfordUpdateConfig WELFORD_UPDATE_UNENABLE_INPLACE_CFG = { false };

    template <typename dataType, typename dataTypeU, bool isInplace = false>
    class KernelWelfordUpdate {
    public:
        __aicore__ inline KernelWelfordUpdate() {}

        // Binds the global-memory buffers and reserves one local buffer per queue.
        __aicore__ inline void Init(GM_ADDR inputX_gm, GM_ADDR inputmean_gm, GM_ADDR inputvar_gm,
            GM_ADDR outputMean_gm, GM_ADDR outputVariance_gm, uint32_t nLength, uint32_t rLength,
            uint32_t abComputeLength)
        {
            this->nLength = nLength;
            this->rLength = rLength;
            this->abComputeLength = abComputeLength;
            totalLength = nLength * rLength;

            inputX_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataType *>(inputX_gm), totalLength);
            inputmean_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataTypeU *>(inputmean_gm), totalLength);
            inputvar_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataTypeU *>(inputvar_gm), totalLength);
            outputMean_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataTypeU *>(outputMean_gm), totalLength);
            outputVariance_global.SetGlobalBuffer(reinterpret_cast<__gm__ dataTypeU *>(outputVariance_gm), totalLength);

            pipe.InitBuffer(inQueueX, 1, sizeof(dataType) * totalLength);
            pipe.InitBuffer(inQueueMean, 1, sizeof(dataTypeU) * totalLength);
            pipe.InitBuffer(inQueueVar, 1, sizeof(dataTypeU) * totalLength);
            pipe.InitBuffer(outQueueMean, 1, sizeof(dataTypeU) * totalLength);
            pipe.InitBuffer(outQueueVariance, 1, sizeof(dataTypeU) * totalLength);
        }

        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }

    private:
        // Copies the three inputs from global memory into local tensors.
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<dataType> inputXLocal = inQueueX.AllocTensor<dataType>();
            AscendC::LocalTensor<dataTypeU> inmeanLocal = inQueueMean.AllocTensor<dataTypeU>();
            AscendC::LocalTensor<dataTypeU> invarLocal = inQueueVar.AllocTensor<dataTypeU>();

            AscendC::DataCopy(inputXLocal, inputX_global, totalLength);
            AscendC::DataCopy(inmeanLocal, inputmean_global, totalLength);
            AscendC::DataCopy(invarLocal, inputvar_global, totalLength);

            inQueueX.EnQue(inputXLocal);
            inQueueMean.EnQue(inmeanLocal);
            inQueueVar.EnQue(invarLocal);
        }

        // Runs one WelfordUpdate call; the temporary space is allocated by the API framework.
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<dataType> inputXLocal = inQueueX.DeQue<dataType>();
            AscendC::LocalTensor<dataTypeU> inmeanLocal = inQueueMean.DeQue<dataTypeU>();
            AscendC::LocalTensor<dataTypeU> invarLocal = inQueueVar.DeQue<dataTypeU>();

            AscendC::LocalTensor<dataTypeU> meanLocal = outQueueMean.AllocTensor<dataTypeU>();
            AscendC::LocalTensor<dataTypeU> varianceLocal = outQueueVariance.AllocTensor<dataTypeU>();

            // nLength is passed as rnLength and rLength as abLength.
            struct AscendC::WelfordUpdateParam para = { nLength, rLength, abComputeLength, 0.3 };

            // Select the configuration at compile time.
            if constexpr (isInplace) {
                AscendC::WelfordUpdate<dataType, dataTypeU, false, WELFORD_UPDATE_ENABLE_INPLACE_CFG>(
                    meanLocal, varianceLocal, inmeanLocal, invarLocal, inputXLocal, para);
            } else {
                AscendC::WelfordUpdate<dataType, dataTypeU, false, WELFORD_UPDATE_UNENABLE_INPLACE_CFG>(
                    meanLocal, varianceLocal, inmeanLocal, invarLocal, inputXLocal, para);
            }

            outQueueMean.EnQue<dataTypeU>(meanLocal);
            outQueueVariance.EnQue<dataTypeU>(varianceLocal);

            inQueueX.FreeTensor(inputXLocal);
            inQueueMean.FreeTensor(inmeanLocal);
            inQueueVar.FreeTensor(invarLocal);
        }

        // Copies the updated mean and intermediate variance result back to global memory.
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<dataTypeU> meanLocal = outQueueMean.DeQue<dataTypeU>();
            AscendC::LocalTensor<dataTypeU> varianceLocal = outQueueVariance.DeQue<dataTypeU>();

            AscendC::DataCopy(outputMean_global, meanLocal, totalLength);
            AscendC::DataCopy(outputVariance_global, varianceLocal, totalLength);

            outQueueMean.FreeTensor(meanLocal);
            outQueueVariance.FreeTensor(varianceLocal);
        }

    private:
        AscendC::GlobalTensor<dataType> inputX_global;
        AscendC::GlobalTensor<dataTypeU> inputmean_global;
        AscendC::GlobalTensor<dataTypeU> inputvar_global;
        AscendC::GlobalTensor<dataTypeU> outputMean_global;
        AscendC::GlobalTensor<dataTypeU> outputVariance_global;

        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueMean;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueVar;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueMean;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueVariance;

        uint32_t nLength;
        uint32_t rLength;
        uint32_t abComputeLength;
        uint32_t totalLength;
    };