WelfordUpdate

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Welford is an online method for computing the mean and variance. It updates the mean and variance incrementally without storing the samples, which makes it well suited to large-scale data. It also needs only a single pass over the data, which reduces memory access and improves computational performance. This API implements the pre-processing step of the Welford algorithm.

If the Reduce axis in the LayerNorm algorithm is large, you can split the axis and use this API and WelfordFinalize together to implement equivalent computation of LayerNorm.

As shown in the following figure, the Reduce axis of the data is split. Assume that the shape of each piece of data after splitting is [1, k], and each piece of data is numbered 1, 2, 3,..., n.

Figure 1 Splitting of the Reduce axis

Below is the formula of this API. After data is split, this API is called n times. The following formula is used to compute each piece of split data.

In the formula, x_i, Meant_i, and M_i all have the shape [1, k]. x_i is the ith data block after splitting. Meant_i is the mean of the first i data blocks, obtained from the ith call to this API. M_i is the intermediate variance result for the first i data blocks, obtained from the ith call to this API. (The intermediate result is saved for the subsequent variance computation.) For the first call (i = 1), Meant₀ and M₀ in the formula are user-defined data of shape [1, k] with all values set to 0.

The following figure shows the computation process of Meant_n. After this API is called n times, Meant_n and M_n with the shape of [1, k] are obtained. Meant_n and M_n are used for subsequent computation by the WelfordFinalize API.

Figure 2 Meant_n computation process
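
The computation described above can be sketched on the host in plain C++. This is a minimal reference sketch, not the Ascend C kernel API. It assumes the API follows the standard Welford recurrence, Meant_i = Meant_(i-1) + (x_i - Meant_(i-1)) / i and M_i = M_(i-1) + (x_i - Meant_(i-1)) * (x_i - Meant_i), applied independently to each of the k positions. WelfordState and WelfordUpdateRef are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference only (assumption: standard Welford recurrence).
// Holds the running state for the k positions of one split block.
struct WelfordState {
    std::vector<float> mean;  // Meant_i, shape [1, k]
    std::vector<float> m;     // M_i, shape [1, k]
    std::size_t count = 0;    // number of blocks folded in so far (i)

    // Meant_0 and M_0 start as all zeros, as the text requires.
    explicit WelfordState(std::size_t k) : mean(k, 0.0f), m(k, 0.0f) {}
};

// Folds block x_i (shape [1, k]) into the running state.
inline void WelfordUpdateRef(WelfordState& s, const std::vector<float>& x) {
    assert(x.size() == s.mean.size());
    ++s.count;
    const float inv = 1.0f / static_cast<float>(s.count);
    for (std::size_t j = 0; j < x.size(); ++j) {
        const float delta = x[j] - s.mean[j];  // x_i - Meant_(i-1)
        s.mean[j] += delta * inv;              // Meant_i
        s.m[j] += delta * (x[j] - s.mean[j]);  // M_i
    }
}
```

Starting from an all-zero state and calling WelfordUpdateRef once per block mirrors the n calls described above. Under the same assumption, M_n / n is the population variance at each of the k positions.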

Prototype

Pass the temporary space through the sharedTmpBuffer input parameter.

template <typename T, typename U, bool isReuseSource = false, const WelfordUpdateConfig& config = WFUPDATE_DEFAULT_CFG>
__aicore__ inline void WelfordUpdate(const LocalTensor<U>& outputMean, const LocalTensor<U>& outputVariance, const LocalTensor<U>& inputMean, const LocalTensor<U>& inputVariance, const LocalTensor<T>& inputX, const LocalTensor<uint8_t>& sharedTmpBuffer, const WelfordUpdateParam& para)

Allocate the temporary space through the API framework.

template <typename T, typename U, bool isReuseSource = false, const WelfordUpdateConfig& config = WFUPDATE_DEFAULT_CFG>
__aicore__ inline void WelfordUpdate(const LocalTensor<U>& outputMean, const LocalTensor<U>& outputVariance, const LocalTensor<U>& inputMean, const LocalTensor<U>& inputVariance, const LocalTensor<T>& inputX, const WelfordUpdateParam& para)

The internal implementation of this API involves complex computation and needs extra temporary space to store intermediate variables. The temporary space can be allocated through the API framework, or you can pass it in through the sharedTmpBuffer input parameter.

When the API framework is used for temporary space allocation, you do not need to allocate the space, but must reserve the required size for the temporary space.

When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

If the API framework is used, you must reserve the temporary space. If sharedTmpBuffer is used, you must allocate space for the tensor. In both cases, obtain the temporary space size (BufferSize) as follows: call the GetWelfordUpdateMaxMinTmpSize API provided in WelfordUpdate Tiling to get the maximum and minimum temporary space sizes. The minimum size guarantees correct functionality, while the maximum size is used to improve performance.
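
Choosing a size between the reported minimum and maximum can be sketched as below. This is a hypothetical host-side helper, not part of the Ascend C API: PickWelfordTmpSize and the budget parameter are illustrative names, minSize and maxSize stand for the values reported by GetWelfordUpdateMaxMinTmpSize, and the sketch assumes that any size between the two is acceptable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (assumption: any size in [minSize, maxSize] is valid).
// minSize / maxSize: values reported by GetWelfordUpdateMaxMinTmpSize.
// budget: bytes of local memory still available for temporary space.
// Returns the number of bytes to reserve, or 0 if the minimum cannot be met.
inline std::uint32_t PickWelfordTmpSize(std::uint32_t minSize,
                                        std::uint32_t maxSize,
                                        std::uint32_t budget) {
    assert(minSize <= maxSize);
    if (budget < minSize) {
        return 0;  // not enough room for correct functionality
    }
    // Use as much as is available, but never more than the maximum.
    return budget < maxSize ? budget : maxSize;
}
```

A return value of 0 signals that the tiling has to be adjusted, for example by splitting the data further, before the API can be called.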

Parameters

Table 1 Template parameters

Parameter

Description

T

Data type of the inputX operand.

For the Atlas A3 training products/Atlas A3 inference products, the supported data types are half and float.

For the Atlas A2 training products/Atlas A2 inference products, the supported data types are half and float.

For the Atlas inference product's AI Core, the supported data types are half and float.

U

Data type of the outputMean, outputVariance, inputMean, and inputVariance operands.

For the Atlas A3 training products/Atlas A3 inference products, the supported data type is float.

For the Atlas A2 training products/Atlas A2 inference products, the supported data type is float.

For the Atlas inference product's AI Core, the supported data type is float.

isReuseSource

Whether the source operand can be modified. The default value is false. If you allow the source operand to be modified, enable this parameter to reduce memory space usage.

If this parameter is set to true, the inputX memory space is reused during internal computation of this API to reduce memory space usage. If this parameter is set to false, the inputX memory space is not reused during internal computation of this API.

For the Atlas inference product's AI Core, this parameter is reserved. Pass the default value false.

For details about how to use isReuseSource, see Example 4.

config

Specifies whether the destination operand reuses the source operand outside the specified computation range. The WelfordUpdateConfig type is defined as follows:

struct WelfordUpdateConfig {
    bool isInplace = false; // Whether the destination operand reuses the source operand.
};

isInplace: The abComputeLength parameter under para in the API parameters specifies the computation length along the inner axis of the input data. isInplace determines what the output holds beyond that length, that is, whether the destination operand reuses the source operand there. If it does, the source operand values at the corresponding positions are written to the destination operand beyond the computation length. If it does not, this API writes nothing to the destination operand beyond the computation length.
- false (default value): The destination operand does not reuse the source operand.
- true: The destination operand reuses the source operand. outputMean reuses inputMean, and outputVariance reuses inputVariance.

A configuration example is as follows:

constexpr WelfordUpdateConfig WFUPDATE_DEFAULT_CFG = {false};

This parameter is used together with the tiling computation API in the kernel.
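
The effect of isInplace on the elements beyond abComputeLength can be illustrated with a small host-side emulation. This is a sketch of the behavior described above, not the kernel implementation: FillTailRef is a hypothetical name, and the vectors stand in for one row of the source operand (for example inputMean) and the destination operand (for example outputMean).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates only the tail handling, i.e. elements [computeLength, size).
// The first computeLength elements are produced by the update itself
// and are not touched here.
inline void FillTailRef(std::vector<float>& output,
                        const std::vector<float>& input,
                        std::size_t computeLength,
                        bool isInplace) {
    assert(output.size() == input.size());
    assert(computeLength <= input.size());
    if (!isInplace) {
        return;  // false: nothing is written beyond the computation length
    }
    for (std::size_t j = computeLength; j < input.size(); ++j) {
        output[j] = input[j];  // true: the destination takes the source value
    }
}
```

With isInplace set to false, whatever the destination buffer held beyond the computation length stays as it was, so downstream code should not read those positions.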

Table 2 API parameters

Parameter

Input/Output

Description

outputMean

Output

Destination operand of the mean, corresponding to Meant_i in the API formula.