Normalize

Applicability

Product

Supported

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product's AI Core

Atlas inference product's Vector Core

x

Atlas training products

x

Function

Computes the reciprocal rstd of the standard deviation of the input data with a shape of [A, R] and y based on the known mean and variance in LayerNorm. The formulas are as follows:

E and Var respectively represent the mean and variance of an input on the R axis. γ is the scaling coefficient, β is the translation coefficient, and ε is the weight coefficient for preventing division by zero.

Prototype

  • Pass to the temporary space through the sharedTmpBuffer input parameter.
    1
    2
    template < typename U, typename T, bool isReuseSource = false, const NormalizeConfig& config = NLCFG_NORM>
    __aicore__ inline void Normalize(const LocalTensor<T>& output, const LocalTensor<float>& outputRstd, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<T>& inputX, const LocalTensor<U>& gamma, const LocalTensor<U>& beta, const LocalTensor<uint8_t>& sharedTmpBuffer, const float epsilon, const NormalizePara& para)
    
  • Allocate the temporary space through the API framework.
    1
    2
    template < typename U, typename T, bool isReuseSource = false, const NormalizeConfig& config = NLCFG_NORM>
    __aicore__ inline void Normalize(const LocalTensor<T>& output, const LocalTensor<float>& outputRstd, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<T>& inputX, const LocalTensor<U>& gamma, const LocalTensor<U>& beta, const float epsilon, const NormalizePara& para)
    

Due to the complex computation involved in the internal implementation of this API, extra temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.

  • When the API framework is used for temporary space allocation, you do not need to allocate the space, but must reserve the required size for the temporary space.
  • When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

If the API framework is used, developers must reserve the temporary space. If sharedTmpBuffer is used, developers must allocate space for the tensor. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required maximum and minimum temporary space sizes using the GetNormalizeMaxMinTmpSize API provided in Normalize Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.

Parameters

Table 1 Template parameters

Parameter

Description

U

Data type of the beta and gamma operands.

For the Atlas A3 training products/Atlas A3 inference products, the supported data types are half and float.

For the Atlas A2 training products/Atlas A2 inference products, the supported data types are half and float.

For the Atlas inference product's AI Core, the supported data types are half and float.

T

Data type of the output and inputX operands.

For the Atlas A3 training products/Atlas A3 inference products, the supported data types are half and float.

For the Atlas A2 training products/Atlas A2 inference products, the supported data types are half and float.

For the Atlas inference product's AI Core, the supported data types are half and float.

isReuseSource

This parameter is reserved. Pass the default value false.

config

A parameter used to configure the input and output information of the Normalize API. The NormalizeConfig type is defined as follows:

1
2
3
4
5
6
7
struct NormalizeConfig {
    ReducePattern reducePattern = ReducePattern::AR;
    int32_t aLength = -1;
    bool isNoBeta = false;
    bool isNoGamma = false;
    bool isOnlyOutput = false;
};
  • reducePattern: Currently, only the ReducePattern::AR mode is supported, indicating that the input inner R axis is the Reduce axis.
  • aLength: Size of the input axis A. The following values are supported:
    • –1: Default value. The aLength value of API parameter para is used as the A axis size.
    • Other values: The value must be the same as the aLength value of API parameter para.
  • isNoBeta: Whether to use beta in computation.
    • false: Default value. The input beta is used in the Normalize computation.
    • true: The input beta is not used in the Normalize computation. In this case, computation related to beta in the formula is omitted.
  • isNoGamma: Whether the optional input gamma is used.
    • false: Default value. The optional input gamma is used in Normalize computation.
    • true:The input gamma is not used in the Normalize computation. In this case, computation related to gamma in the formula is omitted.
  • isOnlyOutput: indicates whether to output only y but not the reciprocal rstd of the standard deviation. Currently, this parameter can only be set to false, indicating that all y and rstd results are output.
Table 2 API parameters

Parameter

Input/Output

Description

output

Output

Destination operand, with a shape of [A, R]. For details about the definition of the LocalTensor data structure, see LocalTensor.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

outputRstd

Output

Reciprocal of the standard deviation, with a shape of [A]. For details about the definition of the LocalTensor data structure, see LocalTensor.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

inputMean

Input

Mean, with a shape of [A]. For details about the definition of the LocalTensor data structure, see LocalTensor.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

inputVariance

Input

Variance, with a shape of [A]. For details about the definition of the LocalTensor data structure, see LocalTensor.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

inputX

Input

Source operand, with a shape of [A, R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of inputX must be the same as that of the destination operand, and the last axis length must be 32-byte aligned.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

gamma

Input

Scaling coefficient, with a shape of [R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type precision of gamma must be greater than or equal to that of the source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

beta

Input

Translation coefficient, with a shape of [R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type precision of beta must be greater than or equal to that of the source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

sharedTmpBuffer

Input

Shared buffer, which is used to store temporary data generated during internal API computation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see Normalize Tiling.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

epsilon

Input

Weight coefficient for preventing division by zero.

para

Input

Parameter information required for the Normalize computation. The NormalizePara type is defined as follows:

1
2
3
4
5
struct NormalizePara {
    uint32_t aLength;
    uint32_t rLength;
    uint32_t rLengthWithPadding;
};
  • aLength: Specifies the length of inputX along the A axis.
  • rLength: specifies the length of inputX along the R axis.
  • rLengthWithPadding: specifies the 32-byte aligned length of inputX along the R axis.

Returns

None

Restrictions

  • For details about the operand address alignment requirements, see General Address Alignment Restrictions.
  • The data type precision of the scale coefficient gamma and translation coefficient beta must be greater than or equal to that of the source operand inputX. For example, if the data type of inputX is half, the data types of gamma and beta can be half or float, and their data type precision is not lower than that of inputX.
  • The tensor space of src and dst cannot be reused.
  • The input must be in ND format.
  • The R axis cannot be split.

Example

For more examples, see normalize operator sample.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
#include "kernel_operator.h"
constexpr int32_t BUFFER_NUM = 1;  // tensor num for each queue

template <const AscendC::NormalizeConfig& CONFIG>
class KernelNormalize {
 public:
  __aicore__ inline KernelNormalize() {}

  __aicore__ inline void Init(GM_ADDR x, GM_ADDR mean, GM_ADDR variance, GM_ADDR gamma, GM_ADDR beta, GM_ADDR rstd, GM_ADDR y, const float epsilon, const AscendC::NormalizePara& para) {
    this->meanRstdSize = (para.aLength + 7) / 8 * 8;  // Perform 32-byte alignment processing at this stage.
    // get start index for current core, core parallel
    xGm.SetGlobalBuffer((__gm__ DTYPE_X*)x, para.aLength * para.rLengthWithPadding);
    meanGm.SetGlobalBuffer((__gm__ float*)mean, this->meanRstdSize);
    varianceGm.SetGlobalBuffer((__gm__ float*)variance, this->meanRstdSize);
    gammaGm.SetGlobalBuffer((__gm__ DTYPE_GAMMA*)gamma, para.rLengthWithPadding);
    betaGm.SetGlobalBuffer((__gm__ DTYPE_BETA*)beta, para.rLengthWithPadding);

    rstdGm.SetGlobalBuffer((__gm__ float*)rstd, this->meanRstdSize);
    yGm.SetGlobalBuffer((__gm__ DTYPE_Y*)y, para.aLength * para.rLengthWithPadding);

    // pipe alloc memory to queue, the unit is Bytes
    pipe.InitBuffer(inQueueX, BUFFER_NUM, para.aLength * para.rLengthWithPadding * sizeof(DTYPE_X));
    pipe.InitBuffer(inQueueMean, BUFFER_NUM, this->meanRstdSize * sizeof(float));
    pipe.InitBuffer(inQueueVariance, BUFFER_NUM, this->meanRstdSize * sizeof(float));
    pipe.InitBuffer(inQueueGamma, BUFFER_NUM, para.rLengthWithPadding * sizeof(DTYPE_GAMMA));
    pipe.InitBuffer(inQueueBeta, BUFFER_NUM, para.rLengthWithPadding * sizeof(DTYPE_BETA));

    pipe.InitBuffer(outQueueRstd, BUFFER_NUM, this->meanRstdSize * sizeof(float));
    pipe.InitBuffer(outQueueY, BUFFER_NUM, para.aLength * para.rLengthWithPadding * sizeof(DTYPE_Y));

    this->epsilon = epsilon;
    this->para = para;
  }

  __aicore__ inline void Compute() {
    AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.DeQue<DTYPE_X>();
    AscendC::LocalTensor<float> meanLocal = inQueueMean.DeQue<float>();
    AscendC::LocalTensor<float> varianceLocal = inQueueVariance.DeQue<float>();
    AscendC::LocalTensor<DTYPE_GAMMA> gammaLocal = inQueueGamma.DeQue<DTYPE_GAMMA>();
    AscendC::LocalTensor<DTYPE_BETA> betaLocal = inQueueBeta.DeQue<DTYPE_BETA>();

    AscendC::LocalTensor<float> rstdLocal = outQueueRstd.AllocTensor<float>();
    AscendC::LocalTensor<DTYPE_Y> yLocal = outQueueY.AllocTensor<DTYPE_Y>();

    AscendC::Duplicate(rstdLocal, (float)0, this->meanRstdSize);
    AscendC::Duplicate(yLocal, (DTYPE_Y)0, para.aLength * para.rLengthWithPadding);

    AscendC::Normalize<DTYPE_Y, DTYPE_X, false, CONFIG>(yLocal, rstdLocal, meanLocal, varianceLocal, xLocal, gammaLocal, betaLocal, epsilon, para);

    outQueueRstd.EnQue<float>(rstdLocal);
    outQueueY.EnQue<DTYPE_Y>(yLocal);
    inQueueX.FreeTensor(xLocal);
    inQueueMean.FreeTensor(meanLocal);
    inQueueVariance.FreeTensor(varianceLocal);
    inQueueGamma.FreeTensor(gammaLocal);
    inQueueBeta.FreeTensor(betaLocal);

  }
  __aicore__ inline void Process() {
    CopyIn();
    Compute();
    CopyOut();
  }

 private:
  __aicore__ inline void CopyIn() {
    // alloc tensor from queue memory
    AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.AllocTensor<DTYPE_X>();
    AscendC::LocalTensor<float> meanLocal = inQueueMean.AllocTensor<float>();
    AscendC::LocalTensor<float> varianceLocal = inQueueVariance.AllocTensor<float>();
    AscendC::LocalTensor<DTYPE_GAMMA> gammaLocal = inQueueGamma.AllocTensor<DTYPE_GAMMA>();
    AscendC::LocalTensor<DTYPE_BETA> betaLocal = inQueueBeta.AllocTensor<DTYPE_BETA>();
    // copy progress_th tile from global tensor to local tensor
    AscendC::DataCopy(xLocal, xGm, para.aLength * para.rLengthWithPadding);
    AscendC::DataCopy(meanLocal, meanGm, this->meanRstdSize);
    AscendC::DataCopy(varianceLocal, varianceGm, this->meanRstdSize);
    AscendC::DataCopy(gammaLocal, gammaGm, para.rLengthWithPadding);
    AscendC::DataCopy(betaLocal, betaGm, para.rLengthWithPadding);

    // enque input tensors to VECIN queue
    inQueueX.EnQue(xLocal);
    inQueueMean.EnQue(meanLocal);
    inQueueVariance.EnQue(varianceLocal);
    inQueueGamma.EnQue(gammaLocal);
    inQueueBeta.EnQue(betaLocal);
  }

  __aicore__ inline void CopyOut() {
    // deque output tensor from VECOUT queue
    AscendC::LocalTensor<float> rstdLocal = outQueueRstd.DeQue<float>();
    AscendC::LocalTensor<DTYPE_Y> yLocal = outQueueY.DeQue<DTYPE_Y>();
    // copy progress_th tile from local tensor to global tensor
    AscendC::DataCopy(rstdGm, rstdLocal, this->meanRstdSize);
    AscendC::DataCopy(yGm, yLocal, para.aLength * para.rLengthWithPadding);
    // free output tensor for reuse
    outQueueRstd.FreeTensor(rstdLocal);
    outQueueY.FreeTensor(yLocal);
  }

 private:
  AscendC::TPipe pipe;
  // create queues for input, in this case depth is equal to buffer num
  AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM> inQueueX;
  AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM> inQueueMean;
  AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM> inQueueVariance;
  AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM> inQueueGamma;
  AscendC::TQue<AscendC::TPosition::VECIN, BUFFER_NUM> inQueueBeta;
  // create queue for output, in this case depth is equal to buffer num
  AscendC::TQue<AscendC::TPosition::VECOUT, BUFFER_NUM> outQueueRstd;
  AscendC::TQue<AscendC::TPosition::VECOUT, BUFFER_NUM> outQueueY;

  AscendC::GlobalTensor<float> meanGm;
  AscendC::GlobalTensor<float> varianceGm;
  AscendC::GlobalTensor<DTYPE_X> xGm;
  AscendC::GlobalTensor<DTYPE_GAMMA> gammaGm;
  AscendC::GlobalTensor<DTYPE_BETA> betaGm;

  AscendC::GlobalTensor<float> rstdGm;
  AscendC::GlobalTensor<DTYPE_Y> yGm;

  float epsilon;
  uint32_t meanRstdSize;
  AscendC::NormalizePara para;
};
__aicore__ constexpr AscendC::NormalizeConfig GenConfig(bool isNoBeta, bool isNoGamma)
{
    return {.reducePattern = AscendC::ReducePattern::AR,
        .aLength = -1,
        .isNoBeta = isNoBeta,
        .isNoGamma = isNoGamma,
        .isOnlyOutput = false};
}
// with beta and gamma
constexpr AscendC::NormalizeConfig CONFIG1 = GenConfig(false, false);
constexpr AscendC::NormalizeConfig CONFIG2 = GenConfig(false, true);
constexpr AscendC::NormalizeConfig CONFIG3 = GenConfig(true, false);
constexpr AscendC::NormalizeConfig CONFIG4 = GenConfig(true, true);

extern "C" __global__ __aicore__ void normalize_custom(GM_ADDR x, GM_ADDR mean, GM_ADDR variance, GM_ADDR gamma, GM_ADDR beta, GM_ADDR rstd, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling) {
    GET_TILING_DATA(tilingData, tiling);
    float epsilon = tilingData.epsilon;
    AscendC::NormalizePara para(tilingData.aLength, tilingData.rLength, tilingData.rLengthWithPadding);
    if (TILING_KEY_IS(1)) {
      if (!tilingData.isNoBeta && !tilingData.isNoGamma) {
          KernelNormalize<CONFIG1> op;
          op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
          op.Process();
      } else if (!tilingData.isNoBeta && tilingData.isNoGamma) {
          KernelNormalize<CONFIG2> op;
          op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
          op.Process();
      } else if (tilingData.isNoBeta && !tilingData.isNoGamma) {
          KernelNormalize<CONFIG3> op;
          op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
          op.Process();
      } else if (tilingData.isNoBeta && tilingData.isNoGamma) {
          KernelNormalize<CONFIG4> op;
          op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
          op.Process();
      }
    }
  }