Normalize

Function Usage

Computes, for input data of shape [A, R] and based on the mean and variance already known in LayerNorm, the normalized result y and the reciprocal rstd of the standard deviation. The formulas are as follows:

y = γ · (x − E[x]) / √(Var[x] + ε) + β
rstd = 1 / √(Var[x] + ε)

E[x] and Var[x] respectively represent the mean and variance of an input on the R axis. γ is the scaling coefficient, β is the translation coefficient, and ε is the weight coefficient for preventing division by zero.

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.

    template <typename U, typename T, bool isReuseSource = false, const NormalizeConfig& config = NLCFG_NORM>
    __aicore__ inline void Normalize(const LocalTensor<T>& output, const LocalTensor<float>& outputRstd, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<T>& inputX, const LocalTensor<U>& gamma, const LocalTensor<U>& beta, const LocalTensor<uint8_t>& sharedTmpBuffer, const float epsilon, const NormalizePara& para)

  • Allocate the temporary space through the API framework.

    template <typename U, typename T, bool isReuseSource = false, const NormalizeConfig& config = NLCFG_NORM>
    __aicore__ inline void Normalize(const LocalTensor<T>& output, const LocalTensor<float>& outputRstd, const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance, const LocalTensor<T>& inputX, const LocalTensor<U>& gamma, const LocalTensor<U>& beta, const float epsilon, const NormalizePara& para)

Due to the complex computation performed inside this API, additional temporary space is required to store intermediate variables. The temporary space can either be allocated by the API framework or be passed in by developers through the sharedTmpBuffer input parameter.

  • When the API framework allocates the temporary space, developers do not need to allocate it themselves, but must reserve the required size for it.
  • When the temporary space is passed in through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and no framework allocation is needed. This lets developers manage the sharedTmpBuffer space themselves and reuse the buffer across API calls, so that it is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.

In both cases the space must be sized by the developer: reserved when the API framework is used, or allocated for the tensor when sharedTmpBuffer is used. To obtain the temporary space size (BufferSize), use the GetNormalizeMaxMinTmpSize API provided in Normalize Tiling, which returns the maximum and minimum required sizes: the minimum size guarantees correct functionality, while the maximum size improves performance. A kernel-side sketch of the sharedTmpBuffer variant is shown below.
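The following is a minimal sketch, not part of the example at the end of this section: TMP_BUFFER_SIZE is a placeholder for the value obtained from GetNormalizeMaxMinTmpSize, and the tensor and template names (pipe, yLocal, CONFIG, and so on) are assumed to be prepared as in that example.

// Minimal sketch: caller-managed temporary space for Normalize.
// TMP_BUFFER_SIZE is a placeholder; obtain the real value from the
// GetNormalizeMaxMinTmpSize API (see Normalize Tiling) and pass it in via tiling data.
AscendC::TBuf<AscendC::TPosition::VECCALC> calcBuf;
pipe.InitBuffer(calcBuf, TMP_BUFFER_SIZE);  // reserve once, e.g. in Init()
AscendC::LocalTensor<uint8_t> sharedTmpBuffer = calcBuf.Get<uint8_t>();
// The buffer can be reused across repeated Normalize calls without being
// re-allocated and deallocated each time.
AscendC::Normalize<DTYPE_Y, DTYPE_X, false, CONFIG>(yLocal, rstdLocal, meanLocal,
    varianceLocal, xLocal, gammaLocal, betaLocal, sharedTmpBuffer, epsilon, para);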

Table 1 Parameters in the template

  • U: Data type of the beta and gamma operands.
  • T: Data type of the inputX operand.
  • isReuseSource: A reserved parameter with a default value of false.
  • config: Configures the input and output information of the Normalize API. The NormalizeConfig type is defined as follows (a usage sketch follows the field descriptions):

struct NormalizeConfig {
    ReducePattern reducePattern = ReducePattern::AR;
    int32_t aLength = -1;
    bool isNoBeta = false;
    bool isNoGamma = false;
    bool isOnlyOutput = false;
};
  • reducePattern: Reduce mode. Currently, only ReducePattern::AR is supported, indicating that the inner R axis of the input is the reduce axis.
  • aLength: Size of the A axis of the input. The following values are supported:
    • -1: Default value. The value of aLength in the API parameter para is used as the A-axis size.
    • Other values: The value must be the same as that of aLength in the API parameter para.
  • isNoBeta: Whether to omit beta from the computation.
    • false: Default value. The input beta is used in the Normalize computation.
    • true: The input beta is not used in the Normalize computation; the beta-related part of the formula is omitted.
  • isNoGamma: Whether to omit the optional input gamma from the computation.
    • false: Default value. The optional input gamma is used in the Normalize computation.
    • true: The input gamma is not used in the Normalize computation; the gamma-related part of the formula is omitted.
  • isOnlyOutput: Whether to output only y without the reciprocal rstd of the standard deviation. Currently, this parameter can only be set to false, meaning that both y and rstd are output.
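For instance, a configuration that omits both beta and gamma can be declared at compile time and bound through the template parameter list. This is a sketch only; NORM_NO_AFFINE is an illustrative name, and it mirrors the GenConfig helper used in the example at the end of this section.

// Sketch: a compile-time configuration that skips both beta and gamma.
// The object must be constexpr so that it can be used as a template argument.
constexpr AscendC::NormalizeConfig NORM_NO_AFFINE = {
    AscendC::ReducePattern::AR,  // reducePattern: only AR is currently supported
    -1,                          // aLength: take the A-axis size from para.aLength
    true,                        // isNoBeta: omit the beta term of the formula
    true,                        // isNoGamma: omit the gamma term of the formula
    false};                      // isOnlyOutput: must be false (y and rstd are both output)
// U = data type of gamma/beta, T = data type of inputX.
AscendC::Normalize<DTYPE_GAMMA, DTYPE_X, false, NORM_NO_AFFINE>(yLocal, rstdLocal,
    meanLocal, varianceLocal, xLocal, gammaLocal, betaLocal, epsilon, para);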
Table 2 API parameters

  • output (Output): Destination operand, with type LocalTensor and shape [A, R]. For details about the definition of the LocalTensor data structure, see LocalTensor.
  • outputRstd (Output): Reciprocal of the standard deviation, with type LocalTensor and shape [A]. For details about the definition of the LocalTensor data structure, see LocalTensor.
  • inputMean (Input): Mean, with type LocalTensor and shape [A]. For details about the definition of the LocalTensor data structure, see LocalTensor.
  • inputVariance (Input): Variance, with type LocalTensor and shape [A]. For details about the definition of the LocalTensor data structure, see LocalTensor.
  • inputX (Input): Source operand, with type LocalTensor and shape [A, R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of inputX must be the same as that of the destination operand, and the last-axis length must be 32-byte aligned.
  • gamma (Input): Scaling coefficient, with type LocalTensor and shape [R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type precision of gamma must be greater than or equal to that of the source operand.
  • beta (Input): Translation coefficient, with type LocalTensor and shape [R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type precision of beta must be greater than or equal to that of the source operand.
  • sharedTmpBuffer (Input): Shared buffer for storing temporary data generated during internal API computation. Passing it in lets developers manage the space and reuse the buffer across API calls, so that it is not repeatedly allocated and deallocated, improving flexibility and buffer utilization. For details about how to obtain its size, see Normalize Tiling. The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT.
  • epsilon (Input): Weight coefficient for preventing division by zero.
  • para (Input): Parameter information required for the Normalize computation. The NormalizePara type is defined as follows (a worked sizing example follows the field descriptions):

struct NormalizePara {
    uint32_t aLength;
    uint32_t rLength;
    uint32_t rLengthWithPadding;
};
  • aLength: Specifies the length of inputX along the A axis.
  • rLength: Specifies the length of inputX along the R axis.
  • rLengthWithPadding: Specifies the 32-byte aligned length of inputX along the R axis.
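As a worked sizing example with hypothetical values: for inputX of shape [16, 100] and half-precision elements, a 32-byte block holds 16 half elements, so the R axis is padded from 100 to 112 elements.

// Hypothetical shape: A = 16, R = 100, input data type half (2 bytes per element).
constexpr uint32_t ELEMS_PER_BLOCK = 32 / sizeof(half);  // 16 half elements per 32-byte block
AscendC::NormalizePara para = {16, 100,
    (100 + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK * ELEMS_PER_BLOCK};  // rLengthWithPadding = 112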

Returns

None

Availability

Precautions

None

Constraints

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • The data type precision of the scaling coefficient gamma and translation coefficient beta must be greater than or equal to that of the source operand inputX.
  • The tensor space of the source operand and the destination operand cannot be reused; in-place computation is not supported.
  • The input must be in ND format.
  • The R axis cannot be split.

Example

#include "kernel_operator.h"
constexpr int32_t BUFFER_NUM = 1;  // tensor num for each queue

template <const AscendC::NormalizeConfig& CONFIG>
class KernelNormalize {
 public:
  __aicore__ inline KernelNormalize() {}

  __aicore__ inline void Init(GM_ADDR x, GM_ADDR mean, GM_ADDR variance, GM_ADDR gamma, GM_ADDR beta, GM_ADDR rstd, GM_ADDR y, const float epsilon, const AscendC::NormalizePara& para) {
    this->meanRstdSize = (para.aLength + 7) / 8 * 8;  // round up to a multiple of 8 floats (32-byte alignment)
    // get start index for current core, core parallel
    xGm.SetGlobalBuffer((__gm__ DTYPE_X*)x, para.aLength * para.rLengthWithPadding);
    meanGm.SetGlobalBuffer((__gm__ float*)mean, this->meanRstdSize);
    varianceGm.SetGlobalBuffer((__gm__ float*)variance, this->meanRstdSize);
    gammaGm.SetGlobalBuffer((__gm__ DTYPE_GAMMA*)gamma, para.rLengthWithPadding);
    betaGm.SetGlobalBuffer((__gm__ DTYPE_BETA*)beta, para.rLengthWithPadding);

    rstdGm.SetGlobalBuffer((__gm__ float*)rstd, this->meanRstdSize);
    yGm.SetGlobalBuffer((__gm__ DTYPE_Y*)y, para.aLength * para.rLengthWithPadding);

    // pipe alloc memory to queue, the unit is Bytes
    pipe.InitBuffer(inQueueX, BUFFER_NUM, para.aLength * para.rLengthWithPadding * sizeof(DTYPE_X));
    pipe.InitBuffer(inQueueMean, BUFFER_NUM, this->meanRstdSize * sizeof(float));
    pipe.InitBuffer(inQueueVariance, BUFFER_NUM, this->meanRstdSize * sizeof(float));
    pipe.InitBuffer(inQueueGamma, BUFFER_NUM, para.rLengthWithPadding * sizeof(DTYPE_GAMMA));
    pipe.InitBuffer(inQueueBeta, BUFFER_NUM, para.rLengthWithPadding * sizeof(DTYPE_BETA));

    pipe.InitBuffer(outQueueRstd, BUFFER_NUM, this->meanRstdSize * sizeof(float));
    pipe.InitBuffer(outQueueY, BUFFER_NUM, para.aLength * para.rLengthWithPadding * sizeof(DTYPE_Y));

    this->epsilon = epsilon;
    this->para = para;
  }

  __aicore__ inline void Compute() {
    AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.DeQue<DTYPE_X>();
    AscendC::LocalTensor<float> meanLocal = inQueueMean.DeQue<float>();
    AscendC::LocalTensor<float> varianceLocal = inQueueVariance.DeQue<float>();
    AscendC::LocalTensor<DTYPE_GAMMA> gammaLocal = inQueueGamma.DeQue<DTYPE_GAMMA>();
    AscendC::LocalTensor<DTYPE_BETA> betaLocal = inQueueBeta.DeQue<DTYPE_BETA>();

    AscendC::LocalTensor<float> rstdLocal = outQueueRstd.AllocTensor<float>();
    AscendC::LocalTensor<DTYPE_Y> yLocal = outQueueY.AllocTensor<DTYPE_Y>();

    // Zero-fill the outputs so that padded elements hold deterministic values.
    AscendC::Duplicate(rstdLocal, (float)0, this->meanRstdSize);
    AscendC::Duplicate(yLocal, (DTYPE_Y)0, para.aLength * para.rLengthWithPadding);

    // Compute y and rstd; the temporary space is allocated by the API framework
    // because the sharedTmpBuffer overload is not used here.
    AscendC::Normalize<DTYPE_Y, DTYPE_X, false, CONFIG>(yLocal, rstdLocal, meanLocal, varianceLocal, xLocal, gammaLocal, betaLocal, epsilon, para);

    outQueueRstd.EnQue<float>(rstdLocal);
    outQueueY.EnQue<DTYPE_Y>(yLocal);
    inQueueX.FreeTensor(xLocal);
    inQueueMean.FreeTensor(meanLocal);
    inQueueVariance.FreeTensor(varianceLocal);
    inQueueGamma.FreeTensor(gammaLocal);
    inQueueBeta.FreeTensor(betaLocal);
  }
  __aicore__ inline void Process() {
    CopyIn();
    Compute();
    CopyOut();
  }

 private:
  __aicore__ inline void CopyIn() {
    // alloc tensor from queue memory
    AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.AllocTensor<DTYPE_X>();
    AscendC::LocalTensor<float> meanLocal = inQueueMean.AllocTensor<float>();
    AscendC::LocalTensor<float> varianceLocal = inQueueVariance.AllocTensor<float>();
    AscendC::LocalTensor<DTYPE_GAMMA> gammaLocal = inQueueGamma.AllocTensor<DTYPE_GAMMA>();
    AscendC::LocalTensor<DTYPE_BETA> betaLocal = inQueueBeta.AllocTensor<DTYPE_BETA>();
    // copy data from global memory to local tensors
    AscendC::DataCopy(xLocal, xGm, para.aLength * para.rLengthWithPadding);
    AscendC::DataCopy(meanLocal, meanGm, this->meanRstdSize);
    AscendC::DataCopy(varianceLocal, varianceGm, this->meanRstdSize);
    AscendC::DataCopy(gammaLocal, gammaGm, para.rLengthWithPadding);
    AscendC::DataCopy(betaLocal, betaGm, para.rLengthWithPadding);

    // enque input tensors to VECIN queue
    inQueueX.EnQue(xLocal);
    inQueueMean.EnQue(meanLocal);
    inQueueVariance.EnQue(varianceLocal);
    inQueueGamma.EnQue(gammaLocal);
    inQueueBeta.EnQue(betaLocal);
  }

  __aicore__ inline void CopyOut() {
    // deque output tensor from VECOUT queue
    AscendC::LocalTensor<float> rstdLocal = outQueueRstd.DeQue<float>();
    AscendC::LocalTensor<DTYPE_Y> yLocal = outQueueY.DeQue<DTYPE_Y>();
    // copy results from local tensors to global memory
    AscendC::DataCopy(rstdGm, rstdLocal, this->meanRstdSize);
    AscendC::DataCopy(yGm, yLocal, para.aLength * para.rLengthWithPadding);
    // free output tensor for reuse
    outQueueRstd.FreeTensor(rstdLocal);
    outQueueY.FreeTensor(yLocal);
  }

 private:
  AscendC::TPipe pipe;
  // create queues for input, in this case depth is equal to buffer num
  AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX;
  AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueMean;
  AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueVariance;
  AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueGamma;
  AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueBeta;
  // create queue for output, in this case depth is equal to buffer num
  AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueRstd;
  AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueY;

  AscendC::GlobalTensor<float> meanGm;
  AscendC::GlobalTensor<float> varianceGm;
  AscendC::GlobalTensor<DTYPE_X> xGm;
  AscendC::GlobalTensor<DTYPE_GAMMA> gammaGm;
  AscendC::GlobalTensor<DTYPE_BETA> betaGm;

  AscendC::GlobalTensor<float> rstdGm;
  AscendC::GlobalTensor<DTYPE_Y> yGm;

  float epsilon;
  uint32_t meanRstdSize;
  AscendC::NormalizePara para;
};
__aicore__ constexpr AscendC::NormalizeConfig GenConfig(bool isNoBeta, bool isNoGamma)
{
    return {.reducePattern = AscendC::ReducePattern::AR,
        .aLength = -1,
        .isNoBeta = isNoBeta,
        .isNoGamma = isNoGamma,
        .isOnlyOutput = false};
}
// with beta and gamma
constexpr AscendC::NormalizeConfig CONFIG1 = GenConfig(false, false);
constexpr AscendC::NormalizeConfig CONFIG2 = GenConfig(false, true);
constexpr AscendC::NormalizeConfig CONFIG3 = GenConfig(true, false);
constexpr AscendC::NormalizeConfig CONFIG4 = GenConfig(true, true);

extern "C" __global__ __aicore__ void normalize_custom(GM_ADDR x, GM_ADDR mean, GM_ADDR variance, GM_ADDR gamma, GM_ADDR beta, GM_ADDR rstd, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling) {
    GET_TILING_DATA(tilingData, tiling);
    float epsilon = tilingData.epsilon;
    AscendC::NormalizePara para = {tilingData.aLength, tilingData.rLength, tilingData.rLengthWithPadding};
    if (TILING_KEY_IS(1)) {
      if (!tilingData.isNoBeta && !tilingData.isNoGamma) {
          KernelNormalize<CONFIG1> op;
          op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
          op.Process();
      } else if (!tilingData.isNoBeta && tilingData.isNoGamma) {
          KernelNormalize<CONFIG2> op;
          op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
          op.Process();
      } else if (tilingData.isNoBeta && !tilingData.isNoGamma) {
          KernelNormalize<CONFIG3> op;
          op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
          op.Process();
      } else if (tilingData.isNoBeta && tilingData.isNoGamma) {
          KernelNormalize<CONFIG4> op;
          op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
          op.Process();
      }
    }
}