Normalize
Function Usage
Based on the known mean and variance in LayerNorm, computes the reciprocal of the standard deviation, rstd, and the normalized output y for input data of shape [A, R]. The formulas are as follows:

$$rstd = \frac{1}{\sqrt{Var(x) + \varepsilon}}$$

$$y = (x - E(x)) \cdot rstd \cdot \gamma + \beta$$

E(x) and Var(x) respectively represent the mean and variance of an input on the R axis. γ is the scaling coefficient, β is the translation coefficient, and ε is a small constant added to the variance to prevent division by zero.
Prototype
- Pass the temporary space through the sharedTmpBuffer input parameter.

```cpp
template <typename U, typename T, bool isReuseSource = false, const NormalizeConfig& config = NLCFG_NORM>
__aicore__ inline void Normalize(const LocalTensor<T>& output, const LocalTensor<float>& outputRstd,
    const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance,
    const LocalTensor<T>& inputX, const LocalTensor<U>& gamma, const LocalTensor<U>& beta,
    const LocalTensor<uint8_t>& sharedTmpBuffer, const float epsilon, const NormalizePara& para)
```

- Allocate the temporary space through the API framework.

```cpp
template <typename U, typename T, bool isReuseSource = false, const NormalizeConfig& config = NLCFG_NORM>
__aicore__ inline void Normalize(const LocalTensor<T>& output, const LocalTensor<float>& outputRstd,
    const LocalTensor<float>& inputMean, const LocalTensor<float>& inputVariance,
    const LocalTensor<T>& inputX, const LocalTensor<U>& gamma, const LocalTensor<U>& beta,
    const float epsilon, const NormalizePara& para)
```
Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated by the API framework or passed in by developers through the sharedTmpBuffer input parameter.
- When the API framework allocates the temporary space, developers do not need to allocate it explicitly, but must reserve enough space for it.
- When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework performs no allocation. This lets developers manage the sharedTmpBuffer space themselves and reuse the buffer after the API call, so that the buffer is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.
If the API framework is used, developers must reserve the temporary space. If sharedTmpBuffer is used, developers must allocate space for the tensor. The temporary space size (BufferSize) is obtained as follows: query the required maximum and minimum sizes using the GetNormalizeMaxMinTmpSize API provided in Normalize Tiling. The minimum size guarantees correct functionality, while the maximum size improves performance.
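As an illustration of the sharedTmpBuffer path, the following kernel-side sketch reserves a TBuf once and passes it to the first overload above. The buffer size, tmpBufferSize, is a hypothetical value assumed to come from GetNormalizeMaxMinTmpSize at tiling time; the tensor names match the full example at the end of this section.

```cpp
// Sketch (assumed context): reusing one self-managed temporary buffer across calls.
// tmpBufferSize is a hypothetical tiling field holding a size between the min and
// max values reported by GetNormalizeMaxMinTmpSize.
AscendC::TBuf<AscendC::TPosition::VECCALC> tmpBuf;            // declared as a class member
pipe.InitBuffer(tmpBuf, tmpBufferSize);                       // reserved once in Init()
AscendC::LocalTensor<uint8_t> sharedTmpBuffer = tmpBuf.Get<uint8_t>();
// Identical to the call in the example below, plus the sharedTmpBuffer argument.
AscendC::Normalize<DTYPE_Y, DTYPE_X, false, CONFIG>(yLocal, rstdLocal, meanLocal, varianceLocal,
                                                    xLocal, gammaLocal, betaLocal, sharedTmpBuffer,
                                                    epsilon, para);
```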
| Parameter | Description |
| --- | --- |
| U | Data type of the beta and gamma operands. |
| T | Data type of the inputX operand. |
| isReuseSource | Reserved parameter. The default value is false. |
| config | Configures the input and output information of the Normalize API. The NormalizeConfig type is outlined in the sketch after this table. |
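The original definition of NormalizeConfig is not reproduced in this extract. The sketch below is inferred from the GenConfig helper in the example at the end of this section; the field names match the example's designated initializers, while the exact types and the comments on semantics are assumptions.

```cpp
// Reconstructed sketch of NormalizeConfig (field names taken from the example's
// designated initializers; exact types and default semantics are assumptions).
struct NormalizeConfig {
    ReducePattern reducePattern; // layout of the reduction; the example uses ReducePattern::AR
    int32_t aLength;             // A-axis length fixed at compile time; -1 when set at run time (assumption)
    bool isNoBeta;               // true: skip the translation term beta
    bool isNoGamma;              // true: skip the scaling term gamma
    bool isOnlyOutput;           // true: output y only, without rstd (assumption)
};
```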
| Parameter | Input/Output | Meaning |
| --- | --- | --- |
| output | Output | Destination operand, of type LocalTensor, with shape [A, R]. For details about the definition of the LocalTensor data structure, see LocalTensor. |
| outputRstd | Output | Reciprocal of the standard deviation, of type LocalTensor, with shape [A]. For details about the definition of the LocalTensor data structure, see LocalTensor. |
| inputMean | Input | Mean, of type LocalTensor, with shape [A]. For details about the definition of the LocalTensor data structure, see LocalTensor. |
| inputVariance | Input | Variance, of type LocalTensor, with shape [A]. For details about the definition of the LocalTensor data structure, see LocalTensor. |
| inputX | Input | Source operand, of type LocalTensor, with shape [A, R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of inputX must be the same as that of the destination operand, and the last-axis length must be 32-byte aligned. |
| gamma | Input | Scaling coefficient, of type LocalTensor, with shape [R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type precision of gamma must be greater than or equal to that of the source operand. |
| beta | Input | Translation coefficient, of type LocalTensor, with shape [R]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type precision of beta must be greater than or equal to that of the source operand. |
| sharedTmpBuffer | Input | Shared buffer that stores temporary data generated during internal API computation. It lets developers manage the space themselves and reuse the buffer after the API call, so that the buffer is not repeatedly allocated and deallocated, improving flexibility and buffer utilization. For details about how to obtain its size, see Normalize Tiling. The type is LocalTensor; the supported TPosition values are VECIN, VECCALC, and VECOUT. |
| epsilon | Input | Small constant added to the variance to prevent division by zero. |
| para | Input | Parameter information required for Normalize computation. The NormalizePara type is outlined in the sketch after this table. |
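The NormalizePara definition is likewise not reproduced here. The sketch below is inferred from the three-argument constructor call in the example (tilingData.aLength, tilingData.rLength, tilingData.rLengthWithPadding); member names and types are assumptions.

```cpp
// Reconstructed sketch of NormalizePara (inferred from the constructor call
// NormalizePara(aLength, rLength, rLengthWithPadding) in the example below;
// member names and types are assumptions).
struct NormalizePara {
    uint32_t aLength;            // number of rows A (outer axis)
    uint32_t rLength;            // logical length of the R axis
    uint32_t rLengthWithPadding; // R-axis length after padding to 32-byte alignment
};
```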
Returns
None
Availability
Precautions
None
Constraints
- For details about the alignment requirements of the operand address offset, see General Restrictions.
- The data type precision of the scaling coefficient gamma and translation coefficient beta must be greater than or equal to that of the source operand inputX.
- The tensor space of the source operand inputX and the destination operand output cannot be reused; they must be separate buffers.
- The input must be in ND format.
- The R axis cannot be split.
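As a concrete illustration of the last-axis alignment requirement on inputX (see the parameter table above), the following sketch pads the R length up to the next 32-byte boundary; rLength here is the logical R-axis length, and the result corresponds to the rLengthWithPadding value used in the example below.

```cpp
// Sketch: round the last-axis length up to a 32-byte boundary for the element type.
constexpr uint32_t ALIGN_BYTES = 32;
uint32_t elemsPerBlock = ALIGN_BYTES / sizeof(DTYPE_X);  // e.g. 16 elements for half, 8 for float
uint32_t rLengthWithPadding = (rLength + elemsPerBlock - 1) / elemsPerBlock * elemsPerBlock;
```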
Example
```cpp
#include "kernel_operator.h"

constexpr int32_t BUFFER_NUM = 1; // tensor num for each queue

template <const AscendC::NormalizeConfig& CONFIG>
class KernelNormalize {
public:
    __aicore__ inline KernelNormalize() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR mean, GM_ADDR variance, GM_ADDR gamma, GM_ADDR beta,
                                GM_ADDR rstd, GM_ADDR y, const float epsilon, const AscendC::NormalizePara& para)
    {
        this->meanRstdSize = (para.aLength + 7) / 8 * 8; // round aLength up to 8 floats (32-byte alignment)
        // get start index for current core, core parallel
        xGm.SetGlobalBuffer((__gm__ DTYPE_X*)x, para.aLength * para.rLengthWithPadding);
        meanGm.SetGlobalBuffer((__gm__ float*)mean, this->meanRstdSize);
        varianceGm.SetGlobalBuffer((__gm__ float*)variance, this->meanRstdSize);
        gammaGm.SetGlobalBuffer((__gm__ DTYPE_GAMMA*)gamma, para.rLengthWithPadding);
        betaGm.SetGlobalBuffer((__gm__ DTYPE_BETA*)beta, para.rLengthWithPadding);
        rstdGm.SetGlobalBuffer((__gm__ float*)rstd, this->meanRstdSize);
        yGm.SetGlobalBuffer((__gm__ DTYPE_Y*)y, para.aLength * para.rLengthWithPadding);
        // pipe alloc memory to queue, the unit is bytes
        pipe.InitBuffer(inQueueX, BUFFER_NUM, para.aLength * para.rLengthWithPadding * sizeof(DTYPE_X));
        pipe.InitBuffer(inQueueMean, BUFFER_NUM, this->meanRstdSize * sizeof(float));
        pipe.InitBuffer(inQueueVariance, BUFFER_NUM, this->meanRstdSize * sizeof(float));
        pipe.InitBuffer(inQueueGamma, BUFFER_NUM, para.rLengthWithPadding * sizeof(DTYPE_GAMMA));
        pipe.InitBuffer(inQueueBeta, BUFFER_NUM, para.rLengthWithPadding * sizeof(DTYPE_BETA));
        pipe.InitBuffer(outQueueRstd, BUFFER_NUM, this->meanRstdSize * sizeof(float));
        pipe.InitBuffer(outQueueY, BUFFER_NUM, para.aLength * para.rLengthWithPadding * sizeof(DTYPE_Y));
        this->epsilon = epsilon;
        this->para = para;
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.DeQue<DTYPE_X>();
        AscendC::LocalTensor<float> meanLocal = inQueueMean.DeQue<float>();
        AscendC::LocalTensor<float> varianceLocal = inQueueVariance.DeQue<float>();
        AscendC::LocalTensor<DTYPE_GAMMA> gammaLocal = inQueueGamma.DeQue<DTYPE_GAMMA>();
        AscendC::LocalTensor<DTYPE_BETA> betaLocal = inQueueBeta.DeQue<DTYPE_BETA>();
        AscendC::LocalTensor<float> rstdLocal = outQueueRstd.AllocTensor<float>();
        AscendC::LocalTensor<DTYPE_Y> yLocal = outQueueY.AllocTensor<DTYPE_Y>();
        AscendC::Duplicate(rstdLocal, (float)0, this->meanRstdSize);
        AscendC::Duplicate(yLocal, (DTYPE_Y)0, para.aLength * para.rLengthWithPadding);
        AscendC::Normalize<DTYPE_Y, DTYPE_X, false, CONFIG>(yLocal, rstdLocal, meanLocal, varianceLocal,
                                                            xLocal, gammaLocal, betaLocal, epsilon, para);
        outQueueRstd.EnQue<float>(rstdLocal);
        outQueueY.EnQue<DTYPE_Y>(yLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueMean.FreeTensor(meanLocal);
        inQueueVariance.FreeTensor(varianceLocal);
        inQueueGamma.FreeTensor(gammaLocal);
        inQueueBeta.FreeTensor(betaLocal);
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        // alloc tensors from queue memory
        AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.AllocTensor<DTYPE_X>();
        AscendC::LocalTensor<float> meanLocal = inQueueMean.AllocTensor<float>();
        AscendC::LocalTensor<float> varianceLocal = inQueueVariance.AllocTensor<float>();
        AscendC::LocalTensor<DTYPE_GAMMA> gammaLocal = inQueueGamma.AllocTensor<DTYPE_GAMMA>();
        AscendC::LocalTensor<DTYPE_BETA> betaLocal = inQueueBeta.AllocTensor<DTYPE_BETA>();
        // copy data from global tensors to local tensors
        AscendC::DataCopy(xLocal, xGm, para.aLength * para.rLengthWithPadding);
        AscendC::DataCopy(meanLocal, meanGm, this->meanRstdSize);
        AscendC::DataCopy(varianceLocal, varianceGm, this->meanRstdSize);
        AscendC::DataCopy(gammaLocal, gammaGm, para.rLengthWithPadding);
        AscendC::DataCopy(betaLocal, betaGm, para.rLengthWithPadding);
        // enque input tensors to VECIN queues
        inQueueX.EnQue(xLocal);
        inQueueMean.EnQue(meanLocal);
        inQueueVariance.EnQue(varianceLocal);
        inQueueGamma.EnQue(gammaLocal);
        inQueueBeta.EnQue(betaLocal);
    }
    __aicore__ inline void CopyOut()
    {
        // deque output tensors from VECOUT queues
        AscendC::LocalTensor<float> rstdLocal = outQueueRstd.DeQue<float>();
        AscendC::LocalTensor<DTYPE_Y> yLocal = outQueueY.DeQue<DTYPE_Y>();
        // copy results from local tensors to global tensors
        AscendC::DataCopy(rstdGm, rstdLocal, this->meanRstdSize);
        AscendC::DataCopy(yGm, yLocal, para.aLength * para.rLengthWithPadding);
        // free output tensors for reuse
        outQueueRstd.FreeTensor(rstdLocal);
        outQueueY.FreeTensor(yLocal);
    }

private:
    AscendC::TPipe pipe;
    // create queues for input, in this case depth is equal to buffer num
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueMean;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueVariance;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueGamma;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueBeta;
    // create queues for output, in this case depth is equal to buffer num
    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueRstd;
    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueY;
    AscendC::GlobalTensor<float> meanGm;
    AscendC::GlobalTensor<float> varianceGm;
    AscendC::GlobalTensor<DTYPE_X> xGm;
    AscendC::GlobalTensor<DTYPE_GAMMA> gammaGm;
    AscendC::GlobalTensor<DTYPE_BETA> betaGm;
    AscendC::GlobalTensor<float> rstdGm;
    AscendC::GlobalTensor<DTYPE_Y> yGm;
    float epsilon;
    uint32_t meanRstdSize;
    AscendC::NormalizePara para;
};

__aicore__ constexpr AscendC::NormalizeConfig GenConfig(bool isNoBeta, bool isNoGamma)
{
    return {.reducePattern = AscendC::ReducePattern::AR, .aLength = -1,
            .isNoBeta = isNoBeta, .isNoGamma = isNoGamma, .isOnlyOutput = false};
}
constexpr AscendC::NormalizeConfig CONFIG1 = GenConfig(false, false); // with beta and gamma
constexpr AscendC::NormalizeConfig CONFIG2 = GenConfig(false, true);  // with beta, without gamma
constexpr AscendC::NormalizeConfig CONFIG3 = GenConfig(true, false);  // without beta, with gamma
constexpr AscendC::NormalizeConfig CONFIG4 = GenConfig(true, true);   // without beta and gamma

extern "C" __global__ __aicore__ void normalize_custom(GM_ADDR x, GM_ADDR mean, GM_ADDR variance, GM_ADDR gamma,
    GM_ADDR beta, GM_ADDR rstd, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    float epsilon = tilingData.epsilon;
    AscendC::NormalizePara para(tilingData.aLength, tilingData.rLength, tilingData.rLengthWithPadding);
    if (TILING_KEY_IS(1)) {
        if (!tilingData.isNoBeta && !tilingData.isNoGamma) {
            KernelNormalize<CONFIG1> op;
            op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
            op.Process();
        } else if (!tilingData.isNoBeta && tilingData.isNoGamma) {
            KernelNormalize<CONFIG2> op;
            op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
            op.Process();
        } else if (tilingData.isNoBeta && !tilingData.isNoGamma) {
            KernelNormalize<CONFIG3> op;
            op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
            op.Process();
        } else if (tilingData.isNoBeta && tilingData.isNoGamma) {
            KernelNormalize<CONFIG4> op;
            op.Init(x, mean, variance, gamma, beta, rstd, y, epsilon, para);
            op.Process();
        }
    }
}
```