LayerNormGrad

Function Usage

LayerNormGrad computes the backpropagation gradients of LayerNorm. Used alone, this API produces the outputs pd_x and resForGamma. When combined with LayerNormGradBeta, the resForGamma produced by this API can be passed to LayerNormGradBeta to generate the gamma and beta gradient outputs. You can therefore obtain the gradients with respect to x, gamma, and beta by using the two APIs together.
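
A hedged sketch of this two-call chain is shown below; the LayerNormGradBeta parameter order is an assumption (consult its own reference page for the exact signature), and all tensor and tiling names are illustrative:

// Step 1: LayerNormGrad produces pd_x and res_for_gamma.
AscendC::LayerNormGrad<float>(outputPdX, resForGamma, inputDy, inputX,
    inputVariance, inputMean, inputGamma, epsilon, gradTiling);
// Step 2: res_for_gamma (together with inputDy) feeds LayerNormGradBeta,
// which produces the gamma and beta gradients. Parameter order is assumed.
AscendC::LayerNormGradBeta<float>(outputPdGamma, outputPdBeta, resForGamma,
    inputDy, betaTiling);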

The formulas are as follows:

pd_xl(BSH) = data_dy * data_gamma
pd_var(H) = np.sum(((-0.5) * pd_xl * (data_x - data_mean) * np.power((data_variance + EPSLON), (-1.5))), reduce_axis, keepdims=True)
pd_mean(BS1) = np.sum(((-1.0) * pd_xl * np.power((data_variance + EPSLON), (-0.5))), reduce_axis, keepdims=True) + pd_var * (1.0 / H) * np.sum(((-2.0) * (data_x - data_mean)), reduce_axis, keepdims=True)
pd_x(BSH) = pd_xl * np.power((data_variance + EPSLON), (-0.5)) + pd_var * (2.0 / H) * (data_x - data_mean) + pd_mean * (1.0 / H)
res_for_gamma(BSH) = (data_x - data_mean) * np.power((data_variance + EPSLON), (-0.5))
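
Equivalently, in standard notation (a direct LaTeX rendering of the pseudocode above, where mu is data_mean, sigma^2 is data_variance, gamma is data_gamma, epsilon is EPSLON, and sums run over the H axis):

$$
\begin{aligned}
pd\_xl &= dy \cdot \gamma \\
pd\_var &= \sum_{H} \left(-\tfrac{1}{2}\right) pd\_xl \,(x - \mu)\,(\sigma^2 + \epsilon)^{-3/2} \\
pd\_mean &= \sum_{H} (-1)\, pd\_xl \,(\sigma^2 + \epsilon)^{-1/2} \;+\; pd\_var \cdot \frac{1}{H} \sum_{H} (-2)(x - \mu) \\
pd\_x &= pd\_xl \,(\sigma^2 + \epsilon)^{-1/2} \;+\; pd\_var \cdot \frac{2}{H}(x - \mu) \;+\; pd\_mean \cdot \frac{1}{H} \\
res\_for\_gamma &= (x - \mu)\,(\sigma^2 + \epsilon)^{-1/2}
\end{aligned}
$$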

Principles

The figure below illustrates the internal algorithm block diagram of the LayerNormGrad high-level API, taking as an example float inputs in ND format: inputDy[B, S, H], inputX[B, S, H], inputVariance[B, S], inputMean[B, S], and inputGamma[H].

Figure 1 LayerNormGrad algorithm block diagram

The computation process is divided into the following steps, all of which are performed on vectors:

  1. ComputePdX1: Calculate inputDy * inputGamma and store the result to x1Tensor.
  2. ComputePdX2: Broadcast inputMean to shape [B, S, H] through Brcb, calculate inputX - inputMean, and store the result to x2Tensor.
  3. ComputePdVar: Calculate np.sum((-0.5) * x1Tensor * x2Tensor * np.power((inputVariance + EPSLON), (-1.5))) and store the result to pdVarTensor. The power operation is composed from the Ln, Exp, and Muls basic APIs (see the sketch after this list).
  4. ComputePdMean: Calculate np.sum((-1.0) * x1Tensor * np.power((inputVariance + EPSLON), (-0.5))) + pd_var * (1.0 / H) * np.sum((-2.0) * x2Tensor) and store the result to pdMeanTensor, with the power operation composed in the same way. Reuse the intermediate results to calculate resForGamma according to x2Tensor * np.power((inputVariance + EPSLON), (-0.5)).
  5. ComputePdX: Calculate x1Tensor * np.power((inputVariance + EPSLON), (-0.5)) + pd_var * (2.0 / H) * x2Tensor + pd_mean * (1.0 / H) and store the result to outputPdX.
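
As referenced in steps 3 to 5, the power operation is composed from basic APIs via the identity x^p = exp(p * ln(x)). A minimal sketch of that composition, assuming illustrative tensor names and element count (the real implementation is internal to the API):

// (v + EPSLON)^p computed elementwise as exp(p * ln(v + EPSLON)); here p = -0.5.
AscendC::Adds(tmpTensor, inputVariance, epsilon, calCount); // v + EPSLON
AscendC::Ln(tmpTensor, tmpTensor, calCount);                // ln(v + EPSLON)
AscendC::Muls(tmpTensor, tmpTensor, -0.5f, calCount);       // p * ln(v + EPSLON)
AscendC::Exp(tmpTensor, tmpTensor, calCount);               // (v + EPSLON)^(-0.5)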

Prototype

Due to the complex computation in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. Obtain the temporary space size (BufferSize) as follows: call the GetLayerNormGradMaxMinTmpSize API provided in LayerNormGrad Tiling to obtain the maximum and minimum required temporary space sizes. The minimum size guarantees correct functionality; the maximum size improves performance.
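
A host-side sketch of this query follows; the exact parameter list of GetLayerNormGradMaxMinTmpSize is an assumption here, so consult the LayerNormGrad Tiling reference for the authoritative signature:

// Host side: query the temporary-space bounds for a [B, S, H] float input.
uint32_t maxTmpSize = 0;
uint32_t minTmpSize = 0;
ge::Shape srcShape({B, S, H});                // assumed shape type and parameters
AscendC::GetLayerNormGradMaxMinTmpSize(srcShape, sizeof(float),
    false /* isReuseSource */, maxTmpSize, minTmpSize);
// minTmpSize guarantees functional correctness; sizes up to maxTmpSize improve performance.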

The temporary space can be allocated through the API framework or passed in by developers through the sharedTmpBuffer input parameter, so the LayerNormGrad API has two function prototypes.

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    template <typename T, bool isReuseSource = false>
    __aicore__ inline void LayerNormGrad(const LocalTensor<T> &outputPdX, const LocalTensor<T> &resForGamma, const LocalTensor<T> &inputDy, const LocalTensor<T> &inputX, const LocalTensor<T> &inputVariance, const LocalTensor<T> &inputMean, const LocalTensor<T> &inputGamma, LocalTensor<uint8_t> &sharedTmpBuffer, T epsilon, LayerNormGradTiling &tiling, const LayerNormGradShapeInfo &shapeInfo = {})
    

    This method enables developers to allocate and manage the temporary buffer space themselves and to reuse the buffer after the API call, so that the buffer is not repeatedly allocated and released, which improves flexibility and buffer utilization.
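
    A minimal sketch of calling this overload, assuming a TBuf named tmpBuf was initialized in advance (for example, pipe.InitBuffer(tmpBuf, minTmpSize)) with at least the minimum size reported by GetLayerNormGradMaxMinTmpSize; all tensor names are illustrative:

    // Reuse one developer-managed temporary buffer across calls.
    AscendC::LocalTensor<uint8_t> sharedTmpBuffer = tmpBuf.Get<uint8_t>();
    AscendC::LayerNormGrad<float, false>(outputPdXLocal, resForGammaLocal,
        inputDyLocal, inputXLocal, inputVarianceLocal, inputMeanLocal,
        inputGammaLocal, sharedTmpBuffer, epsilon, tiling,
        {AscendC::DataFormat::ND});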

  • Allocate the temporary space through the API framework.
    template <typename T, bool isReuseSource = false>
    __aicore__ inline void LayerNormGrad(const LocalTensor<T> &outputPdX, const LocalTensor<T> &resForGamma, const LocalTensor<T> &inputDy, const LocalTensor<T> &inputX, const LocalTensor<T> &inputVariance, const LocalTensor<T> &inputMean, const LocalTensor<T> &inputGamma, T epsilon, LayerNormGradTiling &tiling, const LayerNormGradShapeInfo &shapeInfo = {})
    

    With this method, developers do not need to allocate the temporary space themselves, but sufficient space must still be reserved for the API to use.

Parameters

Table 1 Parameters in the template

  T: Data type of the operand.

  isReuseSource: Whether the source operand can be modified. The default value is false. If developers permit modification of the source operand, enabling this parameter can save buffer space: when set to true, the buffer space of inputX is reused during the internal computation of this API; when set to false, it is not. This parameter can be enabled for float inputs but not for half inputs. For details about how to use isReuseSource, see More Examples.

Table 2 API parameters

All tensor operands below are of type LocalTensor; for the definition of the LocalTensor data structure, see LocalTensor. The last-axis length of every tensor operand must be 32-byte aligned.

  outputPdX (output): Destination operand pd_x, with a shape of [B, S, H].

  resForGamma (output): Destination operand res_for_gamma, with a shape of [B, S, H].

  inputDy (input): Source operand, with a shape of [B, S, H]. Its data type must match that of the destination operands.

  inputX (input): Source operand, with a shape of [B, S, H]. Its data type must match that of the destination operands.

  inputVariance (input): Variance, with a shape of [B, S]. Its data type must match that of the destination operands. Call the LayerNorm API in advance to obtain the variance.

  inputMean (input): Mean, with a shape of [B, S]. Its data type must match that of the destination operands. Call the LayerNorm API in advance to obtain the mean.

  inputGamma (input): Source operand, with a shape of [H]. Its data type must match that of the destination operands.

  sharedTmpBuffer (input): Shared buffer used to store temporary data generated during the internal computation of this API. It enables developers to manage the buffer themselves and reuse it across API calls, so that the buffer is not repeatedly allocated and released, improving flexibility and buffer utilization. The type is LocalTensor<uint8_t>, and the supported TPosition values are VECIN, VECCALC, and VECOUT.

  epsilon (input): A small constant added to the variance to prevent division by zero.

  tiling (input): Tiling information required for LayerNormGrad computation.

  shapeInfo (input): Layout format of each LayerNormGrad input, of type LayerNormGradShapeInfo. The default value is DataFormat::ND, which is the only layout documented for this API. The specific definition is as follows:
struct LayerNormGradShapeInfo {
    DataFormat dataFormat = DataFormat::ND;
};

Returns

None

Availability

Constraints

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • The tensor space of src and dst can be reused.
  • The input shape must be in ND format.
  • If the input data does not meet the alignment requirements, developers need to pad the data. The padded elements should be set to 0 to prevent abnormal values from affecting network computation (see the sketch after this list).
  • The last axis (H axis) cannot be split.
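
One way to satisfy the zero-padding requirement above is to zero-fill the aligned local tensor before copying in the valid elements; a hedged sketch (names are illustrative, and the copy API for non-aligned element counts depends on the platform):

// Zero-fill the whole 32-byte-aligned region first so the padded tail stays 0.
AscendC::Duplicate(inputXLocal, 0.0f, alignedLength);
// Then copy only the valid elements from global memory, e.g. with DataCopyPad
// on platforms that support it (copyParams/padParams describe the valid extent).
AscendC::DataCopyPad(inputXLocal, inputXGlobal, copyParams, padParams);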

Example

In this sample, the shape of inputX and inputDy is [2, 32, 16], the shape of inputVariance and inputMean is [2, 32], and the shape of inputGamma is [16]. The shape of outputPdX and resForGamma is [2, 32, 16]. The data layout is in ND format and the data type is float. The memory space of the source operand is not reused.

For details about the complete example, see layernorm_grad.

#include "kernel_operator.h"

namespace MyCustomKernel {
struct VecTiling {
    LayerNormGradTiling layernormGradTilingData;
    float epsilon = 0;
};

template <bool isReuseSource = false> class KernelLayernormGrad {
public:
    __aicore__ inline KernelLayernormGrad() {}
    __aicore__ inline void Init(GM_ADDR inputXGm, GM_ADDR inputDyGm, GM_ADDR inputVarianceGm, GM_ADDR inputMeanGm,
        GM_ADDR inputGammaGm, GM_ADDR outputPdXGm, GM_ADDR resForGammaGm, VecTiling tilingData)
    {
        this->epsilon = tilingData.epsilon;
        tiling_ = tilingData.layernormGradTilingData;
        this->bLength = tiling_.bLength;
        this->sLength = tiling_.sLength;
        this->hLength = tiling_.hLength;
        bshLength = bLength * sLength * hLength;
        bsLength = bLength * sLength;
        inputXGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float *>(inputXGm), bshLength);
        inputDyGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float *>(inputDyGm), bshLength);
        inputVarianceGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float *>(inputVarianceGm), bsLength);
        inputMeanGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float *>(inputMeanGm), bsLength);
        inputGammaGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float *>(inputGammaGm), hLength);
        outputPdXGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float *>(outputPdXGm), bshLength);
        outputResForGammaGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ float *>(resForGammaGm), bshLength);
        pipe.InitBuffer(inQueueX, 1, sizeof(float) * bshLength);
        pipe.InitBuffer(inQueueDy, 1, sizeof(float) * bshLength);
        pipe.InitBuffer(inQueueVariance, 1, sizeof(float) * bsLength);
        pipe.InitBuffer(inQueueMean, 1, sizeof(float) * bsLength);
        pipe.InitBuffer(inQueueGamma, 1, sizeof(float) * hLength);
        pipe.InitBuffer(outQueuePdX, 1, sizeof(float) * bshLength);
        pipe.InitBuffer(outQueueResForGamma, 1, sizeof(float) * bshLength);
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<float> inputXLocal = inQueueX.AllocTensor<float>();
        AscendC::LocalTensor<float> inputDyLocal = inQueueDy.AllocTensor<float>();
        AscendC::LocalTensor<float> inputVarianceLocal = inQueueVariance.AllocTensor<float>();
        AscendC::LocalTensor<float> inputMeanLocal = inQueueMean.AllocTensor<float>();
        AscendC::LocalTensor<float> inputGammaLocal = inQueueGamma.AllocTensor<float>();
        
        AscendC::DataCopy(inputXLocal, inputXGlobal, bshLength);
        AscendC::DataCopy(inputDyLocal, inputDyGlobal, bshLength);
        AscendC::DataCopy(inputVarianceLocal, inputVarianceGlobal, bsLength);
        AscendC::DataCopy(inputMeanLocal, inputMeanGlobal, bsLength);
        AscendC::DataCopy(inputGammaLocal, inputGammaGlobal, hLength);
        inQueueX.EnQue(inputXLocal);
        inQueueDy.EnQue(inputDyLocal);
        inQueueVariance.EnQue(inputVarianceLocal);
        inQueueMean.EnQue(inputMeanLocal);
        inQueueGamma.EnQue(inputGammaLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<float> inputXLocal = inQueueX.DeQue<float>();
        AscendC::LocalTensor<float> inputDyLocal = inQueueDy.DeQue<float>();
        AscendC::LocalTensor<float> inputVarianceLocal = inQueueVariance.DeQue<float>();
        AscendC::LocalTensor<float> inputMeanLocal = inQueueMean.DeQue<float>();
        AscendC::LocalTensor<float> inputGammaLocal = inQueueGamma.DeQue<float>();
        AscendC::LocalTensor<float> outputPdXLocal = outQueuePdX.AllocTensor<float>();
        AscendC::LocalTensor<float> outputResForGammaLocal = outQueueResForGamma.AllocTensor<float>();
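        // One call computes both pd_x and res_for_gamma; with this overload the
        // framework allocates the required temporary buffer internally.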
        AscendC::LayerNormGrad<float, isReuseSource>(outputPdXLocal, outputResForGammaLocal,
            inputDyLocal, inputXLocal, inputVarianceLocal, inputMeanLocal, inputGammaLocal,
            epsilon, tiling_, {AscendC::DataFormat::ND});
        outQueuePdX.EnQue(outputPdXLocal);
        outQueueResForGamma.EnQue(outputResForGammaLocal);
        inQueueX.FreeTensor(inputXLocal);
        inQueueDy.FreeTensor(inputDyLocal);
        inQueueVariance.FreeTensor(inputVarianceLocal);
        inQueueMean.FreeTensor(inputMeanLocal);
        inQueueGamma.FreeTensor(inputGammaLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<float> outputPdXLocal = outQueuePdX.DeQue<float>();
        AscendC::LocalTensor<float> outputResForGammaLocal = outQueueResForGamma.DeQue<float>();
        AscendC::DataCopy(outputPdXGlobal, outputPdXLocal, bshLength);
        AscendC::DataCopy(outputResForGammaGlobal, outputResForGammaLocal, bshLength);
        outQueuePdX.FreeTensor(outputPdXLocal);
        outQueueResForGamma.FreeTensor(outputResForGammaLocal);
    }
private:
    AscendC::GlobalTensor<float> inputXGlobal;
    AscendC::GlobalTensor<float> inputDyGlobal;
    AscendC::GlobalTensor<float> inputVarianceGlobal;
    AscendC::GlobalTensor<float> inputMeanGlobal;
    AscendC::GlobalTensor<float> inputGammaGlobal;
    AscendC::GlobalTensor<float> outputPdXGlobal;
    AscendC::GlobalTensor<float> outputResForGammaGlobal;
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueDy;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueVariance;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueMean;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueGamma;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueuePdX;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueResForGamma;
    uint32_t bLength;
    uint32_t sLength;
    uint32_t hLength;
    float epsilon;
    LayerNormGradTiling tiling_;
    uint32_t bshLength;
    uint32_t bsLength;
};
}

extern "C" __global__ __aicore__ void layernorm_grad_custom(GM_ADDR inputXGm, GM_ADDR inputDyGm, GM_ADDR inputVarianceGm, 
    GM_ADDR inputMeanGm, GM_ADDR inputGammaGm, GM_ADDR outputPdXGm, GM_ADDR resForGammaGm, 
    GM_ADDR workspace, GM_ADDR tiling)
{
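    // Vector-only kernel: skip execution on the cube (AIC) core.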
    if ASCEND_IS_AIC {
        return;
    }
    MyCustomKernel::VecTiling tilingData;
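    // CopyTiling parses the tiling data from global memory; its definition is in the complete sample (layernorm_grad).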
    CopyTiling(&tilingData, tiling);
    MyCustomKernel::KernelLayernormGrad<false> op;
    op.Init(inputXGm, inputDyGm, inputVarianceGm, inputMeanGm, inputGammaGm, outputPdXGm, resForGammaGm, tilingData);
    op.Process();
}