LayerNormGradBeta

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

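Computes the gradients of gamma and beta in the backward pass of layer normalization. Used together with LayerNormGrad, it completes the backward outputs: pdx comes from LayerNormGrad, and the gamma and beta gradients come from this API.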

The formulas are as follows:

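$$\text{outputPdGamma}[h] = \sum_{b=0}^{B-1}\sum_{s=0}^{S-1} \text{inputDy}[b,s,h] \cdot \text{resForGamma}[b,s,h]$$

$$\text{outputPdBeta}[h] = \sum_{b=0}^{B-1}\sum_{s=0}^{S-1} \text{inputDy}[b,s,h]$$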
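As a reading aid, the reduction can be sketched on the host in plain C++. This is a minimal sketch, not the Ascend C implementation. It assumes that the gamma gradient is the sum of inputDy * resForGamma over the B and S axes and that the beta gradient is the sum of inputDy over the same axes, which matches the standard layer normalization backward pass and the [B, S, H] to [H] shapes in the parameter tables. The function name LayerNormGradBetaRef is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference only; hypothetical helper, not part of Ascend C.
// resForGamma and inputDy are row-major [B, S, H]; both outputs have length H.
void LayerNormGradBetaRef(const std::vector<float>& resForGamma,
                          const std::vector<float>& inputDy,
                          std::size_t b, std::size_t s, std::size_t h,
                          std::vector<float>& pdGamma,
                          std::vector<float>& pdBeta)
{
    pdGamma.assign(h, 0.0f);
    pdBeta.assign(h, 0.0f);
    // Reduce over the flattened B * S rows; the H axis is kept.
    for (std::size_t row = 0; row < b * s; ++row) {
        for (std::size_t col = 0; col < h; ++col) {
            const std::size_t idx = row * h + col;
            pdGamma[col] += inputDy[idx] * resForGamma[idx];
            pdBeta[col] += inputDy[idx];
        }
    }
}
```

A reference like this can be used to check kernel results on small shapes before moving to real data.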

Prototype

The internal implementation of this API involves complex computation and requires extra temporary space to store intermediate variables. To obtain the temporary space size (BufferSize), call the GetLayerNormGradBetaMaxMinTmpSize API provided in LayerNormGradBeta Tiling, which returns the maximum and minimum temporary space sizes. The minimum size guarantees correct functionality; the maximum size improves performance.
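One possible way to pick a size between the two bounds is sketched below in plain C++. The helper ChooseTmpBufferSize and its availableBytes parameter are hypothetical and not part of Ascend C. The sketch takes the minimum and maximum sizes as plain integers instead of calling GetLayerNormGradBetaMaxMinTmpSize, whose exact signature is documented in LayerNormGradBeta Tiling, and it assumes that the maximum is not smaller than the minimum.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper, not part of Ascend C.
// minSize / maxSize: bounds in bytes, as reported by the tiling API.
// availableBytes: bytes the operator can spare for temporary space.
// Returns 0 when even the minimum does not fit; otherwise returns the
// largest usable size, capped at maxSize.
inline uint32_t ChooseTmpBufferSize(uint32_t minSize, uint32_t maxSize, uint32_t availableBytes)
{
    if (availableBytes < minSize) {
        return 0;  // below the minimum, correct functionality is not guaranteed
    }
    return std::min(maxSize, availableBytes);
}
```

Any value between the minimum and the maximum is a trade-off: more space leaves less room for other buffers but may improve performance.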

The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the LayerNormGradBeta API.

Pass the temporary space through the sharedTmpBuffer input parameter.

template <typename T, bool isReuseSource = false>
__aicore__ inline void LayerNormGradBeta(const LocalTensor<T>& outputPdGamma, const LocalTensor<T>& outputPdBeta, const LocalTensor<T>& resForGamma, const LocalTensor<T>& inputDy, const LocalTensor<uint8_t>& sharedTmpBuffer, const LayerNormGradBetaTiling& tiling)

This method enables developers to allocate and manage the temporary memory space on their own, and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated or deallocated, improving the flexibility and buffer utilization.

Allocate the temporary space through the API framework.

template <typename T, bool isReuseSource = false>
__aicore__ inline void LayerNormGradBeta(const LocalTensor<T>& outputPdGamma, const LocalTensor<T>& outputPdBeta, const LocalTensor<T>& resForGamma, const LocalTensor<T>& inputDy, LayerNormGradBetaTiling& tiling)

When using this method, developers do not need to allocate the temporary space themselves, but must leave enough free space for the required temporary space size.

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the operand. For the Atlas A3 training products/Atlas A3 inference products, the Atlas A2 training products/Atlas A2 inference products, and the Atlas inference product's AI Core, the supported data types are half and float.
isReuseSource	Whether the source operand can be modified. The default value is false. If the source operand may be modified, enable this parameter to reduce memory usage. If this parameter is set to true, the inputDy memory space is reused during internal computation of this API. If this parameter is set to false, the inputDy memory space is not reused. This parameter can be enabled for float inputs but not for half inputs. For details about how to use isReuseSource, see Example 4.

**Table 2** API parameters
Parameter	Input/Output	Description
outputPdGamma	Output	Destination operand, with a shape of [H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The length of the last axis must be 32-byte aligned. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
outputPdBeta	Output	Destination operand, with a shape of [H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The length of the last axis must be 32-byte aligned. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
resForGamma	Input	Source operand, with a shape of [B, S, H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of resForGamma must be the same as that of the destination operand, and the last axis length must be 32-byte aligned. The LayerNormGrad API needs to be called in advance to obtain the value of resForGamma. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
inputDy	Input	Source operand, with a shape of [B, S, H]. For details about the definition of the LocalTensor data structure, see LocalTensor. The data type of inputDy must be the same as that of the destination operand, and the last axis length must be 32-byte aligned. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
sharedTmpBuffer	Input	Shared buffer, which is used to store temporary data generated during internal API computation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see LayerNormGradBeta Tiling. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
tiling	Input	Tiling information required for LayerNormGradBeta computation. For details about how to obtain the tiling information, see LayerNormGradBeta Tiling.

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
The tensor space of the source operand and destination operand can be reused.
The input shape must be in ND format.
If the input data does not meet the alignment requirements, developers need to pad the data. The padded data should be set to 0 to prevent abnormal values from affecting network computation.
The last axis (H axis) cannot be split.
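The padding restriction above can be illustrated with a small host-side sketch in plain C++. The helpers AlignedLastAxis and PadLastAxisWithZeros are hypothetical and not part of Ascend C. The sketch assumes a row-major [rows, H] layout with rows = B * S, and an element size that divides 32 bytes, which holds for half (2 bytes) and float (4 bytes).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side helpers, not part of Ascend C.

// Smallest element count >= h whose size in bytes is a multiple of 32.
// Assumes elemSize divides 32 (half = 2 bytes, float = 4 bytes).
inline std::size_t AlignedLastAxis(std::size_t h, std::size_t elemSize)
{
    const std::size_t elemsPerBlock = 32 / elemSize;
    return (h + elemsPerBlock - 1) / elemsPerBlock * elemsPerBlock;
}

// Copies a row-major [rows, h] buffer into [rows, hAligned],
// filling the tail of every row with zeros.
template <typename T>
std::vector<T> PadLastAxisWithZeros(const std::vector<T>& src, std::size_t rows, std::size_t h)
{
    const std::size_t hAligned = AlignedLastAxis(h, sizeof(T));
    std::vector<T> dst(rows * hAligned, T(0));
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < h; ++c) {
            dst[r * hAligned + c] = src[r * h + c];
        }
    }
    return dst;
}
```

Zero is a natural padding value here because the padded positions of inputDy then add nothing to the sums over B and S, so the padded output columns stay at zero.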

Example

#include "kernel_operator.h"

template <typename T, bool isReuseSource = false>
class KernelLayernormGradBeta {
public:
    __aicore__ inline KernelLayernormGradBeta()
    {}
    // Binds the global-memory tensors and sizes the queues from the tiling shape [B, S, H].
    __aicore__ inline void Init(__gm__ uint8_t *resForGammaGm, __gm__ uint8_t *inputDyGm,
        __gm__ uint8_t *outputPdGammaGm, __gm__ uint8_t *outputPdBetaGm, const LayerNormGradBetaTiling &tiling)
    {
        this->bLength = tiling.bLength;
        this->sLength = tiling.sLength;
        this->hLength = tiling.hLength;
        this->tiling = tiling;
        bshLength = bLength * sLength * hLength;
        bsLength = bLength * sLength;
        // Inputs hold B * S * H elements; outputs hold H elements.
        resForGammaGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(resForGammaGm), bshLength);
        inputDyGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(inputDyGm), bshLength);
        outputPdGammaGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(outputPdGammaGm), hLength);
        outputPdBetaGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(outputPdBetaGm), hLength);
        // The whole input is processed in one pass, so each queue holds a single full-size buffer.
        pipe.InitBuffer(inQueueResForGamma, 1, sizeof(T) * bshLength);
        pipe.InitBuffer(inQueueDy, 1, sizeof(T) * bshLength);
        pipe.InitBuffer(outQueuePdGamma, 1, sizeof(T) * hLength);
        pipe.InitBuffer(outQueuePdBeta, 1, sizeof(T) * hLength);
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> resForGammaLocal = inQueueResForGamma.AllocTensor<T>();
        AscendC::LocalTensor<T> inputDyLocal = inQueueDy.AllocTensor<T>();
        AscendC::DataCopy(resForGammaLocal, resForGammaGlobal, bshLength);
        AscendC::DataCopy(inputDyLocal, inputDyGlobal, bshLength);
        inQueueResForGamma.EnQue(resForGammaLocal);
        inQueueDy.EnQue(inputDyLocal);
    }
    __aicore__ inline void Compute()
    {
        // Take the staged inputs and allocate the [H]-shaped outputs.
        AscendC::LocalTensor<T> resForGammaLocal = inQueueResForGamma.DeQue<T>();
        AscendC::LocalTensor<T> inputDyLocal = inQueueDy.DeQue<T>();
        AscendC::LocalTensor<T> outputPdGammaLocal = outQueuePdGamma.AllocTensor<T>();
        AscendC::LocalTensor<T> outputPdBetaLocal = outQueuePdBeta.AllocTensor<T>();

        // Prototype without sharedTmpBuffer: the API framework allocates the temporary space.
        AscendC::LayerNormGradBeta<T, isReuseSource>(
            outputPdGammaLocal, outputPdBetaLocal, resForGammaLocal, inputDyLocal, tiling);

        // Hand the results to CopyOut and release the input buffers.
        outQueuePdGamma.EnQue<T>(outputPdGammaLocal);
        outQueuePdBeta.EnQue<T>(outputPdBetaLocal);
        inQueueResForGamma.FreeTensor(resForGammaLocal);
        inQueueDy.FreeTensor(inputDyLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> outputPdGammaLocal = outQueuePdGamma.DeQue<T>();
        AscendC::LocalTensor<T> outputPdBetaLocal = outQueuePdBeta.DeQue<T>();
        AscendC::DataCopy(outputPdGammaGlobal, outputPdGammaLocal, hLength);
        AscendC::DataCopy(outputPdBetaGlobal, outputPdBetaLocal, hLength);
        outQueuePdGamma.FreeTensor(outputPdGammaLocal);
        outQueuePdBeta.FreeTensor(outputPdBetaLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueResForGamma, inQueueDy;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueuePdGamma, outQueuePdBeta;
    AscendC::GlobalTensor<T> resForGammaGlobal;
    AscendC::GlobalTensor<T> inputDyGlobal;
    AscendC::GlobalTensor<T> outputPdGammaGlobal;
    AscendC::GlobalTensor<T> outputPdBetaGlobal;
    uint32_t bLength;
    uint32_t sLength;
    uint32_t hLength;
    uint32_t bshLength;
    uint32_t bsLength;
    LayerNormGradBetaTiling tiling;
};

extern "C" __global__ __aicore__ void kernel_layernorm_grad_beta_operator(
    GM_ADDR outputPdGammaGm, GM_ADDR outputPdBetaGm, GM_ADDR resForGammaGm, GM_ADDR inputDyGm, GM_ADDR tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelLayernormGradBeta<half, false> op;
    op.Init(resForGammaGm, inputDyGm, outputPdGammaGm, outputPdBetaGm, tilingData.layerNormGradBetaTiling);
    op.Process();
}
