SoftmaxGrad

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product 's AI Core	√
Atlas inference product 's Vector Core	x
Atlas training products	x

Function

If the product of non-last axis lengths of the input tensor [m₀, m₁, ..., m_t, n] (t ≥ 0) is considered as m, the shape of the input tensor is [m, n]. This API performs gradient backward propagation on the input tensor [m, n]. Below is the formula.

$\text{[math]}$

When the input shape is in ND format, the internal reduce process is performed along the last axis. When the input shape is in NZ format, the internal reduce process is performed along the last and first axes. For details about the reduce process, see the figures in SoftMax.

For ease of understanding, the formula expressed through a Python script is as follows, where src, grad, and isFront are the source operands (input), and dst is the destination operand (output).

      
           def softmax_grad(grad, src, isFront = None):
    dst = grad x src
    dst = np.sum(dst, axis=-1, keepdims=True)
    if isFront :
         return dst
    dst = (grad - dst) x src
    return dst

Principles

The following figure shows the internal algorithm diagram of the SoftmaxGrad high-level APIs by taking the input tensor of the float type, in ND format, and with shape [m, k] as an example.

Figure 1 Diagram of the SoftmaxGrad algorithm

The computation process is divided into the following steps, all of which are performed on vectors:

mul: Multiply all data of input x and y. The computation result is saved to the temporary space temp.
reducesum: Sum up each row of temp data [m, k] to obtain [m, 1]. The computation result is saved to the temporary space.
broadcast: Pad the data [m, 1] obtained after reducesum by data block. For example, for the float type, extend [m, 1] to [m, 8].
Check whether the isFront mode is used. If yes, output the broadcast result and end the computation. If no, go to the next step.
broadcast: Extend [m, 8] to [m, k]. The computation result is saved to the temporary space.
Sub: Subtract all data of input x from the result obtained in the previous broadcast step.
mul: Multiply all data after sub with input y and output result z.

Prototype

Allocate the temporary space through the API framework.

        
             template <typename T, bool isReuseSource = false, bool isDataFormatNZ = false>
__aicore__ inline void SoftmaxGrad(const LocalTensor<T>& dstTensor, const LocalTensor<T>& gradTensor, const LocalTensor<T>& srcTensor, const SoftMaxTiling& tiling, bool isFront = false, const SoftMaxShapeInfo& softmaxShapeInfo = {})

Pass to the temporary space through the sharedTmpBuffer input parameter.

        
             template <typename T, bool isReuseSource = false, bool isDataFormatNZ = false>
__aicore__ inline void SoftmaxGrad(const LocalTensor<T>& dstTensor, const LocalTensor<T>& gradTensor, const LocalTensor<T>& srcTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const SoftMaxTiling& tiling, bool isFront = false, const SoftMaxShapeInfo& softmaxShapeInfo = {})

Due to the complex computation involved in the internal implementation of this API, extra temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.

When the API framework is used for temporary space allocation, developers do not need to allocate the space, but must reserve the required size for the space.

When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

If the API framework is used, developers must reserve the temporary space. If sharedTmpBuffer is used, developers must allocate space for the tensor. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required maximum and minimum temporary space sizes using the GetSoftMaxGradMaxTmpSize/GetSoftMaxGradMinTmpSize API provided in SoftmaxGrad Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the operand. For the Atlas A3 training products / Atlas A3 inference products , the supported data types are half and float. For the Atlas A2 training products / Atlas A2 inference products , the supported data types are half and float. For the Atlas 200I/500 A2 inference products , the supported data types are half and float. For the Atlas inference product 's AI Core, the supported data types are half and float.
isReuseSource	This parameter is reserved. Pass the default value false.
isDataFormatNZ	Whether the current input and output data is in NZ format. The default data format is ND, that is, the default value of this parameter is false. For the Atlas 200I/500 A2 inference products , the NZ format is not supported.

Table 2 API parameters

Parameter

Input/Output

Description

dstTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The length of the last axis must be 32-byte aligned. The shape of dstTensor is the same as those of gradTensor and srcTensor.

gradTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The length of the last axis must be 32-byte aligned. The shape of gradTensor is the same as those of dstTensor and srcTensor.

srcTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The length of the last axis must be 32-byte aligned. The shape of srcTensor is the same as those of dstTensor and gradTensor.

sharedTmpBuffer

Input

Temporary space.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The data type of this operand is fixed at uint8_t.

This parameter is used to store intermediate variables during complex internal API computation and is provided by developers.

For details about how to obtain the temporary space size (BufferSize), see SoftmaxGrad Tiling.

softmaxShapeInfo

Input

Shape of srcTensor, which is of the SoftMaxShapeInfo type. The specific definition is as follows:

           
                struct SoftMaxShapeInfo {
uint32_t srcM; // Product of lengths of non-last axes
uint32_t srcK; // Length of the last axis, which must be 32-byte aligned
uint32_t oriSrcM; // Product of lengths of original non-last axes
uint32_t oriSrcK; // Length of the original last axis
};

Note that when the input and output data is in NZ format, the last axis length is the length of the reduce axis, that is, W0 × W1 in Figure 2 and the length of each non-last axis is H0 × H1.

tiling

Input

Tiling information required for softmaxgrad computation. For details about how to obtain the tiling information, see SoftmaxGrad Tiling.

isFront

Input

Whether to enable isFront computation. If the value is True, the length of the last axis of dstTensor must be fixed at 32 bytes.

Returns

None

Restrictions

The tensor space of srcTensor and dstTensor can be reused.
For details about the operand address alignment requirements, see General Address Alignment Restrictions.
The address of sharedTmpBuffer must not overlap that of the source or destination operand.
When srcM ! is set to oriSrcM or srcK ! is set to oriSrcK in softmaxShapeInfo, for the original input (oriSrcM, oriSrcK) on the GM, you need to pad data to (srcM, srcK) in the M or K direction. The padded data will be involved in some computation. In the scenario where the input and output are reused, the computation result of the API will overwrite the original data padded to the srcTensor. In the scenario where the input and output are not reused, the computation result of the API will overwrite the data in dstTensor corresponding to the padded position of srcTensor.

Example

In this example, the shape size of the input srcTensor, gradtensor, and output dstTensor is [128, 64], isFront is false, the data type is half, and the input and output data layout is ND, the srcTensor and dstTensor spaces cannot be reused, and the base block is disabled. For more operator examples, see softmaxgrad operator sample.

       
        
          
          
            #include "kernel_operator.h"

template <typename T>
class KernelSoftmaxGrad {
public:
    __aicore__ inline KernelSoftmaxGrad()
    {}
    __aicore__ inline void Init(
        __gm__ uint8_t *src1Gm, __gm__ uint8_t *src2Gm, __gm__ uint8_t *dstGm, const SoftMaxTiling &tilingData)
    {
        elementNumPerBlk = 32 / sizeof(T);
        src1Global.SetGlobalBuffer((__gm__ T *)src1Gm);
        src2Global.SetGlobalBuffer((__gm__ T *)src2Gm);
        dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
        pipe.InitBuffer(inQueueSrc1, 1, height * width * sizeof(T));
        pipe.InitBuffer(inQueueSrc2, 1, height * width * sizeof(T));
        pipe.InitBuffer(outQueueDst, 1, height * width * sizeof(T));
        tiling = tilingData;
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal1 = inQueueSrc1.AllocTensor<T>();
        AscendC::LocalTensor<T> srcLocal2 = inQueueSrc2.AllocTensor<T>();
        AscendC::DataCopy(srcLocal1, src1Global, height * width);
        AscendC::DataCopy(srcLocal2, src2Global, height * width);
        inQueueSrc1.EnQue(srcLocal1);
        inQueueSrc2.EnQue(srcLocal2);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> srcLocal1 = inQueueSrc1.DeQue<T>();
        AscendC::LocalTensor<T> srcLocal2 = inQueueSrc2.DeQue<T>();
        AscendC::LocalTensor<T> dstLocal = outQueueDst.AllocTensor<T>();

        AscendC::SoftMaxShapeInfo srcShape = {height, width, height, width};
        AscendC::SoftmaxGrad<T>(dstLocal, srcLocal2, srcLocal1, tiling, false, srcShape);

        outQueueDst.EnQue<T>(dstLocal);
        inQueueSrc1.FreeTensor(srcLocal1);
        inQueueSrc2.FreeTensor(srcLocal2);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueueDst.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, height * width);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc2;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;

    AscendC::GlobalTensor<T> src1Global, src2Global, dstGlobal;
    uint32_t elementNumPerBlk = 0;
    uint32_t width = 64;
    uint32_t height = 128;
    SoftMaxTiling tiling;
};

extern "C" __global__ __aicore__ void softmax_grad_kernel_half(
    __gm__ uint8_t *src1Gm, __gm__ uint8_t *src2Gm, __gm__ uint8_t *dstGm, __gm__ uint8_t *tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelSoftmaxGrad<half> op;
    op.Init(src1Gm, src2Gm, dstGm, tilingData.softmaxTilingData);
    op.Process();
}

           

         

       
      

Parent topic: SoftMax APIs