SoftmaxGradFront

Function Usage

Treating the product of the non-last axis lengths of an input tensor with shape [m0, m1, ..., mt, n] (t ≥ 0) as m, the input tensor has shape [m, n]. This API performs the SoftmaxGradFront backward computation on the [m, n] input tensor. The formula is as follows.
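
For ND format, and consistent with the Python script in this section, the per-row result can be sketched as below. The index names i and j and the range notation are assumptions introduced here for illustration; dx, y, and d are the operands named in the script.

```latex
d_{i} = \sum_{j=0}^{n-1} dx_{i,j} \cdot y_{i,j}, \qquad i = 0, 1, \dots, m-1
```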

When the input is in ND format, the internal reduction is performed along the last axis. When the input is in NZ format, the internal reduction is performed along the last and first axes. For details about the reduction process, see the figures in SoftMax.

For ease of understanding, the formula expressed through a Python script is as follows, where dx and y are the source operands (input), and d is the destination operand (output).

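import numpy as np

def softmax_grad_front(dx, y, is_fp16=False):
    # Compute in float32 regardless of the input precision.
    dx = dx.astype(np.float32)
    y = y.astype(np.float32)

    # Row-wise sum of the element-wise product; shape [m, 1], for example [1024, 1].
    d = (dx * y).sum(axis=-1, keepdims=True)
    if is_fp16:
        d = d.astype(np.float16)
    return d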

Principles

The following figure shows the internal algorithm of the SoftmaxGradFront high-level API, using an input tensor of the float type, in ND format, with shape [m, k] as an example.

Figure 1 Diagram of the SoftmaxGradFront algorithm

The computation process is divided into the following steps, all of which are performed on vectors:

  1. mul: Multiply the inputs x and y element-wise. The computation result is saved to a temporary space temp.
  2. reducesum: Sum up each row of temp data [m, k] to obtain [m, 1]. The computation result is saved to the temporary space.
  3. broadcast: Pad [m, 1] by data block. For example, for the float type, extend [m, 1] to [m, 8] and output z.
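
The three steps above can be sketched on the host in plain Python. This is only an illustrative model of the data flow, not the vector implementation; the function name softmax_grad_front_steps and the elem_size parameter are assumptions introduced here.

```python
def softmax_grad_front_steps(dx, y, elem_size=4):
    """Host-side model of mul -> reducesum -> broadcast for [m, k] row lists.

    elem_size is the element size in bytes (4 for float, 2 for half). It is a
    hypothetical parameter used only to derive the 32-byte data block width.
    """
    block_elems = 32 // elem_size  # elements per 32-byte data block

    # 1. mul: element-wise product, kept in a temporary [m, k] buffer.
    temp = [[a * b for a, b in zip(row_dx, row_y)]
            for row_dx, row_y in zip(dx, y)]

    # 2. reducesum: sum each row of temp, giving m scalars ([m, 1]).
    row_sums = [sum(row) for row in temp]

    # 3. broadcast: repeat each row sum across one data block ([m, block_elems]).
    return [[s] * block_elems for s in row_sums]


z = softmax_grad_front_steps([[1.0, 2.0], [3.0, 4.0]],
                             [[0.5, 0.5], [1.0, 0.0]])
```

With the default elem_size of 4 (float), each output row holds 8 copies of its row sum; with elem_size=2 (half), it holds 16.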

Prototype

  • Allocate the temporary space through the API framework.
    template <typename T, bool isBasicBlock = false, bool isDataFormatNZ = false>
    __aicore__ inline void SoftmaxGradFront(const LocalTensor<T>& dstTensor, const LocalTensor<T>& gradTensor, const LocalTensor<T>& srcTensor, const SoftMaxTiling& tiling, const SoftMaxShapeInfo& softmaxShapeInfo = {})
    
  • Pass the temporary space through the sharedTmpBuffer input parameter.
    template <typename T, bool isBasicBlock = false, bool isDataFormatNZ = false>
    __aicore__ inline void SoftmaxGradFront(const LocalTensor<T>& dstTensor, const LocalTensor<T>& gradTensor, const LocalTensor<T>& srcTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const SoftMaxTiling& tiling, const SoftMaxShapeInfo& softmaxShapeInfo = {})
    

Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.

  • When the API framework is used for temporary space allocation, developers do not need to allocate the space, but must reserve the required size for the space.
  • When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not allocate any. Developers manage the sharedTmpBuffer space themselves and can reuse the buffer after calling the API, which avoids repeated allocation and deallocation and improves flexibility and buffer utilization.

In both cases, obtain the temporary space size (BufferSize) to reserve or allocate as follows: call the GetSoftMaxGradMaxTmpSize and GetSoftMaxGradMinTmpSize APIs provided in SoftmaxGrad Tiling to get the maximum and minimum required sizes. The minimum size is enough to ensure correct functionality, while the maximum size is intended to improve performance.

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the operand.

isBasicBlock

If the shape information and tiling strategy of both srcTensor and gradTensor meet the base block requirements, this parameter can be enabled to improve performance. By default, this parameter is disabled. Use either of the following methods to determine whether the base block requirements are met:

  • The shape information [m, n] of srcTensor and gradTensor must meet the following requirements:
    • The last axis length n is less than 2048 and greater than or equal to 256/sizeof(T). That is, the minimum value of n is 128 when the data type is half and 64 when the data type is float. In addition, n is a multiple of 64.
    • The product m of non-last axis lengths is a multiple of 8.
  • You can call IsBasicBlockInSoftMax to check whether the tiling strategy meets the tiling requirements of base blocks.

isDataFormatNZ

Whether the current input and output data is in NZ format. The default data format is ND, that is, the default value of this parameter is false.
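
The shape conditions listed for isBasicBlock can be sketched as a small host-side check. This is a minimal sketch under stated assumptions: the helper name meets_basic_block_shape and its parameters are hypothetical, and it covers only the shape conditions, not the tiling check done by IsBasicBlockInSoftMax.

```python
def meets_basic_block_shape(m, n, elem_size):
    """Check the [m, n] shape conditions listed for isBasicBlock.

    elem_size is sizeof(T) in bytes (2 for half, 4 for float).
    Hypothetical helper: the tiling strategy is not checked here.
    """
    min_n = 256 // elem_size  # 128 for half, 64 for float
    n_ok = min_n <= n < 2048 and n % 64 == 0
    m_ok = m % 8 == 0
    return n_ok and m_ok
```

For the shape used in the Example section ([128, 64] with half), this check returns False because n is below 256/sizeof(half) = 128, which is consistent with the base block being disabled there.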

Table 2 API parameters

Parameter

Input/Output

Description

dstTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The length of the last axis is fixed at 32 bytes, that is, the length of one data block, and all elements in this data block hold the same value. For example, for the half data type, all 16 elements in the data block are identical. The length of the non-last axis is the same as that of srcTensor.

gradTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The length of the last axis must be 32-byte aligned.

srcTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The length of the last axis must be 32-byte aligned.

sharedTmpBuffer

Input

Temporary space.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The data type of this operand is fixed at uint8_t.

This parameter is used to store intermediate variables during complex computation and is provided by developers.

For details about how to obtain the temporary space size (BufferSize), see SoftmaxGrad Tiling.

tiling

Input

Tiling information required for SoftmaxGradFront computation. For details about how to obtain the tiling information, see SoftmaxGrad Tiling.

softmaxShapeInfo

Input

Shape information of srcTensor, of the SoftMaxShapeInfo type. The structure is defined as follows:

struct SoftMaxShapeInfo {
    uint32_t srcM;    // Product of lengths of non-last axes.
    uint32_t srcK;    // Length of the last axis, which must be 32-byte aligned.
    uint32_t oriSrcM; // Product of lengths of original non-last axes.
    uint32_t oriSrcK; // Length of the original last axis.
};

Note that when the input and output data is in NZ format, the last axis length is the length of the reduce axis, that is, W0 × W1 in Figure 2, and the length of each non-last axis is H0 × H1.
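
The shape constraints in Table 2 can be illustrated with a short host-side calculation for ND format. This is a sketch only; the helper name dst_shape_and_alignment and its parameters are assumptions introduced here and are not part of the API.

```python
def dst_shape_and_alignment(m, n, elem_size):
    """Return the expected dstTensor shape and whether n is 32-byte aligned.

    m, n: shape [m, n] of srcTensor/gradTensor (ND format assumed).
    elem_size: sizeof(T) in bytes (2 for half, 4 for float).
    """
    block_elems = 32 // elem_size  # elements in one 32-byte data block
    last_axis_aligned = (n * elem_size) % 32 == 0
    return (m, block_elems), last_axis_aligned


shape, aligned = dst_shape_and_alignment(128, 64, 2)
```

For half inputs of shape [128, 64], this yields a dstTensor shape of [128, 16] with an aligned last axis, which matches the shapes used in the Example section.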

Returns

None

Availability

Precautions

  • For details about the alignment requirements of the operand address offset, see General Restrictions.

Example

In this example, the shape of the input srcTensor and gradTensor is [128, 64], the shape of the output dstTensor is [128, 16], the data type is half, the format of the input and output data is ND, and the base block is disabled.
#include "kernel_operator.h"

template <typename T> class KernelSoftmaxGrad {
public:
    __aicore__ inline KernelSoftmaxGrad() {}
    __aicore__ inline void Init(__gm__ uint8_t* src1Gm, __gm__ uint8_t* src2Gm, __gm__ uint8_t* dstGm, const SoftMaxTiling& tilingData)
    {
        elementNumPerBlk = 32 / sizeof(T);
        src1Global.SetGlobalBuffer((__gm__ T*)src1Gm);
        src2Global.SetGlobalBuffer((__gm__ T*)src2Gm);
        dstGlobal.SetGlobalBuffer((__gm__ T*)dstGm);
        pipe.InitBuffer(inQueueSrc1, 1, height * width * sizeof(T));
        pipe.InitBuffer(inQueueSrc2, 1, height * width * sizeof(T));
        pipe.InitBuffer(outQueueDst, 1, height * elementNumPerBlk * sizeof(T));
        tiling = tilingData;
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal1 = inQueueSrc1.AllocTensor<T>();
        AscendC::LocalTensor<T> srcLocal2 = inQueueSrc2.AllocTensor<T>();
        AscendC::DataCopy(srcLocal1, src1Global, height * width);
        AscendC::DataCopy(srcLocal2, src2Global, height * width);
        inQueueSrc1.EnQue(srcLocal1);
        inQueueSrc2.EnQue(srcLocal2);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> srcLocal1 = inQueueSrc1.DeQue<T>();
        AscendC::LocalTensor<T> srcLocal2 = inQueueSrc2.DeQue<T>();
        AscendC::LocalTensor<T> dstLocal = outQueueDst.AllocTensor<T>();
        AscendC::SoftMaxShapeInfo srcShape = { height, width, height, width };
        AscendC::SoftmaxGradFront<T>(dstLocal, srcLocal2, srcLocal1, tiling, srcShape);
        outQueueDst.EnQue<T>(dstLocal);
        inQueueSrc1.FreeTensor(srcLocal1);
        inQueueSrc2.FreeTensor(srcLocal2);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueueDst.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, height * elementNumPerBlk);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc2;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<T> src1Global, src2Global, dstGlobal;
    uint32_t elementNumPerBlk = 0;
    uint32_t width = 64;
    uint32_t height = 128;
    SoftMaxTiling tiling;
};

extern "C" __global__ __aicore__ void softmax_grad_kernel_half(__gm__ uint8_t* src1Gm, __gm__ uint8_t* src2Gm, __gm__ uint8_t* dstGm, __gm__ uint8_t* tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelSoftmaxGrad<half> op;
    op.Init(src1Gm, src2Gm, dstGm, tilingData.softmaxTilingData);
    op.Process();
}