UnPad

Function Usage

Unpads a two-dimensional tensor (height * width) in the width direction. If the width of the tensor is not 32-byte aligned, this API should not be called for unpadding. This API is used in the following scenario: The width of the tensor is already 32-byte aligned. Take half as an example, for example, 16 x 16, and unpad it to 16 x 15.

Prototype

Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required maximum and minimum temporary space sizes using the GetUnPadMaxMinTmpSize API provided in UnPad Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.

The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the UnPad API.

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    1
    2
    template <typename T>
    __aicore__ inline void UnPad(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, UnPadParams &unPadParams, LocalTensor<uint8_t> &sharedTmpBuffer, UnPadTiling &tiling)
    

    This method enables developers to allocate and manage the temporary memory space on their own, and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

  • Allocate the temporary space through the API framework.
    1
    2
    template <typename T>
    __aicore__ inline void UnPad(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, UnPadParams &unPadParams, UnPadTiling &tiling)
    

    When using this method, developers do not need to allocate the space, but must reserve the required size for the space.

Parameters

Table 1 Parameters in the template

Parameter

Function

T

Data type of the operand.

Table 2 API parameters

Parameter

Input/Output

Meaning

dstTensor

Output

Destination operand, with a type of LocalTensor and a two-dimensional shape. For details about the definition of the LocalTensor data structure, see LocalTensor.

srcTensor

Input

Source operand, with a type of LocalTensor and a two-dimensional shape. For details about the definition of the LocalTensor data structure, see LocalTensor.

UnPadParams

Input

Detailed UnPad parameters of the UnPadParams data type. The specific parameters of the structure are as follows:

  • leftPad: data volume for left unpadding. The value of leftPad must be less than 32 bytes. The unit is column. This parameter does not take effect currently.
  • rightPad: data volume for right unpadding. The value of rightPad must be less than 32 bytes. The unit is column. Currently, the unpadding operation can be performed only on the right.

The definition of the UnPadParams structure is as follows:

1
2
3
4
struct UnPadParams {
    uint16_t leftPad = 0;
    uint16_t rightPad = 0;
};

sharedTmpBuffer

Input

Shared buffer, which is used to store temporary data generated during internal API computation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see UnPad Tiling.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

tiling

Input

Tiling information required for computation. For details about how to obtain the tiling information, see UnPad Tiling.

Returns

None

Availability

Precautions

  • For details about the alignment requirements of the operand address offset, see General Restrictions.

Example

In this sample, the width of the tensor is already 32-byte aligned. Take half as an example: For instance, 16 x 16 becomes 16 x 15 after unpadding. The input data type is half.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
#include "kernel_operator.h"

template <typename T>
class KernelUnPad {
public:
    __aicore__ inline KernelUnPad()
    {}
    __aicore__ inline void Init(GM_ADDR dstGm, GM_ADDR srcGm, uint16_t heightIn, uint16_t widthIn, uint16_t oriWidthIn,
        AscendC::UnPadParams &unPadParamsIn, const UnPadTiling &tilingData)
    {
        height = heightIn;
        width = widthIn;
        oriWidth = oriWidthIn;
        unPadParams = unPadParamsIn;
        srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
        pipe.InitBuffer(inQueueSrcVecIn, 1, height * width * sizeof(T));
        pipe.InitBuffer(inQueueSrcVecOut, 1, height * (width - unPadParams.leftPad - unPadParams.rightPad) * sizeof(T));
        tiling = tilingData;
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, height * width);
        inQueueSrcVecIn.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> dstLocal = inQueueSrcVecIn.DeQue<T>();
        AscendC::LocalTensor<T> srcOutLocal = inQueueSrcVecOut.AllocTensor<T>();
        AscendC::UnPad(srcOutLocal, dstLocal, unPadParams, tiling);
        inQueueSrcVecOut.EnQue(srcOutLocal);
        inQueueSrcVecIn.FreeTensor(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> srcOutLocalDe = inQueueSrcVecOut.DeQue<T>();
        AscendC::DataCopy(dstGlobal, srcOutLocalDe, height * (width - unPadParams.leftPad - unPadParams.rightPad));
        inQueueSrcVecOut.FreeTensor(srcOutLocalDe);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrcVecIn;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> inQueueSrcVecOut;
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;
    uint16_t height;
    uint16_t width;
    uint16_t oriWidth;
    AscendC::UnPadParams unPadParams;
    UnPadTiling tiling;
};

extern "C" __global__ __aicore__ void
    kernel_unpad_half_16_16_16(GM_ADDR src_gm, GM_ADDR dst_gm, __gm__ uint8_t *tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelUnPad<half> op;
    AscendC::UnPadParams unPadParams{0, 1};
    op.Init(dst_gm, src_gm, 16, 16, 16, unPadParams, tilingData.unpadTilingData);
    op.Process();
}