UnPad
Function Usage
Removes padding from a two-dimensional tensor (height x width) in the width direction. The width of the source tensor must be 32-byte aligned; if it is not, this API must not be called for unpadding. Typical scenario: the tensor width is already 32-byte aligned, and unpadding restores the original (unaligned) width. For example, with the half data type, a 16 x 16 tensor is unpadded to 16 x 15.
Prototype
Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required maximum and minimum temporary space sizes using the GetUnPadMaxMinTmpSize API provided in UnPad Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.
The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the UnPad API.
- Pass the temporary space through the sharedTmpBuffer input parameter.
```cpp
template <typename T>
__aicore__ inline void UnPad(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor,
    UnPadParams &unPadParams, LocalTensor<uint8_t> &sharedTmpBuffer, UnPadTiling &tiling)
```
This method enables developers to allocate and manage the temporary memory space on their own, and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.
- Allocate the temporary space through the API framework.
```cpp
template <typename T>
__aicore__ inline void UnPad(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor,
    UnPadParams &unPadParams, UnPadTiling &tiling)
```
With this method, developers do not need to allocate the temporary space themselves, but the required space size must still be reserved.
Parameters
| Parameter | Function |
|---|---|
| T | Data type of the operand. |
| Parameter | Input/Output | Description |
|---|---|---|
| dstTensor | Output | Destination operand, of type LocalTensor, with a two-dimensional shape. For details about the definition of the LocalTensor data structure, see LocalTensor. |
| srcTensor | Input | Source operand, of type LocalTensor, with a two-dimensional shape. For details about the definition of the LocalTensor data structure, see LocalTensor. |
| unPadParams | Input | UnPad parameters, of the UnPadParams data type. The structure contains two fields: leftPad, the number of padded elements to remove from the left side of each row, and rightPad, the number of padded elements to remove from the right side of each row. |
| sharedTmpBuffer | Input | Shared buffer used to store temporary data generated during internal API computation. It enables developers to manage the sharedTmpBuffer space themselves and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see UnPad Tiling. The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT. |
| tiling | Input | Tiling information required for computation. For details about how to obtain the tiling information, see UnPad Tiling. |
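The relationship between the pad counts and the output width can be modeled as below. The struct is a sketch inferred from the sample kernel in the Example section (which reads `unPadParams.leftPad` and `unPadParams.rightPad`); the exact field types in the Ascend C headers may differ:

```cpp
#include <cstdint>

// Sketch of the UnPadParams structure; field names come from the sample
// kernel, field types (uint16_t) are an assumption.
struct UnPadParamsSketch {
    uint16_t leftPad = 0;   // padded elements removed from the left of each row
    uint16_t rightPad = 0;  // padded elements removed from the right of each row
};

// Resulting row width after unpadding, as used by the sample kernel when
// sizing its output queue: width - leftPad - rightPad.
inline uint16_t UnPaddedWidth(uint16_t width, const UnPadParamsSketch &p)
{
    return static_cast<uint16_t>(width - p.leftPad - p.rightPad);
}
```

For the 16 x 16 example with `rightPad = 1`, the unpadded width is 15.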
Returns
None
Availability
Precautions
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Example
In this sample, the width of the tensor is already 32-byte aligned. Using the half data type, a 16 x 16 tensor becomes 16 x 15 after unpadding (rightPad = 1). The input data type is half.
```cpp
#include "kernel_operator.h"

template <typename T> class KernelUnPad {
public:
    __aicore__ inline KernelUnPad() {}
    __aicore__ inline void Init(GM_ADDR dstGm, GM_ADDR srcGm, uint16_t heightIn, uint16_t widthIn,
        uint16_t oriWidthIn, AscendC::UnPadParams &unPadParamsIn, const UnPadTiling &tilingData)
    {
        height = heightIn;
        width = widthIn;
        oriWidth = oriWidthIn;
        unPadParams = unPadParamsIn;
        srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
        pipe.InitBuffer(inQueueSrcVecIn, 1, height * width * sizeof(T));
        pipe.InitBuffer(inQueueSrcVecOut, 1,
            height * (width - unPadParams.leftPad - unPadParams.rightPad) * sizeof(T));
        tiling = tilingData;
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, height * width);
        inQueueSrcVecIn.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> dstLocal = inQueueSrcVecIn.DeQue<T>();
        AscendC::LocalTensor<T> srcOutLocal = inQueueSrcVecOut.AllocTensor<T>();
        AscendC::UnPad(srcOutLocal, dstLocal, unPadParams, tiling);
        inQueueSrcVecOut.EnQue(srcOutLocal);
        inQueueSrcVecIn.FreeTensor(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> srcOutLocalDe = inQueueSrcVecOut.DeQue<T>();
        AscendC::DataCopy(dstGlobal, srcOutLocalDe,
            height * (width - unPadParams.leftPad - unPadParams.rightPad));
        inQueueSrcVecOut.FreeTensor(srcOutLocalDe);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrcVecIn;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> inQueueSrcVecOut;
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;
    uint16_t height;
    uint16_t width;
    uint16_t oriWidth;
    AscendC::UnPadParams unPadParams;
    UnPadTiling tiling;
};

extern "C" __global__ __aicore__ void kernel_unpad_half_16_16_16(
    GM_ADDR src_gm, GM_ADDR dst_gm, __gm__ uint8_t *tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelUnPad<half> op;
    AscendC::UnPadParams unPadParams{0, 1};
    op.Init(dst_gm, src_gm, 16, 16, 16, unPadParams, tilingData.unpadTilingData);
    op.Process();
}
```