ConfusionTranspose
Function Usage
Performs data layout and reshape operations on input data. The specific functions are as follows:
[Scenario 1: NZ2ND, axis 1 and axis 2 interchanged]
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, S, N, H/N], origin_shape:[B, S, N, H/N], format:"ND", origin_format:"ND"}
[Scenario 2: NZ2NZ, axis 1 and axis 2 interchanged]
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, S, H/N/16, N/16, 16, 16], origin_shape:[B, S, N, H/N], format:"NZ", origin_format:"ND"}
[Scenario 3: NZ2NZ, last axis split]
Input Tensor { shape:[B, H/16, S/16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
[Scenario 4: NZ2ND, last axis split]
Input Tensor { shape:[B, H/16, S/16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, N, S, H/N], origin_shape:[B, N, S, H/N], format:"ND", origin_format:"ND"}
[Scenario 5: NZ2ND, last axis combination]
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, S, H], origin_shape:[B, S, H], format:"ND", origin_format:"ND"}
[Scenario 6: NZ2NZ, last axis combination]
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, H/16, S/16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}
[Scenario 7: two-dimensional transpose]
Two-dimensional tensors can be transposed on the UB. The values of H and W in srcShape must be integer multiples of 16.
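All NZ tensors above follow the same rule: the last two axes [X, Y] of the ND origin shape are stored as [Y/16, X/16, 16, 16] fractal blocks. The following host-side sketch (the helper NdShapeToNz is illustrative only, not part of the API) makes that mapping explicit.

```cpp
#include <cstdint>
#include <vector>

// Illustrative helper (not part of the Ascend C API): converts an ND origin
// shape [..., X, Y] into the corresponding NZ storage shape [..., Y/16, X/16, 16, 16].
// Assumes X and Y are multiples of 16, as required by ConfusionTranspose.
std::vector<int64_t> NdShapeToNz(const std::vector<int64_t>& ndShape)
{
    std::vector<int64_t> nzShape(ndShape.begin(), ndShape.end() - 2);
    int64_t x = ndShape[ndShape.size() - 2];
    int64_t y = ndShape[ndShape.size() - 1];
    nzShape.push_back(y / 16);   // number of fractal columns
    nzShape.push_back(x / 16);   // number of fractal rows
    nzShape.push_back(16);
    nzShape.push_back(16);
    return nzShape;
}

// Example (scenario 1 input): origin_shape [B, N, S, H/N] -> NZ shape [B, N, H/N/16, S/16, 16, 16].
```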
Principles
The computation process for each of the seven functional scenarios of ConfusionTranspose is described below.
Scenario 1 (NZ2ND, axis 1 and axis 2 interchanged). The computation process is as follows:
Perform cyclic processing in the H/N, N, and B directions in sequence.
- First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares along the S direction and store them consecutively in temp.
- Second TransDataTo5HD step: Transpose the S/16 16 × 16 squares from temp to dst. In dst, the format is ND: two consecutive rows of data from the same square are offset by (H/N) × N elements on the destination operand, and the same row of data from every two adjacent squares in the H direction is offset by 16 elements, as illustrated in the sketch below.
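As an informal illustration of these destination offsets (a host-side sketch, not the library implementation): an element at origin coordinate (b, s, n, h) of the scenario 1 ND output lands at the linear offset computed below, from which the (H/N) × N row stride and the 16-element square stride follow.

```cpp
#include <cstdint>

// Illustrative only: linear offset of element (b, s, n, h) in the scenario 1
// ND output of origin shape [B, S, N, H/N] (hn = H/N). Consecutive rows of one
// 16 x 16 square differ by one step in s, i.e. N * hn = (H/N) * N elements,
// and adjacent squares along the H direction differ by 16 in h, i.e. 16 elements.
inline int64_t Nd0213Offset(int64_t b, int64_t s, int64_t n, int64_t h,
                            int64_t S, int64_t N, int64_t hn)
{
    return ((b * S + s) * N + n) * hn + h;
}
```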
Scenario 2 (NZ2NZ, axis 1 and axis 2 interchanged). The computation process is as follows:
Perform cyclic processing in the H/N, N, and B directions in sequence.
- First TransDataTo5HD step: Take S/16 consecutive 16 × 16 squares along the S direction and store them consecutively in temp.
- Second TransDataTo5HD step: Transpose the S/16 16 × 16 squares from temp to dst. In dst, the format is NZ: two consecutive rows of data from the same square are offset by (H/N) × N elements on the destination operand, and the same row of data from every two adjacent squares in the H direction is offset by N × 16 elements.
Scenario 3 (NZ2NZ, last axis split). The computation process is as follows:
Perform cyclic processing in the H and B directions in sequence.
- First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares into temp1 each time.
- DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 in each of the first H/N/16 passes, and H/N%16 × S elements are moved to temp2 in the last pass (see the sketch after this list).
- Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format is NZ: two consecutive rows of data from the same square are offset by 16 elements on the destination operand, and the same row of data from every two adjacent squares in the H direction is offset by S × 16 elements.
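A minimal sketch of the DataCopy chunking described above (the same rule recurs in the next three scenarios). MoveToTemp2 is a hypothetical placeholder, not an Ascend C API; it stands for one copy of rows × S elements into temp2.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical placeholder: stands for one DataCopy of rows * s elements into temp2.
static void MoveToTemp2(int64_t rows, int64_t s)
{
    std::printf("move %lld x %lld elements to temp2\n",
                static_cast<long long>(rows), static_cast<long long>(s));
}

// Illustrative only: the chunking used by the DataCopy step (hn = H/N, s = S).
void DataCopyChunks(int64_t hn, int64_t s)
{
    if (hn <= 16) {
        MoveToTemp2(hn, s);           // a single move of (H/N) * S elements
        return;
    }
    for (int64_t i = 0; i < hn / 16; ++i) {
        MoveToTemp2(16, s);           // first H/N/16 moves: 16 * S elements each
    }
    if (hn % 16 != 0) {
        MoveToTemp2(hn % 16, s);      // final move: (H/N % 16) * S elements
    }
}
```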
Scenario 4 (NZ2ND, last axis split). The computation process is as follows:
Perform cyclic processing in the H and B directions in sequence.
- First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares into temp1 each time.
- DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 in each of the first H/N/16 passes, and H/N%16 × S elements are moved to temp2 in the last pass.
- Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format is ND: two consecutive rows of data from the same square are offset by (H/N + 16 - 1)/16 × 16 elements on the destination operand, and the same row of data from every two adjacent squares in the H direction is offset by (H/N + 16 - 1)/16 × 16 × S elements (see the alignment sketch below).
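The (H/N + 16 - 1)/16 × 16 term above is simply H/N rounded up to the next multiple of 16; a one-line sketch of that integer arithmetic:

```cpp
#include <cstdint>

// Round x up to the nearest multiple of 16 using integer arithmetic,
// i.e. the (x + 16 - 1) / 16 * 16 term used in the scenario 4 offsets.
inline int64_t AlignUp16(int64_t x)
{
    return (x + 16 - 1) / 16 * 16;   // e.g. AlignUp16(20) == 32, AlignUp16(32) == 32
}
```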
Scenario 5 (NZ2ND, last axis combination). The computation process is as follows:
Perform cyclic processing in the H and B directions in sequence.
- First TransDataTo5HD step: Transpose an S × 16 square to temp1 each time.
- DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 in each of the first H/N/16 passes, and H/N%16 × S elements are moved to temp2 in the last pass.
- Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format is ND: two consecutive rows of data from the same square are offset by (H + 16 - 1)/16 × 16 elements on the destination operand, and the same row of data from every two adjacent squares in the H direction is offset by H/N × S elements.
Scenario 6 (NZ2NZ, last axis combination). The computation process is as follows:
Perform cyclic processing in the H and B directions in sequence.
- First TransDataTo5HD step: Transpose an S × 16 square to temp1 each time.
- DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 in each of the first H/N/16 passes, and H/N%16 × S elements are moved to temp2 in the last pass.
- Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format is NZ: two consecutive rows of data from the same square are offset by 16 elements on the destination operand, and the same row of data from every two adjacent squares in the H direction is offset by S × 16 elements.
Scenario 7 (two-dimensional transpose). The computation process is as follows:
- Call TransDataTo5HD with appropriately arranged source and destination operand address sequences to transpose [H, W] to [W, H]. The format is ND for both src and dst.
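For reference, scenario 7 is functionally an ordinary row-major two-dimensional transpose; the plain C++ sketch below documents the expected result only and is not the on-chip implementation.

```cpp
#include <cstdint>
#include <vector>

// Reference semantics of scenario 7: dst[w][h] = src[h][w] for a row-major
// [H, W] tensor. The on-chip version performs this with TransDataTo5HD on
// 16-aligned blocks; this loop only documents the expected result.
std::vector<float> Transpose2D(const std::vector<float>& src, int64_t H, int64_t W)
{
    std::vector<float> dst(static_cast<size_t>(H * W));
    for (int64_t h = 0; h < H; ++h) {
        for (int64_t w = 0; w < W; ++w) {
            dst[w * H + h] = src[h * W + w];
        }
    }
    return dst;
}
```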
Prototype
Due to the complex computation in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space size (BufferSize) is obtained as follows: call the GetConfusionTransposeMaxMinTmpSize API provided in ConfusionTranspose Tiling to obtain the required maximum and minimum temporary space sizes. The minimum space ensures correct functionality, while the maximum space improves performance.
The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the ConfusionTranspose API.
- Pass the temporary space through the sharedTmpBuffer input parameter.
template <typename T>
__aicore__ inline void ConfusionTranspose(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, TransposeType transposeType, ConfusionTransposeTiling& tiling)
This method enables developers to allocate and manage the temporary memory space on their own, and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.
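A minimal usage sketch of this prototype, assuming pipe, srcLocal, dstLocal, tiling, and bufferSize (obtained via the ConfusionTranspose Tiling APIs) have already been prepared, for example as in the kernel at the end of this topic:

```cpp
// Sketch only: shows how the sharedTmpBuffer overload might be called.
// bufferSize is assumed to come from GetConfusionTransposeMaxMinTmpSize
// (see ConfusionTranspose Tiling); pipe, srcLocal, dstLocal, and tiling
// are assumed to be prepared as in the full kernel example below.
AscendC::TBuf<AscendC::TPosition::VECCALC> calcBuf;
pipe.InitBuffer(calcBuf, bufferSize);                                   // reserve temporary space once
AscendC::LocalTensor<uint8_t> sharedTmpBuffer = calcBuf.Get<uint8_t>(); // reuse it across calls
AscendC::ConfusionTranspose(dstLocal, srcLocal, sharedTmpBuffer,
    AscendC::TransposeType::TRANSPOSE_NZ2ND_0213, tiling);
```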
- Allocate the temporary space through the API framework.
template <typename T>
__aicore__ inline void ConfusionTranspose(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, TransposeType transposeType, ConfusionTransposeTiling& tiling)
When using this method, developers do not need to allocate the space themselves, but they must still reserve enough space of the required size.
Parameters
| Parameter | Function |
|---|---|
| T | Data type of an operand. |
| Parameter | Input/Output | Description |
|---|---|---|
| dstTensor | Output | Destination operand, with a type of LocalTensor. For details about the definition of the LocalTensor data structure, see LocalTensor. |
| srcTensor | Input | Source operand, with a type of LocalTensor. For details about the definition of the LocalTensor data structure, see LocalTensor. |
| sharedTmpBuffer | Input | Shared buffer, which is used to store temporary data generated during internal API computation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see ConfusionTranspose Tiling. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| transposeType | Input | Data layout and reshape type, specified by the TransposeType enumeration class. |
| tiling | Input | Tiling information required for computation. For details about how to obtain the tiling information, see ConfusionTranspose Tiling. |
Returns
None
Availability
Precautions
- For details about the alignment requirements of the operand address offset, see General Restrictions.
Examples
This example applies to scenario 1 (NZ2ND, axis 1 and axis 2 interchanged).
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, S, N, H/N], origin_shape:[B, S, N, H/N], format:"ND", origin_format:"ND"}
B = 1, N = 2, S = 64, H/N = 32. The input data type is half.
#include "kernel_operator.h" template <typename T> class KernelConfusionTranspose { public: __aicore__ inline KernelConfusionTranspose(){} __aicore__ inline void Init(__gm__ uint8_t *srcGm, __gm__ uint8_t *dstGm, const ConfusionTransposeTiling &tiling) { srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm, B * N * S * hnDiv); dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm, B * N * S * hnDiv); pipe.InitBuffer(inQueueSrcVecIn, 1, B * N * S * hnDiv * sizeof(T)); pipe.InitBuffer(inQueueSrcVecOut, 1, B * N * S * hnDiv * sizeof(T)); this->tiling = tiling; } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.AllocTensor<T>(); AscendC::DataCopy(srcLocal, srcGlobal, B * N * S * hnDiv); inQueueSrcVecIn.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.DeQue<T>(); AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.AllocTensor<T>(); AscendC::ConfusionTranspose(dstLocal, srcLocal, AscendC::TransposeType::TRANSPOSE_NZ2ND_0213, this->tiling); inQueueSrcVecOut.EnQue<T>(dstLocal); inQueueSrcVecIn.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.DeQue<T>(); AscendC::DataCopy(dstGlobal, dstLocal, B * N * S * hnDiv); inQueueSrcVecOut.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrcVecIn; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> inQueueSrcVecOut; AscendC::GlobalTensor<T> srcGlobal; AscendC::GlobalTensor<T> dstGlobal; uint32_t B = 1; uint32_t N = 2; uint32_t S = 64; uint32_t hnDiv = 32; ConfusionTransposeTiling tiling; }; extern "C" __global__ __aicore__ void confusion_transpose_custom( GM_ADDR src_gm, GM_ADDR dst_gm, GM_ADDR workspace, GM_ADDR tiling) { GET_TILING_DATA(tilingData, tiling); KernelConfusionTranspose<half> op; op.Init(src_gm, dst_gm, tilingData.confusionTransposeTilingData); op.Process(); } |