ConfusionTranspose

Function Usage

Performs data layout and reshape operations on input data. The supported scenarios are as follows:

[Scenario 1: NZ2ND, axis 1 and axis 2 interchanged]

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, S, N, H/N], origin_shape:[B, S, N, H/N], format:"ND", origin_format:"ND"}

Figure 1 Data layout and reshape in scenario 1

[Scenario 2: NZ2NZ, axis 1 and axis 2 interchanged]

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, S, H/N/16, N/16, 16, 16], origin_shape:[B, S, N, H/N], format:"NZ", origin_format:"ND"}

Figure 2 Data layout and reshape in scenario 2

[Scenario 3: NZ2NZ, last axis split]

Input Tensor { shape:[B, H / 16, S / 16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Figure 3 Data layout and reshape in scenario 3

[Scenario 4: NZ2ND, last axis split]

Input Tensor { shape:[B, H / 16, S / 16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, N, S, H/N], origin_shape:[B, N, S, H/N], format:"ND", origin_format:"ND"}

Figure 4 Data layout and reshape in scenario 4

[Scenario 5: NZ2ND, last axis combination]

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, S, H], origin_shape:[B, S, H], format:"ND", origin_format:"ND"}

Figure 5 Data layout and reshape in scenario 5

[Scenario 6: NZ2NZ, last axis combination]

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, H/16, S/16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}

Figure 6 Data layout and reshape in scenario 6

[Scenario 7: two-dimensional transpose]

Two-dimensional tensors can be transposed on the UB. The values of H and W in srcShape must be integer multiples of 16.

Figure 7 Data layout and reshape in scenario 7

Principles

The algorithm block diagrams for the seven functional scenarios of ConfusionTranspose are shown in the following figures.

Figure 8 Scenario 1: NZ2ND, axis 1 and axis 2 interchanged

The computation process is as follows:

Iterate over the H/N, N, and B directions in sequence.

  1. First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares along the S direction and store them consecutively in temp.
  2. Second TransDataTo5HD step: Transpose the S/16 16 × 16 squares from temp to dst. In dst, the format ND is used: the address of two consecutive rows of data from the same square on the destination operand is offset by (H/N) × N elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by 16 elements.
Figure 9 Scenario 2: NZ2NZ, axis 1 and axis 2 interchanged

The computation process is as follows:

Iterate over the H/N, N, and B directions in sequence.

  1. First TransDataTo5HD step: Take S/16 consecutive 16 × 16 squares along the S direction and store them consecutively in temp.
  2. Second TransDataTo5HD step: Transpose the S/16 16 × 16 squares from temp to dst. In dst, the format NZ is used: the address of two consecutive rows of data from the same square on the destination operand is offset by (H/N) × N elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by N × 16 elements.
Figure 10 Scenario 3: NZ2NZ, last axis split

The computation process is as follows:

Iterate over the H and B directions in sequence.

  1. First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares into temp1 each time.
  2. DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to temp2 for the last time.
  3. Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format NZ is used: the address of two consecutive rows of data from the same square on the destination operand is offset by 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by S × 16 elements.
Figure 11 Scenario 4: NZ2ND, last axis split

The computation process is as follows:

Iterate over the H and B directions in sequence.

  1. First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares into temp1 each time.
  2. DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to temp2 for the last time.
  3. Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format ND is used: the address of two consecutive rows of data from the same square on the destination operand is offset by (H/N + 16 - 1)/16 × 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by (H/N + 16 - 1)/16 × 16 × S elements.
Figure 12 Scenario 5: NZ2ND, last axis combination

The computation process is as follows:

Iterate over the H and B directions in sequence.

  1. First TransDataTo5HD step: Transpose an S × 16 square to temp1 each time.
  2. DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to temp2 for the last time.
  3. Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format ND is used: the address of two consecutive rows of data from the same square on the destination operand is offset by (H + 16 - 1)/16 × 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by H/N × S elements.
Figure 13 Scenario 6: NZ2NZ, last axis combination

The computation process is as follows:

Iterate over the H and B directions in sequence.

  1. First TransDataTo5HD step: Transpose an S × 16 square to temp1 each time.
  2. DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to temp2 for the last time.
  3. Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format NZ is used: the address of two consecutive rows of data from the same square on the destination operand is offset by 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by S × 16 elements.
Figure 14 Scenario 7: two-dimensional transpose

The computation process is as follows:

  1. Call TransDataTo5HD to transpose [H, W] to [W, H] by configuring the source and destination operand address sequences. The format is ND for both src and dst.

Prototype

Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space size (BufferSize) is obtained as follows: call the GetConfusionTransposeMaxMinTmpSize API provided in ConfusionTranspose Tiling to obtain the maximum and minimum temporary space sizes required. The minimum space ensures correct functionality; the maximum space improves performance.

The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the ConfusionTranspose API.

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    template <typename T>
    __aicore__ inline void ConfusionTranspose(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const LocalTensor<uint8_t> &sharedTmpBuffer, TransposeType transposeType, ConfusionTransposeTiling& tiling)
    

This method enables developers to allocate and manage the temporary memory space on their own and to reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.

  • Allocate the temporary space through the API framework.
    template <typename T>
    __aicore__ inline void ConfusionTranspose(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, TransposeType transposeType, ConfusionTransposeTiling& tiling)
    

When using this method, developers do not need to allocate the temporary space themselves, but space of the required size must still be reserved.

Parameters

Table 1 Template parameters

Parameter

Function

T

Data type of an operand.

Table 2 API parameters

Parameter

Input/Output

Description

dstTensor

Output

Destination operand, with a type of LocalTensor. For details about the definition of the LocalTensor data structure, see LocalTensor.

srcTensor

Input

Source operand, with a type of LocalTensor. For details about the definition of the LocalTensor data structure, see LocalTensor.

sharedTmpBuffer

Input

Shared buffer, which is used to store temporary data generated during internal API computation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see ConfusionTranspose Tiling.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

transposeType

Input

Data layout and reshape type, which is the TransposeType enumeration class.

enum class TransposeType : uint8_t {
    TRANSPOSE_TYPE_NONE,            // default value
    TRANSPOSE_NZ2ND_0213,           // Scenario 1: NZ2ND, axis 1 and axis 2 interchanged
    TRANSPOSE_NZ2NZ_0213,           // Scenario 2: NZ2NZ, axis 1 and axis 2 interchanged
    TRANSPOSE_NZ2NZ_012_WITH_N,     // Scenario 3: NZ2NZ, last axis split
    TRANSPOSE_NZ2ND_012_WITH_N,     // Scenario 4: NZ2ND, last axis split
    TRANSPOSE_NZ2ND_012_WITHOUT_N,  // Scenario 5: NZ2ND, last axis combination
    TRANSPOSE_NZ2NZ_012_WITHOUT_N,  // Scenario 6: NZ2NZ, last axis combination
    TRANSPOSE_ND2ND_ONLY,           // Scenario 7: two-dimensional transpose
    TRANSPOSE_ND_UB_GM,             // Not supported currently.
    TRANSPOSE_GRAD_ND_UB_GM,        // Not supported currently.
    TRANSPOSE_ND2ND_B16,            // Not supported currently.
    TRANSPOSE_NCHW2NHWC,            // Not supported currently.
    TRANSPOSE_NHWC2NCHW             // Not supported currently.
};

tiling

Input

Tiling information required for computation. For details about how to obtain the tiling information, see ConfusionTranspose Tiling.

Returns

None

Availability

Precautions

  • For details about the alignment requirements of the operand address offset, see General Restrictions.

Examples

This example applies to scenario 1 (NZ2ND, axis 1 and axis 2 interchanged).

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, S, N, H/N], origin_shape:[B, S, N, H/N], format:"ND", origin_format:"ND"}

B = 1, N = 2, S = 64, H/N = 32. The input data type is half.

#include "kernel_operator.h"

template <typename T>
class KernelConfusionTranspose {
public:
    __aicore__ inline KernelConfusionTranspose(){}
    __aicore__ inline void Init(__gm__ uint8_t *srcGm, __gm__ uint8_t *dstGm, const ConfusionTransposeTiling &tiling)
    {
        srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm, B * N * S * hnDiv);
        dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm, B * N * S * hnDiv);
        pipe.InitBuffer(inQueueSrcVecIn, 1, B * N * S * hnDiv * sizeof(T));
        pipe.InitBuffer(inQueueSrcVecOut, 1, B * N * S * hnDiv * sizeof(T));
        this->tiling = tiling;
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, B * N * S * hnDiv);
        inQueueSrcVecIn.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.DeQue<T>();
        AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.AllocTensor<T>();
        AscendC::ConfusionTranspose(dstLocal, srcLocal, AscendC::TransposeType::TRANSPOSE_NZ2ND_0213, this->tiling);
        inQueueSrcVecOut.EnQue<T>(dstLocal);
        inQueueSrcVecIn.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, B * N * S * hnDiv);
        inQueueSrcVecOut.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrcVecIn;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> inQueueSrcVecOut;
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;
    uint32_t B = 1;
    uint32_t N = 2;
    uint32_t S = 64;
    uint32_t hnDiv = 32;
    ConfusionTransposeTiling tiling;
};

extern "C" __global__ __aicore__ void confusion_transpose_custom(
    GM_ADDR src_gm, GM_ADDR dst_gm, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelConfusionTranspose<half> op;
    op.Init(src_gm, dst_gm, tilingData.confusionTransposeTilingData);
    op.Process();
}