Transpose
Applicability
| Product | Supported |
|---|---|
| | √ |
| | √ |
| | x |
| | √ |
| | x |
| | x |
Function
Performs data layout and reshape operations on input data. The specific functions are as follows:
[Scenario 1: NZ2ND, axis 1 and axis 2 interchanged]
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, S, N, H/N], origin_shape:[B, S, N, H/N], format:"ND", origin_format:"ND"}
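To make the Scenario 1 layouts concrete, the following host-side C++ sketch maps each logical element of origin_shape [B, N, S, H/N] from the NZ input buffer to the ND output [B, S, N, H/N]. It is a plain CPU reference for checking results, not the on-device implementation. The names NzOffset and TransposeNz2Nd0213Ref are hypothetical, D stands for H/N, and the element order inside the NZ buffer (16 × 16 blocks stored row-major, blocks ordered along S first and then along H/N) is an assumption based on the shapes listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed NZ layout of a logical [S, D] matrix stored as [D/16, S/16, 16, 16]:
// element (s, d) sits in block (d/16, s/16), at row s%16 and column d%16.
inline std::size_t NzOffset(std::size_t s, std::size_t d, std::size_t S) {
    constexpr std::size_t kBlock = 16;
    return ((d / kBlock) * (S / kBlock) + s / kBlock) * kBlock * kBlock
           + (s % kBlock) * kBlock + d % kBlock;
}

// Hypothetical CPU reference for Scenario 1 (D = H/N):
// NZ [B, N, D/16, S/16, 16, 16] -> ND [B, S, N, D].
template <typename T>
std::vector<T> TransposeNz2Nd0213Ref(const std::vector<T>& src,
                                     std::size_t B, std::size_t N,
                                     std::size_t S, std::size_t D) {
    assert(S % 16 == 0 && D % 16 == 0);
    assert(src.size() == B * N * S * D);
    std::vector<T> dst(src.size());
    for (std::size_t b = 0; b < B; ++b)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t s = 0; s < S; ++s)
                for (std::size_t d = 0; d < D; ++d) {
                    // Read from the NZ buffer of head (b, n).
                    const std::size_t in = (b * N + n) * S * D + NzOffset(s, d, S);
                    // Write to the ND buffer with axes 1 and 2 interchanged.
                    const std::size_t out = ((b * S + s) * N + n) * D + d;
                    dst[out] = src[in];
                }
    return dst;
}
```

A reference like this can be run on the host to compare against the tensor produced by the kernel for small shapes.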
[Scenario 2: NZ2NZ, axis 1 and axis 2 interchanged]
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, S, H/N/16, N/16, 16, 16], origin_shape:[B, S, N, H/N], format:"NZ", origin_format:"ND"}
[Scenario 3: NZ2NZ, last axis split]
Input Tensor { shape:[B, H / 16, S / 16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
[Scenario 4: NZ2ND, last axis split]
Input Tensor { shape:[B, H / 16, S / 16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, N, S, H/N], origin_shape:[B, N, S, H/N], format:"ND", origin_format:"ND"}
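The last-axis split in Scenario 4 can be illustrated with a small host-side reference. The sketch below reads an NZ input whose origin_shape is [B, S, H] and writes a dense ND output [B, N, S, H/N]. The function name SplitLastAxisNz2NdRef is hypothetical. It assumes that head n owns the contiguous slice [n × H/N, (n + 1) × H/N) of the H axis, that 16 × 16 blocks are stored row-major with blocks ordered along S first, and that the output carries no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for Scenario 4:
// NZ [B, H/16, S/16, 16, 16] (origin [B, S, H]) -> ND [B, N, S, H/N].
template <typename T>
std::vector<T> SplitLastAxisNz2NdRef(const std::vector<T>& src,
                                     std::size_t B, std::size_t S,
                                     std::size_t H, std::size_t N) {
    constexpr std::size_t kBlock = 16;
    assert(S % kBlock == 0 && H % kBlock == 0 && H % N == 0);
    assert(src.size() == B * S * H);
    const std::size_t D = H / N;  // per-head width, H/N
    std::vector<T> dst(src.size());
    for (std::size_t b = 0; b < B; ++b)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t s = 0; s < S; ++s)
                for (std::size_t d = 0; d < D; ++d) {
                    const std::size_t h = n * D + d;  // assumed head-to-H mapping
                    const std::size_t in = b * S * H
                        + ((h / kBlock) * (S / kBlock) + s / kBlock) * kBlock * kBlock
                        + (s % kBlock) * kBlock + h % kBlock;
                    const std::size_t out = ((b * N + n) * S + s) * D + d;
                    dst[out] = src[in];
                }
    return dst;
}
```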
[Scenario 5: NZ2ND, last axis combination]
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, S, H], origin_shape:[B, S, H], format:"ND", origin_format:"ND"}
[Scenario 6: NZ2NZ, last axis combination]
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, H/16, S/16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}
[Scenario 7: 2D tensor transpose]
A 2D tensor can be transposed on the Unified Buffer (UB). H and W in srcShape must be integer multiples of 16.

Principles
The following figure shows the algorithm block diagram for each of the seven functional scenarios of transpose.
Scenario 1 (NZ2ND, axis 1 and axis 2 interchanged). The computation process is as follows:
Perform cyclic processing in the H/N, N, and B directions in sequence.
- First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares along the S direction into temp, and store them consecutively in temp.
- Second TransDataTo5HD step: Transpose the S/16 16 × 16 squares from temp to dst. In dst, the format ND is used, the address of two consecutive rows of data from the same square on the destination operand is offset by (H/N) × N elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by 16 elements.
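The two destination offsets named in the second step can be written out as index arithmetic. The sketch below computes where each row of a 16 × 16 square starts in the ND destination [B, S, N, H/N]. The struct and function names are hypothetical, D stands for H/N, and the sketch only illustrates the addressing rule, not the TransDataTo5HD call itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shape holder for this scenario; D = H/N.
struct Nz2NdShape {
    std::size_t B, N, S, D;
};

// Offset (in elements) of the first element of row `row` of the 16 x 16 square
// at block position (sBlk, dBlk) for head n of batch b, in ND dst [B, S, N, D].
inline std::size_t DstRowOffset(const Nz2NdShape& sh, std::size_t b, std::size_t n,
                                std::size_t sBlk, std::size_t dBlk, std::size_t row) {
    constexpr std::size_t kBlock = 16;
    assert(row < kBlock);
    const std::size_t rowStride = sh.D * sh.N;   // (H/N) x N elements between rows
    const std::size_t s = sBlk * kBlock + row;   // position along S
    return (b * sh.S + s) * rowStride            // rows are (H/N) x N apart
           + n * sh.D                            // start of head n inside a row
           + dBlk * kBlock;                      // adjacent squares are 16 apart
}
```

With B = 1, N = 2, S = 64, and H/N = 32, consecutive rows of one square are 64 elements apart and neighbouring squares in the H direction are 16 elements apart, matching the description above.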
Scenario 2 (NZ2NZ, axis 1 and axis 2 interchanged). The computation process is as follows:
Perform cyclic processing in the H/N, N, and B directions in sequence.
- First TransDataTo5HD step: Take S/16 consecutive 16 × 16 squares along the S direction into temp, and store them consecutively in temp.
- Second TransDataTo5HD step: Transpose the S/16 16 × 16 squares from temp to dst. In dst, the format NZ is used, the address of two consecutive rows of data from the same square on the destination operand is offset by (H/N) × N elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by N × 16 elements.
Scenario 3 (NZ2NZ, last axis split). The computation process is as follows:
Perform cyclic processing in the H and B directions in sequence.
- First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares into temp1 each time.
- DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to temp2 for the last time.
- Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format NZ is used, the address of two consecutive rows of data from the same square on the destination operand is offset by 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by S × 16 elements.
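The chunking rule of the DataCopy step can be sketched as a small host-side helper that lists how many elements each copy into temp2 moves. The function name DataCopyChunks is hypothetical, d stands for H/N, and skipping the final copy when H/N is an exact multiple of 16 is an assumption, since the text does not say what happens when H/N%16 is 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the element count of each copy into temp2; d = H/N, s = S.
inline std::vector<std::size_t> DataCopyChunks(std::size_t d, std::size_t s) {
    constexpr std::size_t kBlock = 16;
    assert(d > 0 && s > 0);
    std::vector<std::size_t> chunks;
    if (d <= kBlock) {
        chunks.push_back(d * s);  // H/N <= 16: a single copy of H/N x S elements
        return chunks;
    }
    for (std::size_t i = 0; i < d / kBlock; ++i) {
        chunks.push_back(kBlock * s);  // first H/N/16 copies: 16 x S elements each
    }
    if (d % kBlock != 0) {
        chunks.push_back((d % kBlock) * s);  // last copy: (H/N % 16) x S elements
    }
    return chunks;
}
```

For example, with H/N = 40 and S = 64 the helper yields two copies of 1024 elements followed by one copy of 512 elements.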
Scenario 4 (NZ2ND, last axis split). The computation process is as follows:
Perform cyclic processing in the H and B directions in sequence.
- First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares into temp1 each time.
- DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to temp2 for the last time.
- Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format ND is used, the address of two consecutive rows of data from the same square on the destination operand is offset by (H/N + 16 – 1)/16 × 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by (H/N + 16 – 1)/16 × 16 × S elements.
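The expression (H/N + 16 – 1)/16 × 16 rounds H/N up to the next multiple of 16 using integer division. The sketch below spells out that rounding and the two destination offsets built from it. The names AlignUp16, NdSplitStrides, and ComputeNdSplitStrides are hypothetical, and d stands for H/N.

```cpp
#include <cassert>
#include <cstddef>

// Rounds x up to the next multiple of 16 using integer division.
constexpr std::size_t AlignUp16(std::size_t x) {
    return (x + 16 - 1) / 16 * 16;
}

// Hypothetical holder for the two destination offsets, in elements.
struct NdSplitStrides {
    std::size_t rowStride;     // between consecutive rows of the same square
    std::size_t squareStride;  // between the same row of adjacent squares (H direction)
};

// d = H/N, s = S.
inline NdSplitStrides ComputeNdSplitStrides(std::size_t d, std::size_t s) {
    assert(d > 0 && s > 0);
    NdSplitStrides r;
    r.rowStride = AlignUp16(d);
    r.squareStride = AlignUp16(d) * s;
    return r;
}
```

For H/N = 20 and S = 64, the row offset is 32 elements and the square offset is 2048 elements.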
Scenario 5 (NZ2ND, last axis combination). The computation process is as follows:
Perform cyclic processing in the H and B directions in sequence.
- First TransDataTo5HD step: Transpose an S × 16 square to temp1 each time.
- DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to temp2 for the last time.
- Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format ND is used, the address of two consecutive rows of data from the same square on the destination operand is offset by (H + 16 – 1)/16 × 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by H/N × S elements.
Scenario 6 (NZ2NZ, last axis combination). The computation process is as follows:
Perform cyclic processing in the H and B directions in sequence.
- First TransDataTo5HD step: Transpose an S × 16 square to temp1 each time.
- DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to temp2 for the last time.
- Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format NZ is used, the address of two consecutive rows of data from the same square on the destination operand is offset by 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by S × 16 elements.
Scenario 7 (2D tensor transpose). The computation process is as follows:
- Call TransDataTo5HD to transpose [H, W] to [W, H] by setting different source operand and destination operand address sequences. The format is ND in both src and dst.
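As a host-side illustration of the result, the sketch below transposes an ND [H, W] matrix to [W, H] one 16 × 16 tile at a time, which is why H and W are expected to be multiples of 16. The function name Transpose2dBlockedRef is hypothetical, and the loop is a CPU stand-in for the address sequences passed to TransDataTo5HD, not a description of how that instruction works internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: ND [H, W] -> ND [W, H], processed in 16 x 16 tiles.
template <typename T>
std::vector<T> Transpose2dBlockedRef(const std::vector<T>& src,
                                     std::size_t H, std::size_t W) {
    constexpr std::size_t kBlock = 16;
    assert(H % kBlock == 0 && W % kBlock == 0);
    assert(src.size() == H * W);
    std::vector<T> dst(src.size());
    for (std::size_t hb = 0; hb < H; hb += kBlock)          // tile row
        for (std::size_t wb = 0; wb < W; wb += kBlock)      // tile column
            for (std::size_t r = 0; r < kBlock; ++r)        // row inside the tile
                for (std::size_t c = 0; c < kBlock; ++c)    // column inside the tile
                    dst[(wb + c) * H + (hb + r)] = src[(hb + r) * W + (wb + c)];
    return dst;
}
```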
Prototype
The internal implementation of this API involves complex computation and requires additional temporary space to store the intermediate variables it generates. To obtain the temporary space size (BufferSize), call the GetTransposeMaxMinTmpSize API provided in Transpose Tiling, which returns the maximum and minimum required sizes. The minimum size ensures correct functionality, while the maximum size improves performance.
The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the Transpose API.
- Pass the temporary space through the sharedTmpBuffer input parameter.
template <typename T>
__aicore__ inline void Transpose(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<uint8_t> &sharedTmpBuffer, TransposeType transposeType, ConfusionTransposeTiling& tiling)
This method enables developers to allocate and manage the temporary memory space on their own, and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated or deallocated, improving the flexibility and buffer utilization.
- Allocate the temporary space through the API framework.
template <typename T>
__aicore__ inline void Transpose(const LocalTensor<T>& dst, const LocalTensor<T>& src, TransposeType transposeType, ConfusionTransposeTiling& tiling)
When using this method, developers do not need to allocate the space, but must reserve the required size for the space.
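One way to use the minimum and maximum sizes when planning the temporary buffer is sketched below. The helper ChooseTmpBufferSize is hypothetical and not part of the API. It assumes the two bounds have already been obtained from GetTransposeMaxMinTmpSize and that `available` is the number of bytes the caller can spare. It reserves as much as possible up to the maximum and reports 0 when even the minimum does not fit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the API.
// Returns the number of bytes to reserve for the temporary space,
// or 0 if `available` is below the minimum needed for correct results.
inline std::uint32_t ChooseTmpBufferSize(std::uint32_t minSize,
                                         std::uint32_t maxSize,
                                         std::uint32_t available) {
    assert(minSize <= maxSize);
    if (available < minSize) {
        return 0;  // not enough space: the caller must free memory or re-tile
    }
    // Anything beyond maxSize brings no further benefit.
    return std::min(available, maxSize);
}
```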
Parameters
| Parameter | Description |
|---|---|
| T | Data type of the operand. |
| Parameter | Input/Output | Description |
|---|---|---|
| dst | Output | Destination operand. For details about the definition of the LocalTensor data structure, see LocalTensor. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| src | Input | Source operand. For details about the definition of the LocalTensor data structure, see LocalTensor. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| sharedTmpBuffer | Input | Shared buffer, which is used to store temporary data generated during internal API computation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see Transpose Tiling. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| transposeType | Input | Data layout and reshape type. The type is the TransposeType enumeration type. |
| tiling | Input | Tiling information required for computation. For details about how to obtain the tiling information, see Transpose Tiling. |
Returns
None
Restrictions
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
Example
This example applies to Scenario 1 (NZ2ND, axis 1 and axis 2 interchanged).
Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}
Output Tensor { shape:[B, S, N, H/N], origin_shape:[B, S, N, H/N], format:"ND", origin_format:"ND"}
B = 1, N = 2, S = 64, H/N = 32. The input data type is half.
AscendC::TPipe *pipe = pipeIn;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> inQueueSrcVecOut;
// Each queue gets one buffer of b * n * s * hnDiv elements of type T.
pipe->InitBuffer(inQueueSrcVecIn, 1, b * n * s * hnDiv * sizeof(T));
pipe->InitBuffer(inQueueSrcVecOut, 1, b * n * s * hnDiv * sizeof(T));
// Use the overload without sharedTmpBuffer; the framework allocates the temporary space.
AscendC::Transpose(dst, src, AscendC::TransposeType::TRANSPOSE_NZ2ND_0213, this->tiling);
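For the values used in this example (B = 1, N = 2, S = 64, H/N = 32, half precision), the tensor shapes and the buffer size passed to InitBuffer can be checked on the host with a short sketch. The struct and function names below are hypothetical. The element size of 2 bytes reflects the 16-bit half type, and hnDiv is taken to mean H/N, as the buffer-size expression in the example suggests.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical holder for the sizes derived from B, N, S and H/N.
struct Nz2Nd0213Sizes {
    std::array<std::size_t, 6> nzShape;  // [B, N, H/N/16, S/16, 16, 16]
    std::array<std::size_t, 4> ndShape;  // [B, S, N, H/N]
    std::size_t bytes;                   // b * n * s * hnDiv * elemSize
};

inline Nz2Nd0213Sizes ComputeNz2Nd0213Sizes(std::size_t b, std::size_t n,
                                            std::size_t s, std::size_t hnDiv,
                                            std::size_t elemSize) {
    assert(s % 16 == 0 && hnDiv % 16 == 0);
    Nz2Nd0213Sizes r;
    r.nzShape = std::array<std::size_t, 6>{{b, n, hnDiv / 16, s / 16, 16, 16}};
    r.ndShape = std::array<std::size_t, 4>{{b, s, n, hnDiv}};
    r.bytes = b * n * s * hnDiv * elemSize;
    return r;
}
```

With the example values, the input NZ shape is [1, 2, 2, 4, 16, 16], the output ND shape is [1, 64, 2, 32], and each buffer holds 4096 half elements, that is, 8192 bytes.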