Transpose

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product 's AI Core	√
Atlas inference product 's Vector Core	x
Atlas training products	x

Function

Performs data layout and reshape operations on input data. The specific functions are as follows:

[Scenario 1: NZ2ND, axis 1 and axis 2 interchanged]

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, S, N, H/N], origin_shape:[B, S, N, H/N], format:"ND", origin_format:"ND"}

Figure 1 Data layout and reshape in scenario 1

[Scenario 2: NZ2NZ, axis 1 and axis 2 interchanged]

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, S, H/N/16, N/16, 16, 16], origin_shape:[B, S, N, H/N], format:"NZ", origin_format:"ND"}

Figure 2 Data layout and reshape in scenario 2

[Scenario 3: NZ2NZ, last axis split]

Input Tensor { shape:[B, H / 16, S / 16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Figure 3 Data layout and reshape in scenario 3

[Scenario 4: NZ2ND, last axis split]

Input Tensor { shape:[B, H / 16, S / 16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, N, S, H/N], origin_shape:[B, N, S, H/N], format:"ND", origin_format:"ND"}

Figure 4 Data layout and reshape in scenario 4

[Scenario 5: NZ2ND, last axis combination]

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, S, H], origin_shape:[B, S, H], format:"ND", origin_format:"ND"}

Figure 5 Data layout and reshape in scenario 5

[Scenario 6: NZ2NZ, last axis combination]

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, H/16, S/16, 16, 16], origin_shape:[B, S, H], format:"NZ", origin_format:"ND"}

Figure 6 Data layout and reshape in scenario 6

[Scenario 7: 2D tensor transpose]

2D tensor transpose can be performed on the UB. The values of H and W in srcShape are integer multiples of 16.

Figure 7 Data layout and reshape in scenario 7

Principles

The following figure shows the algorithm block diagram for each of the seven functional scenarios of transpose.

Figure 8 Scenario 1: NZ2ND, axis 1 and axis 2 interchanged

The computation process is as follows:

Perform cyclic processing in the H/N, N, and B directions in sequence.

First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares along the S direction into temp, and store them consecutively in temp.
Second TransDataTo5HD step: Transpose the S/16 16 × 16 squares from temp to dst. In dst, the format ND is used, the address of two consecutive rows of data from the same square on the destination operand is offset by (H/N) × N elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by 16 elements.

Figure 9 Scenario 2: NZ2NZ, axis 1 and axis 2 interchanged

The computation process is as follows:

Perform cyclic processing in the H/N, N, and B directions in sequence.

First TransDataTo5HD step: Take S/16 consecutive 16 × 16 squares along the S direction into temp, and store them consecutively in temp.
Second TransDataTo5HD step: Transpose the S/16 16 × 16 squares from temp to dst. In dst, the format NZ is used, the address of two consecutive rows of data from the same square on the destination operand is offset by (H/N) × N elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by N × 16 elements.

Figure 10 Scenario 3: NZ2NZ, last axis split

The computation process is as follows:

Perform cyclic processing in the H and B directions in sequence.

First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares into temp1 each time.
DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to temp2 for the last time.
Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format NZ is used, the address of two consecutive rows of data from the same square on the destination operand is offset by 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by S × 16 elements.

Figure 11 Scenario 4: NZ2ND, last axis split

The computation process is as follows:

Perform cyclic processing in the H and B directions in sequence.

First TransDataTo5HD step: Transpose S/16 consecutive 16 × 16 squares into temp1 each time.
DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to tmp2 for the last time.
Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format ND is used, the address of two consecutive rows of data from the same square on the destination operand is offset by (H/N + 16 – 1)/16 × 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by (H/N + 16 – 1)/16 × 16 × S elements.

Figure 12 Scenario 5: NZ2ND, last axis combination

The computation process is as follows:

Perform cyclic processing in the H and B directions in sequence.

First TransDataTo5HD step: Transpose an S × 16 square to temp1 each time.
DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to tmp2 for the last time.
Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format ND is used, the address of two consecutive rows of data from the same square on the destination operand is offset by (H + 16 – 1)/16 × 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by H/N × S elements.

Figure 13 Scenario 6: NZ2NZ, last axis combination

The computation process is as follows:

Perform cyclic processing in the H and B directions in sequence.

First TransDataTo5HD step: Transpose an S × 16 square to temp1 each time.
DataCopy step: When H/N ≤ 16, H/N × S elements are moved to temp2 each time. When H/N > 16, 16 × S elements are moved to temp2 for the first H/N/16 times, and H/N%16 × S elements are moved to tmp2 for the last time.
Second TransDataTo5HD step: Transpose the 16 × S squares from temp2 to dst. In dst, the format NZ is used, the address of two consecutive rows of data from the same square on the destination operand is offset by 16 elements, and the address of the same row of data from every two squares on the destination operand in the H direction is offset by S × 16 elements.

Figure 14 Scenario 7: two-dimensional transpose

The computation process is as follows:

Call TransDataTo5HD to transpose [H, W] to [W, H] by setting different source operand and destination operand address sequences. The format is ND in both src and dst.

Prototype

Due to the complex computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The method of obtaining the temporary space size (BufferSize) is as follows: Obtain the required maximum and minimum temporary space sizes using the GetTransposeMaxMinTmpSize API provided in Transpose Tiling. The minimum space can ensure correct functionality, while the maximum space is used to improve performance.

The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter. Therefore, there are two types of function prototypes for the Transpose API.

Pass the temporary space through the sharedTmpBuffer input parameter.

        
             template <typename T>
__aicore__ inline void Transpose(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<uint8_t> &sharedTmpBuffer, TransposeType transposeType, ConfusionTransposeTiling& tiling)

This method enables developers to allocate and manage the temporary memory space on their own, and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated or deallocated, improving the flexibility and buffer utilization.

Allocate the temporary space through the API framework.

        
             template <typename T>
__aicore__ inline void Transpose(const LocalTensor<T>& dst, const LocalTensor<T>& src, TransposeType transposeType, ConfusionTransposeTiling& tiling)

When using this method, developers do not need to allocate the space, but must reserve the required size for the space.

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the operand. For the Atlas A3 training products / Atlas A3 inference products , the supported data types are int16_t, uint16_t, half, int32_t, uint32_t, and float. For the Atlas A2 training products / Atlas A2 inference products , the supported data types are int16_t, uint16_t, half, int32_t, uint32_t, and float. For the Atlas inference product 's AI Core, the supported data types are int16_t, uint16_t, half, int32_t, uint32_t, and float.

Table 2 API parameters

Parameter

Input/Output

Description

dst

Output

Destination operand. For details about the definition of the LocalTensor data structure, see LocalTensor.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

src

Input

Source operand. For details about the definition of the LocalTensor data structure, see LocalTensor.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

sharedTmpBuffer

Input

Shared buffer, which is used to store temporary data generated during internal API computation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization. For details about how to obtain the size of the shared buffer, see Transpose Tiling.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

transposeType

Input

Data layout and reshape type. The type is the TransposeType enumeration type.

           
                enum class TransposeType : uint8_t {
    TRANSPOSE_TYPE_NONE,            // default value
    TRANSPOSE_NZ2ND_0213,           // Scenario 1: NZ2ND, axis 1 and axis 2 interchanged
    TRANSPOSE_NZ2NZ_0213,           // Scenario 2: NZ2NZ, axis 1 and axis 2 interchanged
    TRANSPOSE_NZ2NZ_012_WITH_N,     // Scenario 3: NZ2NZ, last axis split
    TRANSPOSE_NZ2ND_012_WITH_N,     // Scenario 4: NZ2ND, last axis split
    TRANSPOSE_NZ2ND_012_WITHOUT_N,  // Scenario 5: NZ2ND, last axis combination
    TRANSPOSE_NZ2NZ_012_WITHOUT_N,  // Scenario 6: NZ2NZ, last axis combination
    TRANSPOSE_ND2ND_ONLY,           // Scenario 7: two-dimensional transpose
    TRANSPOSE_ND_UB_GM,             // Not supported currently
    TRANSPOSE_GRAD_ND_UB_GM,        // Not supported currently
    TRANSPOSE_ND2ND_B16,            // Not supported currently
    TRANSPOSE_NCHW2NHWC,            // Not supported currently
    TRANSPOSE_NHWC2NCHW            // Not supported currently.
    
    };

tiling

Input

Tiling information required for computation. For details about how to obtain the tiling information, see Transpose Tiling.

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.

Example

This example is used in scenario 1 (NZ2ND, axis 1 and axis 2 interchanged).

Input Tensor { shape:[B, N, H/N/16, S/16, 16, 16], origin_shape:[B, N, S, H/N], format:"NZ", origin_format:"ND"}

Output Tensor { shape:[B, S, N, H/N], origin_shape:[B, S, N, H/N], format:"ND", origin_format:"ND"}

B = 1, N = 2, S = 64, H/N = 32. The input data type is half.

      
           AscendC::TPipe *pipe = pipeIn;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> inQueueSrcVecOut;
pipe->InitBuffer(inQueueSrcVecIn, 1, b * n * s * hnDiv * sizeof(T));
pipe->InitBuffer(inQueueSrcVecOut, 1, b * n * s * hnDiv * sizeof(T));
AscendC::Transpose(dst, src, AscendC::TransposeType::TRANSPOSE_NZ2ND_0213, this->tiling);

Parent topic: Tensor Transformation