TransDataTo5HD

Function Usage

Converts data from the NCHW format to the NC1HWC0 format. It can also transpose two-dimensional matrix data blocks. Compared with Transpose, which supports only a 16 x 16 matrix transpose, this API can process 512 bytes of data (16 data blocks) in a single repeat, supports multiple repeats, and can transpose matrices of different shapes depending on the data type. For example, when the data type is half, a 16 x 16 matrix transpose is completed in a single repeat.

The conversion rules in a single repeat are as follows:

  • When the input data type is int16_t, uint16_t, or half, each data block contains 16 numbers. The instruction loops 16 times. In each iteration, one value is read from the corresponding location in each of the 16 specified source data blocks, and the 16 values form a new data block that is stored at the destination address. As shown in the following figure, srcLocalList[0] to srcLocalList[15] indicate the 16 data blocks of the source operand.
    Figure 1 Conversion rule when the input data type is int16_t, uint16_t, or half
  • When the data type is float, int32_t, or uint32_t, each data block contains eight numbers. The instruction loops eight times. In each iteration, values are obtained from the corresponding locations in the 16 specified data blocks, and two new data blocks are formed and stored at the destination address. See the following figure.
    Figure 2 Conversion rule when the input data type is float, int32_t, or uint32_t
  • When the data type is int8_t or uint8_t, each data block contains 32 pieces of data. The instruction loops 16 times. In each iteration, values are obtained from the corresponding locations in the 16 specified data blocks, and half a data block is formed and stored at the destination address. The srcHighHalf and dstHighHalf parameters determine whether data is read from, and written to, the upper or lower half of a data block. See the following figure.
    Figure 3 Conversion rule when the input data type is int8_t or uint8_t
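
For intuition, the half-type rule above is exactly a 16 x 16 transpose across blocks: destination block k gathers element k from each of the 16 source blocks. The following is a host-side C++ sketch of that gather pattern (a hypothetical reference model for illustration only, not AscendC device code):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Reference model of the half-type rule: 16 source blocks of 16 elements
// each; destination block k collects element k from every source block,
// which amounts to a 16 x 16 matrix transpose.
std::array<std::array<uint16_t, 16>, 16>
TransposeBlocks(const std::array<std::array<uint16_t, 16>, 16>& src) {
    std::array<std::array<uint16_t, 16>, 16> dst{};
    for (int k = 0; k < 16; ++k) {      // one iteration per destination block
        for (int b = 0; b < 16; ++b) {  // gather element k of each source block
            dst[k][b] = src[b][k];
        }
    }
    return dst;
}
```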

Based on the preceding conversion rules, this API can be used to convert data to the NC1HWC0 format or to transpose a matrix. The NC1HWC0 format conversion is the more complex case, so the following describes it in detail:

When the NCHW format is converted to the NC1HWC0 format, C0 is 16 if the data type is float, int32_t, uint32_t, int16_t, uint16_t, or half, and 32 if the data type is uint8_t or int8_t. The following figure uses C0 = 16 as an example.
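
The layout change can be stated as a plain index mapping: element (n, c, h, w) of the NCHW tensor moves to (n, c / C0, h, w, c % C0) in NC1HWC0. The following host-side C++ sketch makes the mapping explicit (NchwToNc1hwc0 is a hypothetical reference helper, not part of the AscendC API; it assumes C is a multiple of C0):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference index mapping for NCHW -> NC1HWC0. Channels are split into
// C1 = C / C0 groups; the innermost axis holds C0 consecutive channels.
std::vector<float> NchwToNc1hwc0(const std::vector<float>& src,
                                 size_t N, size_t C, size_t H, size_t W,
                                 size_t C0 = 16) {
    size_t C1 = C / C0;  // assumes C is a multiple of C0
    std::vector<float> dst(N * C1 * H * W * C0);
    for (size_t n = 0; n < N; ++n)
        for (size_t c = 0; c < C; ++c)
            for (size_t h = 0; h < H; ++h)
                for (size_t w = 0; w < W; ++w) {
                    size_t c1 = c / C0, c0 = c % C0;
                    // (n, c, h, w) -> (n, c1, h, w, c0)
                    dst[(((n * C1 + c1) * H + h) * W + w) * C0 + c0] =
                        src[((n * C + c) * H + h) * W + w];
                }
    return dst;
}
```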

Prototype

  • An array of dstLocalList and srcLocalList elements of type LocalTensor:
    // The value of NCHW_CONV_ADDR_LIST_SIZE is 16.
    template <typename T>
    __aicore__ inline void TransDataTo5HD(const LocalTensor<T> (&dstLocalList)[NCHW_CONV_ADDR_LIST_SIZE], const LocalTensor<T> (&srcLocalList)[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)

  • An array of dstList and srcList elements of type uint64_t. The array elements are the address values of the corresponding LocalTensors, which you can obtain by calling GetPhyAddr of the LocalTensor. This prototype delivers better performance.
    // The value of NCHW_CONV_ADDR_LIST_SIZE is 16.
    template<typename T>
    __aicore__ inline void TransDataTo5HD(uint64_t dstList[NCHW_CONV_ADDR_LIST_SIZE], uint64_t srcList[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)

  • dstLocal and srcLocal are LocalTensors of type uint64_t that contiguously store the address values of the corresponding LocalTensors. You can obtain an address value by calling GetPhyAddr of the LocalTensor.
    template <typename T>
    __aicore__ inline void TransDataTo5HD(const LocalTensor<uint64_t>& dstLocal, const LocalTensor<uint64_t>& srcLocal, const TransDataTo5HDParams& nchwconvParams)


Parameters

Table 1 Template parameters

Parameter

Description

T

Operand data type.

For the Atlas Training Series Product, the supported data types are int8_t, uint8_t, int16_t, uint16_t, and half.

Table 2 Parameters

Parameter

Input/Output

Meaning

dstLocalList/dstList

Output

Address sequence of the destination operand.

The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T.

srcLocalList/srcList

Input

Address sequence of the source operand.

The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T.

The data type must be the same as that of dstLocalList/dstList.

dstLocal

Output

Destination operand.

The type is LocalTensor, and it contiguously stores the address values of the corresponding LocalTensors. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.

srcLocal

Input

Source operand.

The type is LocalTensor, and it contiguously stores the address values of the corresponding LocalTensors. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.

nchwconvParams

Input

Data structure for controlling TransDataTo5HD. The structure contains the read/write position controls, the number of iterations, and the address stride between adjacent iterations. For details about the definition of this data structure, see Table 3.

Table 3 Parameters in the TransDataTo5HDParams struct

Parameter

Input/Output

Meaning

dstHighHalf

Input

A bool specifying whether the data is written to the upper or lower half of each destination data block. This parameter takes effect only for the int8_t and uint8_t data types.

The options are as follows:
  • true: upper half
  • false: lower half

srcHighHalf

Input

A bool specifying whether the data is read from the upper or lower half of each source data block. This parameter takes effect only for the int8_t and uint8_t data types.

The options are as follows:
  • true: data is read from the upper half of a data block.
  • false: data is read from the lower half of a data block.

repeatTimes

Input

Number of repeated iterations. repeatTimes ∈ [0,255].

For details about this parameter, see Common Parameters.

Notes:
  • When repeatTimes = 1, the valid start address of the destination or source operand is the start of dstLocalList or srcLocalList plus dstRepStride or srcRepStride. To make the valid start address equal to the start of the dstLocalList or srcLocalList sequence, set dstRepStride or srcRepStride to 0.
  • When repeatTimes > 1, the valid start address in the first repeat is the start of dstLocalList or srcLocalList; in the second repeat, dstRepStride or srcRepStride is added, and so on for subsequent repeats.
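
The two notes above can be summarized by a small offset formula. The sketch below (RepeatStartBlocks is a hypothetical host-side helper for illustration, not part of the API) returns the start offset of a given repeat in data blocks, relative to the base of the address list:

```cpp
#include <cassert>
#include <cstdint>

// Start offset (in data blocks) of repeat number repeatIdx, counted from
// the base of dstLocalList/srcLocalList, following the notes above.
uint32_t RepeatStartBlocks(uint32_t repeatTimes, uint32_t repeatIdx,
                           uint32_t repStride) {
    if (repeatTimes == 1) {
        return repStride;          // single repeat: base + repStride
    }
    return repeatIdx * repStride;  // repeat 0 starts at the base,
                                   // each later repeat adds repStride
}
```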

dstRepStride

Input

Datablock-to-datablock stride between adjacent iterations of the destination operand, in the unit of data blocks.

For details, see repeatStride.

srcRepStride

Input

Datablock-to-datablock stride between adjacent iterations of the source operand, in the unit of data blocks.

For details, see repeatStride.

Availability

Atlas Training Series Product

Precautions

  • During the conversion from the NCHW format to the NC1HWC0 format, each element in srcLocalList/dstLocalList is configured as the start point of each HW plane.
  • For better performance with the int8_t or uint8_t data type, keep dstHighHalf and srcHighHalf fixed during the repeats along the HW direction and change them only after those repeats complete.
  • To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
    • For a single repeat (repeatTimes = 1), the source operand sequence and the destination operand sequence must be completely identical; partial overlapping is not supported, and each corresponding data block must be the same.
    • For multiple repeats (repeatTimes > 1), if there is a dependency between the source operand sequence and the destination operand sequence, that is, the destination operand of the Nth iteration is the source operand of the (N + 1)th iteration, address overlapping is not allowed.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • The addresses in dstLocal and srcLocal must be stored consecutively. For details, see the calling example.

Returns

None

Example

  • The following is an example of converting the NCHW format to the NC1HWC0 format. The input data is of the half type, the NCHW shape is (2, 32, 16, 16), and the target NC1HWC0 shape is (2, 2, 16, 16, 16).
    #include "kernel_operator.h"
    
    class KernelTransDataTo5HD {
    public:
        __aicore__ inline KernelTransDataTo5HD() {}
        __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half *)src);
            dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
            pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            AscendC::TransDataTo5HDParams transDataParams;
            transDataParams.dstHighHalf = false;
            transDataParams.srcHighHalf = false;
            transDataParams.repeatTimes = 16;
            transDataParams.dstRepStride = 16;
            transDataParams.srcRepStride = 1;
            for (int j = 0; j < 4; j++) {
                // // Call mode when the input is LocalTensor
                // AscendC::LocalTensor<half> dstLocalList[16];
                // for (int i = 0; i < 16; i++) {
                //     dstLocalList[i] = dstLocal[j * c0size * height * width + width * i];
                // }
                // AscendC::LocalTensor<half> srcLocalList[16];
                // for (int i = 0; i < 16; i++) {
                //     srcLocalList[i] = srcLocal[j * c0size * height * width + height * width * i];
                // }
                // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams);
                // (Recommended) Call mode when the input is LocalTensor address value
                uint64_t dstLocalList[16];
                for (int i = 0; i < 16; i++) {
                   dstLocalList[i] = (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr());
                }
                uint64_t srcLocalList[16];
                for (int i = 0; i < 16; i++) {
                   srcLocalList[i] = (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr());
                }
                AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams);
                // // Call mode when the input is address LocalTensor
                // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
                // for (int i = 0; i < 16; i++) {
                //     dst.SetValue(i, (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr()));
                // }
                // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
                // for (int i = 0; i < 16; i++) {
                //     src.SetValue(i, (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr()));
                // }
                // AscendC::TransDataTo5HD<half>(dst, src, transDataParams);
                // workQueueSrc1.FreeTensor(dst);
                // workQueueSrc2.FreeTensor(src);
            }
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc1;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc2;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 16384;
        int dstDataSize = 16384;
        int width = 16; // W
        int height = 16; // H
        int c0size = 16; // C0
    };
    
    extern "C" __global__ __aicore__ void vec_transdata5hd_b16_nchw2nc1hwc0(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        KernelTransDataTo5HD op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input:
    [[[[ 0.  0.  0. ...  0.  0.  0.]
       [ 0.  0.  0. ...  0.  0.  0.]
       [ 0.  0.  0. ...  0.  0.  0.]
       ...
       [ 0.  0.  0. ...  0.  0.  0.]
       [ 0.  0.  0. ...  0.  0.  0.]
       [ 0.  0.  0. ...  0.  0.  0.]]
    
      [[ 1.  1.  1. ...  1.  1.  1.]
       [ 1.  1.  1. ...  1.  1.  1.]
       [ 1.  1.  1. ...  1.  1.  1.]
       ...
       [ 1.  1.  1. ...  1.  1.  1.]
       [ 1.  1.  1. ...  1.  1.  1.]
       [ 1.  1.  1. ...  1.  1.  1.]]
    
      [[ 2.  2.  2. ...  2.  2.  2.]
       [ 2.  2.  2. ...  2.  2.  2.]
       [ 2.  2.  2. ...  2.  2.  2.]
       ...
       [ 2.  2.  2. ...  2.  2.  2.]
       [ 2.  2.  2. ...  2.  2.  2.]
       [ 2.  2.  2. ...  2.  2.  2.]]
    
      ...
    
      [[29. 29. 29. ... 29. 29. 29.]
       [29. 29. 29. ... 29. 29. 29.]
       [29. 29. 29. ... 29. 29. 29.]
       ...
       [29. 29. 29. ... 29. 29. 29.]
       [29. 29. 29. ... 29. 29. 29.]
       [29. 29. 29. ... 29. 29. 29.]]
    
      [[30. 30. 30. ... 30. 30. 30.]
       [30. 30. 30. ... 30. 30. 30.]
       [30. 30. 30. ... 30. 30. 30.]
       ...
       [30. 30. 30. ... 30. 30. 30.]
       [30. 30. 30. ... 30. 30. 30.]
       [30. 30. 30. ... 30. 30. 30.]]
    
      [[31. 31. 31. ... 31. 31. 31.]
       [31. 31. 31. ... 31. 31. 31.]
       [31. 31. 31. ... 31. 31. 31.]
       ...
       [31. 31. 31. ... 31. 31. 31.]
       [31. 31. 31. ... 31. 31. 31.]
       [31. 31. 31. ... 31. 31. 31.]]]
    
     [[[32. 32. 32. ... 32. 32. 32.]
       [32. 32. 32. ... 32. 32. 32.]
       [32. 32. 32. ... 32. 32. 32.]
       ...
       [32. 32. 32. ... 32. 32. 32.]
       [32. 32. 32. ... 32. 32. 32.]
       [32. 32. 32. ... 32. 32. 32.]]
    
      [[33. 33. 33. ... 33. 33. 33.]
       [33. 33. 33. ... 33. 33. 33.]
       [33. 33. 33. ... 33. 33. 33.]
       ...
       [33. 33. 33. ... 33. 33. 33.]
       [33. 33. 33. ... 33. 33. 33.]
       [33. 33. 33. ... 33. 33. 33.]]
    
      [[34. 34. 34. ... 34. 34. 34.]
       [34. 34. 34. ... 34. 34. 34.]
       [34. 34. 34. ... 34. 34. 34.]
       ...
       [34. 34. 34. ... 34. 34. 34.]
       [34. 34. 34. ... 34. 34. 34.]
       [34. 34. 34. ... 34. 34. 34.]]
    
      ...
    
      [[61. 61. 61. ... 61. 61. 61.]
       [61. 61. 61. ... 61. 61. 61.]
       [61. 61. 61. ... 61. 61. 61.]
       ...
       [61. 61. 61. ... 61. 61. 61.]
       [61. 61. 61. ... 61. 61. 61.]
       [61. 61. 61. ... 61. 61. 61.]]
    
      [[62. 62. 62. ... 62. 62. 62.]
       [62. 62. 62. ... 62. 62. 62.]
       [62. 62. 62. ... 62. 62. 62.]
       ...
       [62. 62. 62. ... 62. 62. 62.]
       [62. 62. 62. ... 62. 62. 62.]
       [62. 62. 62. ... 62. 62. 62.]]
    
      [[63. 63. 63. ... 63. 63. 63.]
       [63. 63. 63. ... 63. 63. 63.]
       [63. 63. 63. ... 63. 63. 63.]
       ...
       [63. 63. 63. ... 63. 63. 63.]
       [63. 63. 63. ... 63. 63. 63.]
       [63. 63. 63. ... 63. 63. 63.]]]]
    Return:
    [[[[[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       ...
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]]
    
      [[[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       ...
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]]]
    
     [[[[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       ...
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]]
    
    
      [[[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       ...
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]]]]
  • The following is an example of transposing 2D matrix data blocks with the int8_t (8-bit) data type.
    #include "kernel_operator.h"
    
    class KernelTransDataTo5HD {
    public:
        __aicore__ inline KernelTransDataTo5HD() {}
        __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ int8_t *)src);
            dstGlobal.SetGlobalBuffer((__gm__ int8_t *)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int8_t));
            pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int8_t));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.AllocTensor<int8_t>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.DeQue<int8_t>();
            AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.AllocTensor<int8_t>();
            for (int i = 0; i < dstDataSize; i++) {
                dstLocal.SetValue(i, 0);
            }
            AscendC::TransDataTo5HDParams transDataParams;
            // Write data to the upper half of dstLocalList.
            transDataParams.dstHighHalf = true;
            // Read data from the upper half of srcLocalList.
            transDataParams.srcHighHalf = true;
            transDataParams.repeatTimes = 1;
            transDataParams.dstRepStride = 0;
            transDataParams.srcRepStride = 0;
            // Call mode when the input is LocalTensor
            AscendC::LocalTensor<int8_t> dstLocalList[16];
            for (int i = 0; i < 16; i++) {
                dstLocalList[i] = dstLocal[width * i];
            }
            AscendC::LocalTensor<int8_t> srcLocalList[16];
            for (int i = 0; i < 16; i++) {
                srcLocalList[i] = srcLocal[width * i];
            }
            AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is LocalTensor address value
            // uint64_t dstLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr());
            // }
            // uint64_t srcLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr());
            // }
            // AscendC::TransDataTo5HD<int8_t>(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is address LocalTensor
            // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::TransDataTo5HD<int8_t>(dst, src, transDataParams);
            // workQueueSrc1.FreeTensor(dst);
            // workQueueSrc2.FreeTensor(src);
            outQueueDst.EnQue<int8_t>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.DeQue<int8_t>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc1;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc2;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<int8_t> srcGlobal, dstGlobal;
        int srcDataSize = 512;
        int dstDataSize = 512;
        int width = 32;
    };
    
    extern "C" __global__ __aicore__ void transdata5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        KernelTransDataTo5HD op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input:
    [[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
       18  19  20  21  22  23  24  25  26  27  28  29  30  31]
     [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
       50  51  52  53  54  55  56  57  58  59  60  61  62  63]
     [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
       82  83  84  85  86  87  88  89  90  91  92  93  94  95]
     [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
      114 115 116 117 118 119 120 121 122 123 124 125 126 127]
     [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
       18  19  20  21  22  23  24  25  26  27  28  29  30  31]
     [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
       50  51  52  53  54  55  56  57  58  59  60  61  62  63]
     [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
       82  83  84  85  86  87  88  89  90  91  92  93  94  95]
     [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
      114 115 116 117 118 119 120 121 122 123 124 125 126 127]
     [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
       18  19  20  21  22  23  24  25  26  27  28  29  30  31]
     [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
       50  51  52  53  54  55  56  57  58  59  60  61  62  63]
     [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
       82  83  84  85  86  87  88  89  90  91  92  93  94  95]
     [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
      114 115 116 117 118 119 120 121 122 123 124 125 126 127]
     [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
       18  19  20  21  22  23  24  25  26  27  28  29  30  31]
     [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
       50  51  52  53  54  55  56  57  58  59  60  61  62  63]
     [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
       82  83  84  85  86  87  88  89  90  91  92  93  94  95]
     [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
      114 115 116 117 118 119 120 121 122 123 124 125 126 127]]
    Output:
    // Read data from the upper half of the input data and write data to the upper half of the output data.
    [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 48 80 112 16 48 80 112 16 48 80 112 16 48 80 112 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 49 81 113 17 49 81 113 17 49 81 113 17 49 81 113 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 50 82 114 18 50 82 114 18 50 82 114 18 50 82 114 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 51 83 115 19 51 83 115 19 51 83 115 19 51 83 115 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 52 84 116 20 52 84 116 20 52 84 116 20 52 84 116 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 53 85 117 21 53 85 117 21 53 85 117 21 53 85 117 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 54 86 118 22 54 86 118 22 54 86 118 22 54 86 118 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 55 87 119 23 55 87 119 23 55 87 119 23 55 87 119 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 56 88 120 24 56 88 120 24 56 88 120 24 56 88 120 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 57 89 121 25 57 89 121 25 57 89 121 25 57 89 121 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 58 90 122 26 58 90 122 26 58 90 122 26 58 90 122 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 59 91 123 27 59 91 123 27 59 91 123 27 59 91 123 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 60 92 124 28 60 92 124 28 60 92 124 28 60 92 124 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29 61 93 125 29 61 93 125 29 61 93 125 29 61 93 125 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 62 94 126 30 62 94 126 30 62 94 126 30 62 94 126 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 31 63 95 127 31 63 95 127 31 63 95 127 31 63 95 127 ]]
  • Example of a half (16-bit) call that transposes a 16 x 16 2D matrix data block
    #include "kernel_operator.h"
    
    class KernelTransDataTo5HD {
    public:
        __aicore__ inline KernelTransDataTo5HD() {}
        __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half *)src);
            dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        // workQueueSrc1/workQueueSrc2 feed only the commented-out
        // "address LocalTensor" call mode in Compute() below.
        pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            AscendC::TransDataTo5HDParams transDataParams;
            transDataParams.dstHighHalf = false;
            transDataParams.srcHighHalf = false;
            transDataParams.repeatTimes = 1;
            transDataParams.dstRepStride = 0;
            transDataParams.srcRepStride = 0;
            // Call mode when the input is LocalTensor
            AscendC::LocalTensor<half> dstLocalList[16];
            for (int i = 0; i < 16; i++) {
                dstLocalList[i] = dstLocal[width * i];
            }
            AscendC::LocalTensor<half> srcLocalList[16];
            for (int i = 0; i < 16; i++) {
                srcLocalList[i] = srcLocal[width * i];
            }
            AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is LocalTensor address value
            // uint64_t dstLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr());
            // }
            // uint64_t srcLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr());
            // }
            // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is address LocalTensor
            // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::TransDataTo5HD<half>(dst, src, transDataParams);
            // workQueueSrc1.FreeTensor(dst);
            // workQueueSrc2.FreeTensor(src);
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc1;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc2;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 256;
        int dstDataSize = 256;
        int width = 16;
    };
    
    extern "C" __global__ __aicore__ void nchwconv_demo_first(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        KernelTransDataTo5HD op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input (src):
    [[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
       14.  15.]
     [ 16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.
       30.  31.]
     [ 32.  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.
       46.  47.]
     [ 48.  49.  50.  51.  52.  53.  54.  55.  56.  57.  58.  59.  60.  61.
       62.  63.]
     [ 64.  65.  66.  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.
       78.  79.]
     [ 80.  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.
       94.  95.]
     [ 96.  97.  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109.
      110. 111.]
     [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
      126. 127.]
     [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
      142. 143.]
     [144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157.
      158. 159.]
     [160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173.
      174. 175.]
     [176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189.
      190. 191.]
     [192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205.
      206. 207.]
     [208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221.
      222. 223.]
     [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237.
      238. 239.]
     [240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253.
      254. 255.]]
    
    Output (dstGm):
    [[  0.  16.  32.  48.  64.  80.  96. 112. 128. 144. 160. 176. 192. 208.
      224. 240.]
     [  1.  17.  33.  49.  65.  81.  97. 113. 129. 145. 161. 177. 193. 209.
      225. 241.]
     [  2.  18.  34.  50.  66.  82.  98. 114. 130. 146. 162. 178. 194. 210.
      226. 242.]
     [  3.  19.  35.  51.  67.  83.  99. 115. 131. 147. 163. 179. 195. 211.
      227. 243.]
     [  4.  20.  36.  52.  68.  84. 100. 116. 132. 148. 164. 180. 196. 212.
      228. 244.]
     [  5.  21.  37.  53.  69.  85. 101. 117. 133. 149. 165. 181. 197. 213.
      229. 245.]
     [  6.  22.  38.  54.  70.  86. 102. 118. 134. 150. 166. 182. 198. 214.
      230. 246.]
     [  7.  23.  39.  55.  71.  87. 103. 119. 135. 151. 167. 183. 199. 215.
      231. 247.]
     [  8.  24.  40.  56.  72.  88. 104. 120. 136. 152. 168. 184. 200. 216.
      232. 248.]
     [  9.  25.  41.  57.  73.  89. 105. 121. 137. 153. 169. 185. 201. 217.
      233. 249.]
     [ 10.  26.  42.  58.  74.  90. 106. 122. 138. 154. 170. 186. 202. 218.
      234. 250.]
     [ 11.  27.  43.  59.  75.  91. 107. 123. 139. 155. 171. 187. 203. 219.
      235. 251.]
     [ 12.  28.  44.  60.  76.  92. 108. 124. 140. 156. 172. 188. 204. 220.
      236. 252.]
     [ 13.  29.  45.  61.  77.  93. 109. 125. 141. 157. 173. 189. 205. 221.
      237. 253.]
     [ 14.  30.  46.  62.  78.  94. 110. 126. 142. 158. 174. 190. 206. 222.
      238. 254.]
     [ 15.  31.  47.  63.  79.  95. 111. 127. 143. 159. 175. 191. 207. 223.
      239. 255.]]
  • Example of an int32_t (32-bit) call that transposes a 16 x 8 2D matrix data block into an 8 x 16 result
    #include "kernel_operator.h"
    
    class KernelTransDataTo5HD {
    public:
        __aicore__ inline KernelTransDataTo5HD() {}
        __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ int32_t *)src);
            dstGlobal.SetGlobalBuffer((__gm__ int32_t *)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int32_t));
        // workQueueSrc1/workQueueSrc2 feed only the commented-out
        // "address LocalTensor" call mode in Compute() below.
        pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int32_t));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.AllocTensor<int32_t>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.DeQue<int32_t>();
            AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>();
            AscendC::TransDataTo5HDParams transDataParams;
            transDataParams.dstHighHalf = false;
            transDataParams.srcHighHalf = false;
            transDataParams.repeatTimes = 1;
            transDataParams.dstRepStride = 0;
            transDataParams.srcRepStride = 0;
            // Call mode when the input is LocalTensor
            AscendC::LocalTensor<int32_t> dstLocalList[16];
            for (int i = 0; i < 16; i++) {
                dstLocalList[i] = dstLocal[width * i];
            }
            AscendC::LocalTensor<int32_t> srcLocalList[16];
            for (int i = 0; i < 16; i++) {
                srcLocalList[i] = srcLocal[width * i];
            }
            AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is LocalTensor address value
            // uint64_t dstLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr());
            // }
            // uint64_t srcLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr());
            // }
            // AscendC::TransDataTo5HD<int32_t>(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is address LocalTensor
            // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::TransDataTo5HD<int32_t>(dst, src, transDataParams);
            // workQueueSrc1.FreeTensor(dst);
            // workQueueSrc2.FreeTensor(src);
            outQueueDst.EnQue<int32_t>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc1;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc2;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<int32_t> srcGlobal, dstGlobal;
        int srcDataSize = 128;
        int dstDataSize = 128;
        int width = 8;
    };
    
    extern "C" __global__ __aicore__ void trans5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        KernelTransDataTo5HD op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input (src):
    [  0   1   2   3   4   5   6   7
       8   9  10  11  12  13  14  15
      16  17  18  19  20  21  22  23
      24  25  26  27  28  29  30  31
      32  33  34  35  36  37  38  39
      40  41  42  43  44  45  46  47
      48  49  50  51  52  53  54  55
      56  57  58  59  60  61  62  63
      64  65  66  67  68  69  70  71
      72  73  74  75  76  77  78  79
      80  81  82  83  84  85  86  87
      88  89  90  91  92  93  94  95
      96  97  98  99 100 101 102 103
     104 105 106 107 108 109 110 111
     112 113 114 115 116 117 118 119
     120 121 122 123 124 125 126 127]
    Output (dstGm):
    [  0   8  16  24  32  40  48  56  64  72  80  88  96 104 112 120
       1   9  17  25  33  41  49  57  65  73  81  89  97 105 113 121
       2  10  18  26  34  42  50  58  66  74  82  90  98 106 114 122
       3  11  19  27  35  43  51  59  67  75  83  91  99 107 115 123
       4  12  20  28  36  44  52  60  68  76  84  92 100 108 116 124
       5  13  21  29  37  45  53  61  69  77  85  93 101 109 117 125
       6  14  22  30  38  46  54  62  70  78  86  94 102 110 118 126
       7  15  23  31  39  47  55  63  71  79  87  95 103 111 119 127]