TransDataTo5HD

Applicability

Product

Supported/Unsupported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas 200I/500 A2 inference products

Atlas inference product 's AI Core

Atlas inference product 's Vector Core

x

Atlas training products

Function Usage

Converts the NCHW format to the NC1HWC0 format. It can also be used for transposing a two-dimensional matrix data block. Compared with Transpose that supports only 16 x 16 matrix transpose, this API can process 512-byte data (16 data blocks) in a single repeat, and supports multiple repeats and matrix transposing of different shapes depending on the data type. For example, when the data type is half, 16 x 16 matrix transpose can be completed in a single repeat.

The conversion rules in a single repeat are as follows:

  • When the input data type bit width is 16 bits, each data block contains 16 numbers. The instruction is looped 16 times. In each loop, values are obtained from the corresponding positions in the specified 16 data blocks, and a new data block is formed and stored in the destination address. As shown in the following figure, srcList[0] to srcList[15] indicate 16 data blocks of the source operand.
    Figure 1 Conversion rules when the input data type bit width is 16 bits
  • When the data type bit width is 32 bits, each data block contains eight numbers. The instruction is looped eight times. In each loop, values are obtained from the corresponding positions in the specified 16 data blocks, and two new data blocks are formed and stored in the destination address. See the following figure.
    Figure 2 Conversion rules when the input data type bit width is 32 bits
  • When the data type bit width is 8 bits, each data block contains 32 numbers. The instruction is looped 16 times. In each loop, values are obtained from the corresponding positions in the specified 16 data blocks, and half of a data block is formed and stored in the destination address. Whether the read and stored data is in the upper or lower half of the data block is determined by the srcHighHalf and dstHighHalf parameters. See the following figure.
    Figure 3 Conversion rules when the input data type bit width is 8 bits

Based on the preceding conversion rules, this API is used to convert the NC1HWC0 format or transpose a matrix. The NC1HWC0 format conversion is complex. The following describes how to convert the NC1HWC0 format:

When the NCHW format is converted to the NC1HWC0 format, if the bit width of the data type is 32 bits or 16 bits, C0 is 16; if the bit width of the data type is 8 bits, C0 is 32. The following figure uses C0 = 16 as an example.

Prototype

  • dstList and srcList are arrays of type LocalTensor.
    1
    2
    3
    // The value of NCHW_CONV_ADDR_LIST_SIZE is 16.
    template <typename T>
    __aicore__ inline void TransDataTo5HD(const LocalTensor<T> (&dstList)[NCHW_CONV_ADDR_LIST_SIZE], const LocalTensor<T> (&srcList)[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)
    
  • An array consisting of dstList and srcList of type uint64_t. The array elements correspond to the LocaTensor address values. This API has better performance. You can obtain the address value by calling GetPhyAddr of the LocalTensor.
    1
    2
    3
    // The value of NCHW_CONV_ADDR_LIST_SIZE is 16.
    template<typename T>
    __aicore__ inline void TransDataTo5HD(uint64_t dstList[NCHW_CONV_ADDR_LIST_SIZE], uint64_t srcList[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)
    
  • dst and src are LocalTensors of type uint64_t, and the address values of the corresponding LocalTensors are stored continuously. You can obtain the address value by calling GetPhyAddr of the LocalTensor.
    1
    2
    template <typename T>
    __aicore__ inline void TransDataTo5HD(const LocalTensor<uint64_t>& dst, const LocalTensor<uint64_t>& src, const TransDataTo5HDParams& nchwconvParams)
    

Parameters

Table 1 Template parameters

Parameter

Description

T

Operand data type.

For the Atlas A3 training products / Atlas A3 inference products , the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float.

For the Atlas A2 training products / Atlas A2 inference products , the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float.

For the Atlas A2 training products / Atlas A2 inference products , the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float.

For the Atlas inference product 's AI Core, the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float.

For the Atlas training products , the supported data types are int8_t, uint8_t, int16_t, uint16_t, and half.

Table 2 Parameters

Parameter

Input/Output

Meaning

dstList

Output

Address sequence of the destination operand.

The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T.

srcList

Input

Address sequence of the source operand.

The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T.

The data type must be the same as that of dstList.

dst

Output

Destination operand.

The type is LocalTensor. The address value of the corresponding LocalTensor is stored continuously. The TPosition supported by the LocalTensor is VECIN/VECCALC/VECOUT. The start address of the LocalTensor must be 32-byte aligned.

src

Input

Source operand.

The type is LocalTensor. The address value of the corresponding LocalTensor is stored continuously. The TPosition supported by the LocalTensor is VECIN/VECCALC/VECOUT. The start address of the LocalTensor must be 32-byte aligned.

nchwconvParams

Input

Data structure for controlling TransdataTo5HD. The structure contains parameters such as the control parameters of the read and write positions, number of iterations, and address stride between adjacent iterations.

For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_transpose.h. Replace ${INSTALL_DIR} with the actual CANN component directory.

For details about the parameter description, see Table 3.

Table 3 Parameters in the TransDataTo5HDParams structure

Parameter

Input/Output

Meaning

dstHighHalf

Input

A bool specifying whether to store the data of dstLocalList to the upper or lower half of the data block. Only the int8_t or uint8_t type is supported.

Selected from:
  • True: upper half
  • False: lower half

srcHighHalf

Input

A bool specifying whether to read the data of srcLocalList from the upper or lower half of the data block. Only the int8_t or uint8_t type is supported.

Selected from:
  • True: Data is read from the upper half of a data block.
  • False: Data is read from the lower half of a data block.

repeatTimes

Input

Number of iteration repeats. repeatTimes ∈ [0,255].

For details about this parameter, see High-dimensional Sharding APIs.

Notes:
  • When repeatTimes is 1, the valid start position of the destination operand or source operand is the start position of the input of the dstList or srcList sequence plus dstRepStride or srcRepStride. When repeatTimes is 1, if you want the valid start position of the destination operand or source operand to be the start position of the input of the dstList or srcList sequence, set dstRepStride or srcRepStride to 0.
  • When repeatTimes is greater than 1, the valid start position of the destination operand or source operand in the first repeat is the start position of the input of the dstList or srcList sequence, and the valid start position of the destination operand or source operand in the second repeat is the start position of the input of the dstList or srcList sequence plus dstRepStride or srcRepStride. This rule applies.

dstRepStride

Input

Datablock-to-datablock stride between adjacent repeats of the destination operand, in the unit of data blocks.

For details, see repeatStride.

srcRepStride

Input

Datablock-to-datablock stride between adjacent repeats of the source operand, in the unit of data blocks.

For details, see repeatStride.

Restrictions

  • When converting the format from NCHW to NC1HWC0, each element in srcList or dstList is generally configured as the start point of each HW plane.
  • To achieve better performance, you are advised to fix dstHighHalf and srcHighHalf first, and then change dstHighHalf and srcHighHalf after the repeat in the HW direction when the bit width of the data type is 8 bits.
  • The addresses in dst and src must be stored continuously. For details, see the calling example.

Returns

None

Example

  • The following is an example of converting the NCHW format to the NC1HWC0 format. The input data is of the half type, the NCHW format is (2, 32, 16, 16), and the target NC1HWC0 format is (2, 2, 16, 16, 16).
      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
     11
     12
     13
     14
     15
     16
     17
     18
     19
     20
     21
     22
     23
     24
     25
     26
     27
     28
     29
     30
     31
     32
     33
     34
     35
     36
     37
     38
     39
     40
     41
     42
     43
     44
     45
     46
     47
     48
     49
     50
     51
     52
     53
     54
     55
     56
     57
     58
     59
     60
     61
     62
     63
     64
     65
     66
     67
     68
     69
     70
     71
     72
     73
     74
     75
     76
     77
     78
     79
     80
     81
     82
     83
     84
     85
     86
     87
     88
     89
     90
     91
     92
     93
     94
     95
     96
     97
     98
     99
    100
    #include "kernel_operator.h"
    
    class KernelTransDataTo5HD {
    public:
        __aicore__ inline KernelTransDataTo5HD() {}
        __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half *)src);
            dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
            pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            AscendC::TransDataTo5HDParams transDataParams;
            transDataParams.dstHighHalf = false;
            transDataParams.srcHighHalf = false;
            transDataParams.repeatTimes = 16;
            transDataParams.dstRepStride = 16;
            transDataParams.srcRepStride = 1;
            for(int j = 0; j < 4; j++) {
                // // Call mode when the input is LocalTensor
                // AscendC::LocalTensor<half> dstLocalList[16];
                // for (int i = 0; i < 16; i++) {
                //     dstLocalList[i] = dstLocal[j * c0size * height * width + width * i];
                // }
                // AscendC::LocalTensor<half> srcLocalList[16];
                // for (int i = 0; i < 16; i++) {
                //     srcLocalList[i] = srcLocal[j * c0size * height * width + height * width * i];
                // }
                // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams);
                // (Recommended) Call mode when the input is LocalTensor address value
                uint64_t dstLocalList[16];
                for (int i = 0; i < 16; i++) {
                   dstLocalList[i] = (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr());
                }
                uint64_t srcLocalList[16];
                for (int i = 0; i < 16; i++) {
                   srcLocalList[i] = (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr());
                }
                AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams);
                // // Call mode when the input is address LocalTensor
                // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
                // for (int i = 0; i < 16; i++) {
                //     dst.SetValue(i, (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr()));
                // }
                // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
                // for (int i = 0; i < 16; i++) {
                //     src.SetValue(i, (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr()));
                // }
                // AscendC::TransDataTo5HD<half>(dst, src, transDataParams);
                // workQueueSrc1.FreeTensor(dst);
                // workQueueSrc2.FreeTensor(src);
            }
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 16384;
        int dstDataSize = 16384;
        int width = 16; // H
        int height = 16; // W
        int c0size = 16; // C0
    };
    
    extern "C" __global__ __aicore__ void vec_transdata5hd_b16_nchw2nc1hwc0(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        KernelTransDataTo5HD op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input:
    [[[[ 0.  0.  0. ...  0.  0.  0.]
       [ 0.  0.  0. ...  0.  0.  0.]
       [ 0.  0.  0. ...  0.  0.  0.]
       ...
       [ 0.  0.  0. ...  0.  0.  0.]
       [ 0.  0.  0. ...  0.  0.  0.]
       [ 0.  0.  0. ...  0.  0.  0.]]
    
      [[ 1.  1.  1. ...  1.  1.  1.]
       [ 1.  1.  1. ...  1.  1.  1.]
       [ 1.  1.  1. ...  1.  1.  1.]
       ...
       [ 1.  1.  1. ...  1.  1.  1.]
       [ 1.  1.  1. ...  1.  1.  1.]
       [ 1.  1.  1. ...  1.  1.  1.]]
    
      [[ 2.  2.  2. ...  2.  2.  2.]
       [ 2.  2.  2. ...  2.  2.  2.]
       [ 2.  2.  2. ...  2.  2.  2.]
       ...
       [ 2.  2.  2. ...  2.  2.  2.]
       [ 2.  2.  2. ...  2.  2.  2.]
       [ 2.  2.  2. ...  2.  2.  2.]]
    
      ...
    
      [[29. 29. 29. ... 29. 29. 29.]
       [29. 29. 29. ... 29. 29. 29.]
       [29. 29. 29. ... 29. 29. 29.]
       ...
       [29. 29. 29. ... 29. 29. 29.]
       [29. 29. 29. ... 29. 29. 29.]
       [29. 29. 29. ... 29. 29. 29.]]
    
      [[30. 30. 30. ... 30. 30. 30.]
       [30. 30. 30. ... 30. 30. 30.]
       [30. 30. 30. ... 30. 30. 30.]
       ...
       [30. 30. 30. ... 30. 30. 30.]
       [30. 30. 30. ... 30. 30. 30.]
       [30. 30. 30. ... 30. 30. 30.]]
    
      [[31. 31. 31. ... 31. 31. 31.]
       [31. 31. 31. ... 31. 31. 31.]
       [31. 31. 31. ... 31. 31. 31.]
       ...
       [31. 31. 31. ... 31. 31. 31.]
       [31. 31. 31. ... 31. 31. 31.]
       [31. 31. 31. ... 31. 31. 31.]]]
    
     [[[32. 32. 32. ... 32. 32. 32.]
       [32. 32. 32. ... 32. 32. 32.]
       [32. 32. 32. ... 32. 32. 32.]
       ...
       [32. 32. 32. ... 32. 32. 32.]
       [32. 32. 32. ... 32. 32. 32.]
       [32. 32. 32. ... 32. 32. 32.]]
    
      [[33. 33. 33. ... 33. 33. 33.]
       [33. 33. 33. ... 33. 33. 33.]
       [33. 33. 33. ... 33. 33. 33.]
       ...
       [33. 33. 33. ... 33. 33. 33.]
       [33. 33. 33. ... 33. 33. 33.]
       [33. 33. 33. ... 33. 33. 33.]]
    
      [[34. 34. 34. ... 34. 34. 34.]
       [34. 34. 34. ... 34. 34. 34.]
       [34. 34. 34. ... 34. 34. 34.]
       ...
       [34. 34. 34. ... 34. 34. 34.]
       [34. 34. 34. ... 34. 34. 34.]
       [34. 34. 34. ... 34. 34. 34.]]
    
      ...
    
      [[61. 61. 61. ... 61. 61. 61.]
       [61. 61. 61. ... 61. 61. 61.]
       [61. 61. 61. ... 61. 61. 61.]
       ...
       [61. 61. 61. ... 61. 61. 61.]
       [61. 61. 61. ... 61. 61. 61.]
       [61. 61. 61. ... 61. 61. 61.]]
    
      [[62. 62. 62. ... 62. 62. 62.]
       [62. 62. 62. ... 62. 62. 62.]
       [62. 62. 62. ... 62. 62. 62.]
       ...
       [62. 62. 62. ... 62. 62. 62.]
       [62. 62. 62. ... 62. 62. 62.]
       [62. 62. 62. ... 62. 62. 62.]]
    
      [[63. 63. 63. ... 63. 63. 63.]
       [63. 63. 63. ... 63. 63. 63.]
       [63. 63. 63. ... 63. 63. 63.]
       ...
       [63. 63. 63. ... 63. 63. 63.]
       [63. 63. 63. ... 63. 63. 63.]
       [63. 63. 63. ... 63. 63. 63.]]]]
    Return:
    [[[[[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       ...
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]
    
       [[ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        ...
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]
        [ 0.  1.  2. ... 13. 14. 15.]]]
    
      [[[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       ...
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]
    
       [[16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        ...
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]
        [16. 17. 18. ... 29. 30. 31.]]]]
    
     [[[[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       ...
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]
    
       [[32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        ...
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]
        [32. 33. 34. ... 45. 46. 47.]]]
    
    
      [[[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       ...
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]
    
       [[48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        ...
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]
        [48. 49. 50. ... 61. 62. 63.]]]]]
  • Example of int8_t (8-bit) call for transposing 2D matrix data blocks
      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
     11
     12
     13
     14
     15
     16
     17
     18
     19
     20
     21
     22
     23
     24
     25
     26
     27
     28
     29
     30
     31
     32
     33
     34
     35
     36
     37
     38
     39
     40
     41
     42
     43
     44
     45
     46
     47
     48
     49
     50
     51
     52
     53
     54
     55
     56
     57
     58
     59
     60
     61
     62
     63
     64
     65
     66
     67
     68
     69
     70
     71
     72
     73
     74
     75
     76
     77
     78
     79
     80
     81
     82
     83
     84
     85
     86
     87
     88
     89
     90
     91
     92
     93
     94
     95
     96
     97
     98
     99
    100
    101
    #include "kernel_operator.h"
    
    class KernelTransDataTo5HD {
    public:
        __aicore__ inline KernelTransDataTo5HD() {}
        __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ int8_t *)src);
            dstGlobal.SetGlobalBuffer((__gm__ int8_t *)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int8_t));
            pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int8_t));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.AllocTensor<int8_t>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.DeQue<int8_t>();
            AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.AllocTensor<int8_t>();
            for(int i = 0;i<dstDataSize; i++){
                dstLocal.SetValue(i,0);
            }
            AscendC::TransDataTo5HDParams transDataParams;
            // Write data to the upper half of dstLocalList.
            transDataParams.dstHighHalf = 1;
            // Read data from the upper half of srcLocalList.                                                            
            transDataParams.srcHighHalf = 1;
            transDataParams.repeatTimes = 1;
            transDataParams.dstRepStride = 0;
            transDataParams.srcRepStride = 0;
            // Call mode when the input is LocalTensor
            AscendC::LocalTensor<int8_t> dstLocalList[16];
            for (int i = 0; i < 16; i++) {
                dstLocalList[i] = dstLocal[width * i];
            }
            AscendC::LocalTensor<int8_t> srcLocalList[16];
            for (int i = 0; i < 16; i++) {
                srcLocalList[i] = srcLocal[width * i];
            }
            AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is LocalTensor address value
            // uint64_t dstLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr());
            // }
            // uint64_t srcLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr());
            // }
            // AscendC::TransDataTo5HD<int8_t>(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is address LocalTensor
            // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::TransDataTo5HD<int8_t>(dst, src, transDataParams);
            // workQueueSrc1.FreeTensor(dst);
            // workQueueSrc2.FreeTensor(src);
            outQueueDst.EnQue<int8_t>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.DeQue<int8_t>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<int8_t> srcGlobal, dstGlobal;
        int srcDataSize = 512;
        int dstDataSize = 512;
        int width = 32;
    };
    
    extern "C" __global__ __aicore__ void transdata5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        KernelTransDataTo5HD op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input:
    [[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
       18  19  20  21  22  23  24  25  26  27  28  29  30  31]
     [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
       50  51  52  53  54  55  56  57  58  59  60  61  62  63]
     [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
       82  83  84  85  86  87  88  89  90  91  92  93  94  95]
     [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
      114 115 116 117 118 119 120 121 122 123 124 125 126 127]
     [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
       18  19  20  21  22  23  24  25  26  27  28  29  30  31]
     [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
       50  51  52  53  54  55  56  57  58  59  60  61  62  63]
     [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
       82  83  84  85  86  87  88  89  90  91  92  93  94  95]
     [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
      114 115 116 117 118 119 120 121 122 123 124 125 126 127]
     [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
       18  19  20  21  22  23  24  25  26  27  28  29  30  31]
     [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
       50  51  52  53  54  55  56  57  58  59  60  61  62  63]
     [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
       82  83  84  85  86  87  88  89  90  91  92  93  94  95]
     [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
      114 115 116 117 118 119 120 121 122 123 124 125 126 127]
     [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
       18  19  20  21  22  23  24  25  26  27  28  29  30  31]
     [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
       50  51  52  53  54  55  56  57  58  59  60  61  62  63]
     [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
       82  83  84  85  86  87  88  89  90  91  92  93  94  95]
     [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
      114 115 116 117 118 119 120 121 122 123 124 125 126 127]]
    Output:
    // Read data from the upper half of the input data and write data to the upper half of the output data.
    [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 48 80 112 16 48 80 112 16 48 80 112 16 48 80 112 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 49 81 113 17 49 81 113 17 49 81 113 17 49 81 113 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 50 82 114 18 50 82 114 18 50 82 114 18 50 82 114 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 51 83 115 19 51 83 115 19 51 83 115 19 51 83 115 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 52 84 116 20 52 84 116 20 52 84 116 20 52 84 116 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 53 85 117 21 53 85 117 21 53 85 117 21 53 85 117 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 54 86 118 22 54 86 118 22 54 86 118 22 54 86 118 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 55 87 119 23 55 87 119 23 55 87 119 23 55 87 119 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 56 88 120 24 56 88 120 24 56 88 120 24 56 88 120 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 57 89 121 25 57 89 121 25 57 89 121 25 57 89 121 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 58 90 122 26 58 90 122 26 58 90 122 26 58 90 122 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 59 91 123 27 59 91 123 27 59 91 123 27 59 91 123 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 60 92 124 28 60 92 124 28 60 92 124 28 60 92 124 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29 61 93 125 29 61 93 125 29 61 93 125 29 61 93 125 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 62 94 126 30 62 94 126 30 62 94 126 30 62 94 126 ]
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 31 63 95 127 31 63 95 127 31 63 95 127 31 63 95 127 ]]
  • Example of half (16-bit) call for transposing 2D matrix data blocks
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    94
    95
    96
    #include "kernel_operator.h"
    
    class KernelTransDataTo5HD {
    public:
        __aicore__ inline KernelTransDataTo5HD() {}
        __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half *)src);
            dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
            pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            AscendC::TransDataTo5HDParams transDataParams;
            transDataParams.dstHighHalf = false;
            transDataParams.srcHighHalf = false;
            transDataParams.repeatTimes = 1;
            transDataParams.dstRepStride = 0;
            transDataParams.srcRepStride = 0;
            // Call mode when the input is LocalTensor
            AscendC::LocalTensor<half> dstLocalList[16];
            for (int i = 0; i < 16; i++) {
                dstLocalList[i] = dstLocal[width * i];
            }
            AscendC::LocalTensor<half> srcLocalList[16];
            for (int i = 0; i < 16; i++) {
                srcLocalList[i] = srcLocal[width * i];
            }
            AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is LocalTensor address value
            // uint64_t dstLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr());
            // }
            // uint64_t srcLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr());
            // }
            // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is address LocalTensor
            // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::TransDataTo5HD<half>(dst, src, transDataParams);
            // workQueueSrc1.FreeTensor(dst);
            // workQueueSrc2.FreeTensor(src);
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 256;
        int dstDataSize = 256;
        int width = 16;
    };
    
    extern "C" __global__ __aicore__ void nchwconv_demo_first(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        KernelTransDataTo5HD op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input (src):
    [[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
       14.  15.]
     [ 16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.
       30.  31.]
     [ 32.  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.
       46.  47.]
     [ 48.  49.  50.  51.  52.  53.  54.  55.  56.  57.  58.  59.  60.  61.
       62.  63.]
     [ 64.  65.  66.  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.
       78.  79.]
     [ 80.  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.
       94.  95.]
     [ 96.  97.  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109.
      110. 111.]
     [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
      126. 127.]
     [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
      142. 143.]
     [144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157.
      158. 159.]
     [160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173.
      174. 175.]
     [176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189.
      190. 191.]
     [192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205.
      206. 207.]
     [208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221.
      222. 223.]
     [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237.
      238. 239.]
     [240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253.
      254. 255.]]
    
    Output (dstGm):
    [[  0.  16.  32.  48.  64.  80.  96. 112. 128. 144. 160. 176. 192. 208.
      224. 240.]
     [  1.  17.  33.  49.  65.  81.  97. 113. 129. 145. 161. 177. 193. 209.
      225. 241.]
     [  2.  18.  34.  50.  66.  82.  98. 114. 130. 146. 162. 178. 194. 210.
      226. 242.]
     [  3.  19.  35.  51.  67.  83.  99. 115. 131. 147. 163. 179. 195. 211.
      227. 243.]
     [  4.  20.  36.  52.  68.  84. 100. 116. 132. 148. 164. 180. 196. 212.
      228. 244.]
     [  5.  21.  37.  53.  69.  85. 101. 117. 133. 149. 165. 181. 197. 213.
      229. 245.]
     [  6.  22.  38.  54.  70.  86. 102. 118. 134. 150. 166. 182. 198. 214.
      230. 246.]
     [  7.  23.  39.  55.  71.  87. 103. 119. 135. 151. 167. 183. 199. 215.
      231. 247.]
     [  8.  24.  40.  56.  72.  88. 104. 120. 136. 152. 168. 184. 200. 216.
      232. 248.]
     [  9.  25.  41.  57.  73.  89. 105. 121. 137. 153. 169. 185. 201. 217.
      233. 249.]
     [ 10.  26.  42.  58.  74.  90. 106. 122. 138. 154. 170. 186. 202. 218.
      234. 250.]
     [ 11.  27.  43.  59.  75.  91. 107. 123. 139. 155. 171. 187. 203. 219.
      235. 251.]
     [ 12.  28.  44.  60.  76.  92. 108. 124. 140. 156. 172. 188. 204. 220.
      236. 252.]
     [ 13.  29.  45.  61.  77.  93. 109. 125. 141. 157. 173. 189. 205. 221.
      237. 253.]
     [ 14.  30.  46.  62.  78.  94. 110. 126. 142. 158. 174. 190. 206. 222.
      238. 254.]
     [ 15.  31.  47.  63.  79.  95. 111. 127. 143. 159. 175. 191. 207. 223.
      239. 255.]]
  • Example of int32_t (32-bit) call for transposing 2D matrix data blocks
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    94
    95
    96
    #include "kernel_operator.h"
    
    class KernelTransDataTo5HD {
    public:
        __aicore__ inline KernelTransDataTo5HD() {}
        __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ int32_t *)src);
            dstGlobal.SetGlobalBuffer((__gm__ int32_t *)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int32_t));
            pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int32_t));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.AllocTensor<int32_t>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.DeQue<int32_t>();
            AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>();
            AscendC::TransDataTo5HDParams transDataParams;
            transDataParams.dstHighHalf = false;
            transDataParams.srcHighHalf = false;
            transDataParams.repeatTimes = 1;
            transDataParams.dstRepStride = 0;
            transDataParams.srcRepStride = 0;
            // Call mode when the input is LocalTensor
            AscendC::LocalTensor<int32_t> dstLocalList[16];
            for (int i = 0; i < 16; i++) {
                dstLocalList[i] = dstLocal[width * i];
            }
            AscendC::LocalTensor<int32_t> srcLocalList[16];
            for (int i = 0; i < 16; i++) {
                srcLocalList[i] = srcLocal[width * i];
            }
            AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is LocalTensor address value
            // uint64_t dstLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr());
            // }
            // uint64_t srcLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //    srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr());
            // }
            // AscendC::TransDataTo5HD<int32_t>(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is address LocalTensor
            // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr()));
            // }
            // AscendC::TransDataTo5HD<int32_t>(dst, src, transDataParams);
            // workQueueSrc1.FreeTensor(dst);
            // workQueueSrc2.FreeTensor(src);
            outQueueDst.EnQue<int32_t>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<int32_t> srcGlobal, dstGlobal;
        int srcDataSize = 128;
        int dstDataSize = 128;
        int width = 8;
    };
    
    extern "C" __global__ __aicore__ void trans5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        KernelTransDataTo5HD op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input (src):
    [  0   1   2   3   4   5   6   7  
       8   9  10  11  12  13  14  15  
      16  17  18  19  20  21  22  23
      24  25  26  27  28  29  30  31 
      32  33  34  35  36  37  38  39
      40  41  42  43  44  45  46  47  
      48  49  50  51  52  53  54  55
      56  57  58  59  60  61  62  63
      64  65  66  67  68  69  70  71
      72  73  74  75  76  77  78  79  
      80  81  82  83  84  85  86  87
      88  89  90  91  92  93  94  95  
      96  97  98  99  100 101 102 103
      104 105 106 107 108 109 110 111 
      112 113 114 115 116 117 118 119
      120 121 122 123 124 125  126 127]
    Output (dstGm):
     [0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120
      1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 
      2 10 18 26 34 42 50 58 66 74 82 90 98 106 114 122 
      3 11 19 27 35 43 51 59 67 75 83 91 99 107 115 123 
      4 12 20 28 36 44 52 60 68 76 84 92 100 108 116 124 
      5 13 21 29 37 45 53 61 69 77 85 93 101 109 117 125 
      6 14 22 30 38 46 54 62 70 78 86 94 102 110 118 126
      7 15 23 31 39 47 55 63 71 79 87 95 103 111 119 127]