TransDataTo5HD

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product 's AI Core	√
Atlas inference product 's Vector Core	x
Atlas training products	√

Function

Converts the NCHW format to the NC1HWC0 format. It can also be used for transposing a two-dimensional matrix data block. For matrix transpose operations, the Transpose API only supports 16 × 16 matrices. This API can process 512 bytes (16 data blocks) per repeat iteration. It supports matrix transposition for various shapes depending on the data type. For example, a 16 × 16 matrix can be transposed within one repeat when the data type is half. Multiple repeat iterations are also supported.

The conversion rules in a single repeat are as follows:

When the input data type is 16-bit wide, each data block holds 16 elements. The instruction runs 16 internal loops. In each iteration, it fetches values from the corresponding positions of the 16 specified data blocks, assembles a new data block, and writes it to the destination address. As shown in the following figure, srcList[0] to srcList[15] indicate 16 data blocks of the source operand.
Figure 1 Conversion rules when the input data type is 16-bit wide
When the input data type is 32-bit wide, each data block holds 8 elements. The instruction runs 8 internal loops. In each iteration, it fetches values from the corresponding positions of the 16 specified data blocks, assembles two new data blocks, and writes them to the destination address. See the following figure.
Figure 2 Conversion rules when the input data type is 32-bit wide
When the input data type is 8-bit wide, each data block holds 32 elements. The instruction runs 16 internal loops. In each iteration, it fetches values from the corresponding positions of the 16 specified data blocks, assembles half a data block, and writes it to the destination address. The srcHighHalf and dstHighHalf parameters determine whether reading and writing operate on the upper half or lower half of the data block. See the following figure.
Figure 3 Conversion rules when the input data type is 8-bit wide

Based on the preceding conversion rules, this API is used to convert the NC1HWC0 format or transpose a matrix. The NC1HWC0 format conversion is complex. The following describes how to convert the NC1HWC0 format:

When the NCHW format is converted to the NC1HWC0 format, if the bit width of the data type is 32 or 16 bits, C0 is 16; if the bit width of the data type is 8 bits, C0 is 32. The following figure uses C0 = 16 as an example.

Prototype

dstList and srcList are arrays of the LocalTensor type.

        
             // The value of NCHW_CONV_ADDR_LIST_SIZE is 16.
template <typename T>
__aicore__ inline void TransDataTo5HD(const LocalTensor<T> (&dstList)[NCHW_CONV_ADDR_LIST_SIZE], const LocalTensor<T> (&srcList)[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)

An array consisting of dstList and srcList of type uint64_t. The array elements correspond to the LocalTensor address values. This API has better performance. You can obtain the address value by calling GetPhyAddr of the LocalTensor.

        
             // The value of NCHW_CONV_ADDR_LIST_SIZE is 16.
template<typename T>
__aicore__ inline void TransDataTo5HD(uint64_t dstList[NCHW_CONV_ADDR_LIST_SIZE], uint64_t srcList[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)

dst and src are LocalTensors of the uint64_t type, which consecutively store the address values of the corresponding LocalTensor. You can obtain the address value by calling GetPhyAddr of the LocalTensor.

       
            template <typename T>
__aicore__ inline void TransDataTo5HD(const LocalTensor<uint64_t>& dst, const LocalTensor<uint64_t>& src, const TransDataTo5HDParams& nchwconvParams)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Operand data type. For Atlas A3 training products / Atlas A3 inference products , the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float. For Atlas A2 training products / Atlas A2 inference products , the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float. For Atlas 200I/500 A2 inference products , the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float. For the Atlas inference product 's AI Core, the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float. For the Atlas training products , the supported data types are int8_t, uint8_t, int16_t, uint16_t, and half.

**Table 2** Parameters
Parameter	Input/Output	Meaning
dstList	Output	Address sequence of the destination operand. The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T.
srcList	Input	Address sequence of the source operand. The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T. Its data type must be the same as that of dstList.
dst	Output	Destination operand. The type is LocalTensor. The address value of the corresponding LocalTensor is stored continuously. The TPosition supported by the LocalTensor is VECIN/VECCALC/VECOUT. The start address of the LocalTensor must be 32-byte aligned.
src	Input	Source operand. The type is LocalTensor. The address value of the corresponding LocalTensor is stored continuously. The TPosition supported by the LocalTensor is VECIN/VECCALC/VECOUT. The start address of the LocalTensor must be 32-byte aligned.
nchwconvParams	Input	Data structure for controlling TransdataTo5HD. The structure contains parameters including read/write position controls, iteration count, and address stride between adjacent iterations. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_transpose.h. Replace ${INSTALL_DIR} with the CANN installation path. For details about the parameters, see Table 3.

**Table 3** Parameters in the TransDataTo5HDParams structure
Parameter	Input/Output	Meaning
dstHighHalf	Input	Whether to store the data of dstList to the upper or lower half of the data block. Only the int8_t and uint8_t data types are supported. Selected from: True: upper half False: lower half
srcHighHalf	Input	Whether to read the data of srcList from the upper or lower half of the data block. Only the int8_t and uint8_t data types are supported. Selected from: True: Data is read from the upper half of a data block. False: Data is read from the lower half of a data block.
repeatTimes	Input	Number of iteration repeats. repeatTimes ∈ [0,255]. For details about this parameter, see High-dimensional Sharding APIs. Notes: When repeatTimes is 1, the effective start position of destination and source operands is calculated as the input start position of dstList/srcList plus dstRepStride/srcRepStride. Set dstRepStride/srcRepStride to 0 if you need the effective start position to match the original start position of dstList/srcList. When repeatTimes is greater than 1, the effective start position of the destination and source operands in the first repeat is the input start position of dstList/srcList. The second repeat requires adding dstRepStride/srcRepStride. This rule applies.
dstRepStride	Input	Datablock-to-datablock stride between adjacent repeats of the destination operand, in the unit of data blocks. For details, see repeatStride.
srcRepStride	Input	Datablock-to-datablock stride between adjacent repeats of the source operand, in the unit of data blocks. For details, see repeatStride.

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
For details about the constraints on operand address overlapping, see General Address Overlapping Restrictions.

For conversion from the NCHW format to NC1HWC0, each element in srcList and dstList shall be configured as the starting address of each HW plane.
For optimal performance with 8-bit data types, keep dstHighHalf and srcHighHalf unchanged during HW-direction repeats, then modify them afterwards.
All addresses in dst and src must be stored contiguously. For details, see the calling example.

Returns

None

Example

The following is an example of converting the NCHW format to the NC1HWC0 format. The input data is of the half type, the NCHW format is (2, 32, 16, 16), and the target NC1HWC0 format is (2, 2, 16, 16, 16).

        
         
           
           
             #include "kernel_operator.h"

class KernelTransDataTo5HD {
public:
    __aicore__ inline KernelTransDataTo5HD() {}
    __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half *)src);
        dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
        pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::TransDataTo5HDParams transDataParams;
        transDataParams.dstHighHalf = false;
        transDataParams.srcHighHalf = false;
        transDataParams.repeatTimes = 16;
        transDataParams.dstRepStride = 16;
        transDataParams.srcRepStride = 1;
        for(int j = 0; j < 4; j++) {
            // // Call mode when the input is LocalTensor
            // AscendC::LocalTensor<half> dstLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //     dstLocalList[i] = dstLocal[j * c0size * height * width + width * i];
            // }
            // AscendC::LocalTensor<half> srcLocalList[16];
            // for (int i = 0; i < 16; i++) {
            //     srcLocalList[i] = srcLocal[j * c0size * height * width + height * width * i];
            // }
            // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams);
            // (Recommended) Call mode when the input is LocalTensor address value
            uint64_t dstLocalList[16];
            for (int i = 0; i < 16; i++) {
               dstLocalList[i] = (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr());
            }
            uint64_t srcLocalList[16];
            for (int i = 0; i < 16; i++) {
               srcLocalList[i] = (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr());
            }
            AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams);
            // // Call mode when the input is address LocalTensor
            // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     dst.SetValue(i, (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr()));
            // }
            // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
            // for (int i = 0; i < 16; i++) {
            //     src.SetValue(i, (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr()));
            // }
            // AscendC::TransDataTo5HD<half>(dst, src, transDataParams);
            // workQueueSrc1.FreeTensor(dst);
            // workQueueSrc2.FreeTensor(src);
        }
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
    int srcDataSize = 16384;
    int dstDataSize = 16384;
    int width = 16; // H
    int height = 16; // W
    int c0size = 16; // C0
};

extern "C" __global__ __aicore__ void vec_transdata5hd_b16_nchw2nc1hwc0(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
{
    KernelTransDataTo5HD op;
    op.Init(src, dstGm);
    op.Process();
}

            

          

        
       

Input:
[[[[ 0.  0.  0. ...  0.  0.  0.]
   [ 0.  0.  0. ...  0.  0.  0.]
   [ 0.  0.  0. ...  0.  0.  0.]
   ...
   [ 0.  0.  0. ...  0.  0.  0.]
   [ 0.  0.  0. ...  0.  0.  0.]
   [ 0.  0.  0. ...  0.  0.  0.]]

  [[ 1.  1.  1. ...  1.  1.  1.]
   [ 1.  1.  1. ...  1.  1.  1.]
   [ 1.  1.  1. ...  1.  1.  1.]
   ...
   [ 1.  1.  1. ...  1.  1.  1.]
   [ 1.  1.  1. ...  1.  1.  1.]
   [ 1.  1.  1. ...  1.  1.  1.]]

  [[ 2.  2.  2. ...  2.  2.  2.]
   [ 2.  2.  2. ...  2.  2.  2.]
   [ 2.  2.  2. ...  2.  2.  2.]
   ...
   [ 2.  2.  2. ...  2.  2.  2.]
   [ 2.  2.  2. ...  2.  2.  2.]
   [ 2.  2.  2. ...  2.  2.  2.]]

  ...

  [[29. 29. 29. ... 29. 29. 29.]
   [29. 29. 29. ... 29. 29. 29.]
   [29. 29. 29. ... 29. 29. 29.]
   ...
   [29. 29. 29. ... 29. 29. 29.]
   [29. 29. 29. ... 29. 29. 29.]
   [29. 29. 29. ... 29. 29. 29.]]

  [[30. 30. 30. ... 30. 30. 30.]
   [30. 30. 30. ... 30. 30. 30.]
   [30. 30. 30. ... 30. 30. 30.]
   ...
   [30. 30. 30. ... 30. 30. 30.]
   [30. 30. 30. ... 30. 30. 30.]
   [30. 30. 30. ... 30. 30. 30.]]

  [[31. 31. 31. ... 31. 31. 31.]
   [31. 31. 31. ... 31. 31. 31.]
   [31. 31. 31. ... 31. 31. 31.]
   ...
   [31. 31. 31. ... 31. 31. 31.]
   [31. 31. 31. ... 31. 31. 31.]
   [31. 31. 31. ... 31. 31. 31.]]]

 [[[32. 32. 32. ... 32. 32. 32.]
   [32. 32. 32. ... 32. 32. 32.]
   [32. 32. 32. ... 32. 32. 32.]
   ...
   [32. 32. 32. ... 32. 32. 32.]
   [32. 32. 32. ... 32. 32. 32.]
   [32. 32. 32. ... 32. 32. 32.]]

  [[33. 33. 33. ... 33. 33. 33.]
   [33. 33. 33. ... 33. 33. 33.]
   [33. 33. 33. ... 33. 33. 33.]
   ...
   [33. 33. 33. ... 33. 33. 33.]
   [33. 33. 33. ... 33. 33. 33.]
   [33. 33. 33. ... 33. 33. 33.]]

  [[34. 34. 34. ... 34. 34. 34.]
   [34. 34. 34. ... 34. 34. 34.]
   [34. 34. 34. ... 34. 34. 34.]
   ...
   [34. 34. 34. ... 34. 34. 34.]
   [34. 34. 34. ... 34. 34. 34.]
   [34. 34. 34. ... 34. 34. 34.]]

  ...

  [[61. 61. 61. ... 61. 61. 61.]
   [61. 61. 61. ... 61. 61. 61.]
   [61. 61. 61. ... 61. 61. 61.]
   ...
   [61. 61. 61. ... 61. 61. 61.]
   [61. 61. 61. ... 61. 61. 61.]
   [61. 61. 61. ... 61. 61. 61.]]

  [[62. 62. 62. ... 62. 62. 62.]
   [62. 62. 62. ... 62. 62. 62.]
   [62. 62. 62. ... 62. 62. 62.]
   ...
   [62. 62. 62. ... 62. 62. 62.]
   [62. 62. 62. ... 62. 62. 62.]
   [62. 62. 62. ... 62. 62. 62.]]

  [[63. 63. 63. ... 63. 63. 63.]
   [63. 63. 63. ... 63. 63. 63.]
   [63. 63. 63. ... 63. 63. 63.]
   ...
   [63. 63. 63. ... 63. 63. 63.]
   [63. 63. 63. ... 63. 63. 63.]
   [63. 63. 63. ... 63. 63. 63.]]]]
Return:
[[[[[ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    ...
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]]

   [[ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    ...
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]]

   [[ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    ...
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]]

   ...

   [[ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    ...
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]]

   [[ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    ...
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]]

   [[ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    ...
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]
    [ 0.  1.  2. ... 13. 14. 15.]]]

  [[[16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    ...
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]]

   [[16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    ...
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]]

   [[16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    ...
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]]

   ...

   [[16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    ...
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]]

   [[16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    ...
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]]

   [[16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    ...
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]
    [16. 17. 18. ... 29. 30. 31.]]]]

 [[[[32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    ...
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]]

   [[32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    ...
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]]

   [[32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    ...
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]]

   ...

   [[32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    ...
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]]

   [[32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    ...
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]]

   [[32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    ...
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]
    [32. 33. 34. ... 45. 46. 47.]]]


  [[[48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    ...
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]]

   [[48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    ...
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]]

   [[48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    ...
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]]

   ...

   [[48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    ...
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]]

   [[48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    ...
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]]

   [[48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    ...
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]
    [48. 49. 50. ... 61. 62. 63.]]]]]

Example of int8_t (8-bit) call for transposing 2D matrix data blocks

        
         
           
           
             #include "kernel_operator.h"

class KernelTransDataTo5HD {
public:
    __aicore__ inline KernelTransDataTo5HD() {}
    __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ int8_t *)src);
        dstGlobal.SetGlobalBuffer((__gm__ int8_t *)dstGm);
        pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int8_t));
        pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int8_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.AllocTensor<int8_t>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.DeQue<int8_t>();
        AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.AllocTensor<int8_t>();
        for(int i = 0;i<dstDataSize; i++){
            dstLocal.SetValue(i,0);
        }
        AscendC::TransDataTo5HDParams transDataParams;
        // Write data to the upper half of dstLocalList.
        transDataParams.dstHighHalf = 1;
        // Read data from the upper half of srcLocalList.                                                            
        transDataParams.srcHighHalf = 1;
        transDataParams.repeatTimes = 1;
        transDataParams.dstRepStride = 0;
        transDataParams.srcRepStride = 0;
        // Call mode when the input is LocalTensor
        AscendC::LocalTensor<int8_t> dstLocalList[16];
        for (int i = 0; i < 16; i++) {
            dstLocalList[i] = dstLocal[width * i];
        }
        AscendC::LocalTensor<int8_t> srcLocalList[16];
        for (int i = 0; i < 16; i++) {
            srcLocalList[i] = srcLocal[width * i];
        }
        AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams);
        // // Call mode when the input is LocalTensor address value
        // uint64_t dstLocalList[16];
        // for (int i = 0; i < 16; i++) {
        //    dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr());
        // }
        // uint64_t srcLocalList[16];
        // for (int i = 0; i < 16; i++) {
        //    srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr());
        // }
        // AscendC::TransDataTo5HD<int8_t>(dstLocalList, srcLocalList, transDataParams);
        // // Call mode when the input is address LocalTensor
        // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
        // for (int i = 0; i < 16; i++) {
        //     dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr()));
        // }
        // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
        // for (int i = 0; i < 16; i++) {
        //     src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr()));
        // }
        // AscendC::TransDataTo5HD<int8_t>(dst, src, transDataParams);
        // workQueueSrc1.FreeTensor(dst);
        // workQueueSrc2.FreeTensor(src);
        outQueueDst.EnQue<int8_t>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.DeQue<int8_t>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<int8_t> srcGlobal, dstGlobal;
    int srcDataSize = 512;
    int dstDataSize = 512;
    int width = 32;
};

extern "C" __global__ __aicore__ void transdata5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
{
    KernelTransDataTo5HD op;
    op.Init(src, dstGm);
    op.Process();
}

            

          

        
       

Input:
[[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
   18  19  20  21  22  23  24  25  26  27  28  29  30  31]
 [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
   50  51  52  53  54  55  56  57  58  59  60  61  62  63]
 [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
   82  83  84  85  86  87  88  89  90  91  92  93  94  95]
 [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
  114 115 116 117 118 119 120 121 122 123 124 125 126 127]
 [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
   18  19  20  21  22  23  24  25  26  27  28  29  30  31]
 [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
   50  51  52  53  54  55  56  57  58  59  60  61  62  63]
 [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
   82  83  84  85  86  87  88  89  90  91  92  93  94  95]
 [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
  114 115 116 117 118 119 120 121 122 123 124 125 126 127]
 [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
   18  19  20  21  22  23  24  25  26  27  28  29  30  31]
 [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
   50  51  52  53  54  55  56  57  58  59  60  61  62  63]
 [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
   82  83  84  85  86  87  88  89  90  91  92  93  94  95]
 [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
  114 115 116 117 118 119 120 121 122 123 124 125 126 127]
 [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
   18  19  20  21  22  23  24  25  26  27  28  29  30  31]
 [ 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49
   50  51  52  53  54  55  56  57  58  59  60  61  62  63]
 [ 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81
   82  83  84  85  86  87  88  89  90  91  92  93  94  95]
 [ 96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
  114 115 116 117 118 119 120 121 122 123 124 125 126 127]]
Output:
// Read data from the upper half of the input data and write data to the upper half of the output data.
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 48 80 112 16 48 80 112 16 48 80 112 16 48 80 112 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 49 81 113 17 49 81 113 17 49 81 113 17 49 81 113 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 50 82 114 18 50 82 114 18 50 82 114 18 50 82 114 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 51 83 115 19 51 83 115 19 51 83 115 19 51 83 115 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 52 84 116 20 52 84 116 20 52 84 116 20 52 84 116 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 53 85 117 21 53 85 117 21 53 85 117 21 53 85 117 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 54 86 118 22 54 86 118 22 54 86 118 22 54 86 118 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 55 87 119 23 55 87 119 23 55 87 119 23 55 87 119 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 56 88 120 24 56 88 120 24 56 88 120 24 56 88 120 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 57 89 121 25 57 89 121 25 57 89 121 25 57 89 121 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 58 90 122 26 58 90 122 26 58 90 122 26 58 90 122 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 59 91 123 27 59 91 123 27 59 91 123 27 59 91 123 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 60 92 124 28 60 92 124 28 60 92 124 28 60 92 124 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29 61 93 125 29 61 93 125 29 61 93 125 29 61 93 125 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 62 94 126 30 62 94 126 30 62 94 126 30 62 94 126 ]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 31 63 95 127 31 63 95 127 31 63 95 127 31 63 95 127 ]]

Example of half (16-bit) call for transposing 2D matrix data blocks

        
         
           
           
             #include "kernel_operator.h"

class KernelTransDataTo5HD {
public:
    __aicore__ inline KernelTransDataTo5HD() {}
    __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half *)src);
        dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
        pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::TransDataTo5HDParams transDataParams;
        transDataParams.dstHighHalf = false;
        transDataParams.srcHighHalf = false;
        transDataParams.repeatTimes = 1;
        transDataParams.dstRepStride = 0;
        transDataParams.srcRepStride = 0;
        // Call mode when the input is LocalTensor
        AscendC::LocalTensor<half> dstLocalList[16];
        for (int i = 0; i < 16; i++) {
            dstLocalList[i] = dstLocal[width * i];
        }
        AscendC::LocalTensor<half> srcLocalList[16];
        for (int i = 0; i < 16; i++) {
            srcLocalList[i] = srcLocal[width * i];
        }
        AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams);
        // // Call mode when the input is LocalTensor address value
        // uint64_t dstLocalList[16];
        // for (int i = 0; i < 16; i++) {
        //    dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr());
        // }
        // uint64_t srcLocalList[16];
        // for (int i = 0; i < 16; i++) {
        //    srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr());
        // }
        // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams);
        // // Call mode when the input is address LocalTensor
        // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
        // for (int i = 0; i < 16; i++) {
        //     dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr()));
        // }
        // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
        // for (int i = 0; i < 16; i++) {
        //     src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr()));
        // }
        // AscendC::TransDataTo5HD<half>(dst, src, transDataParams);
        // workQueueSrc1.FreeTensor(dst);
        // workQueueSrc2.FreeTensor(src);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
    int srcDataSize = 256;
    int dstDataSize = 256;
    int width = 16;
};

extern "C" __global__ __aicore__ void nchwconv_demo_first(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
{
    KernelTransDataTo5HD op;
    op.Init(src, dstGm);
    op.Process();
}

            

          

        
       

Input (src):
[[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
   14.  15.]
 [ 16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.
   30.  31.]
 [ 32.  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.
   46.  47.]
 [ 48.  49.  50.  51.  52.  53.  54.  55.  56.  57.  58.  59.  60.  61.
   62.  63.]
 [ 64.  65.  66.  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.
   78.  79.]
 [ 80.  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.
   94.  95.]
 [ 96.  97.  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109.
  110. 111.]
 [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
  126. 127.]
 [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
  142. 143.]
 [144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157.
  158. 159.]
 [160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173.
  174. 175.]
 [176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189.
  190. 191.]
 [192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205.
  206. 207.]
 [208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221.
  222. 223.]
 [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237.
  238. 239.]
 [240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253.
  254. 255.]]

Output (dstGm):
[[  0.  16.  32.  48.  64.  80.  96. 112. 128. 144. 160. 176. 192. 208.
  224. 240.]
 [  1.  17.  33.  49.  65.  81.  97. 113. 129. 145. 161. 177. 193. 209.
  225. 241.]
 [  2.  18.  34.  50.  66.  82.  98. 114. 130. 146. 162. 178. 194. 210.
  226. 242.]
 [  3.  19.  35.  51.  67.  83.  99. 115. 131. 147. 163. 179. 195. 211.
  227. 243.]
 [  4.  20.  36.  52.  68.  84. 100. 116. 132. 148. 164. 180. 196. 212.
  228. 244.]
 [  5.  21.  37.  53.  69.  85. 101. 117. 133. 149. 165. 181. 197. 213.
  229. 245.]
 [  6.  22.  38.  54.  70.  86. 102. 118. 134. 150. 166. 182. 198. 214.
  230. 246.]
 [  7.  23.  39.  55.  71.  87. 103. 119. 135. 151. 167. 183. 199. 215.
  231. 247.]
 [  8.  24.  40.  56.  72.  88. 104. 120. 136. 152. 168. 184. 200. 216.
  232. 248.]
 [  9.  25.  41.  57.  73.  89. 105. 121. 137. 153. 169. 185. 201. 217.
  233. 249.]
 [ 10.  26.  42.  58.  74.  90. 106. 122. 138. 154. 170. 186. 202. 218.
  234. 250.]
 [ 11.  27.  43.  59.  75.  91. 107. 123. 139. 155. 171. 187. 203. 219.
  235. 251.]
 [ 12.  28.  44.  60.  76.  92. 108. 124. 140. 156. 172. 188. 204. 220.
  236. 252.]
 [ 13.  29.  45.  61.  77.  93. 109. 125. 141. 157. 173. 189. 205. 221.
  237. 253.]
 [ 14.  30.  46.  62.  78.  94. 110. 126. 142. 158. 174. 190. 206. 222.
  238. 254.]
 [ 15.  31.  47.  63.  79.  95. 111. 127. 143. 159. 175. 191. 207. 223.
  239. 255.]]

Example of int32_t (32-bit) call for transposing 2D matrix data blocks

        
         
           
           
             #include "kernel_operator.h"

class KernelTransDataTo5HD {
public:
    __aicore__ inline KernelTransDataTo5HD() {}
    __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ int32_t *)src);
        dstGlobal.SetGlobalBuffer((__gm__ int32_t *)dstGm);
        pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int32_t));
        pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t));
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int32_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.AllocTensor<int32_t>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.DeQue<int32_t>();
        AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>();
        AscendC::TransDataTo5HDParams transDataParams;
        transDataParams.dstHighHalf = false;
        transDataParams.srcHighHalf = false;
        transDataParams.repeatTimes = 1;
        transDataParams.dstRepStride = 0;
        transDataParams.srcRepStride = 0;
        // Call mode when the input is LocalTensor
        AscendC::LocalTensor<int32_t> dstLocalList[16];
        for (int i = 0; i < 16; i++) {
            dstLocalList[i] = dstLocal[width * i];
        }
        AscendC::LocalTensor<int32_t> srcLocalList[16];
        for (int i = 0; i < 16; i++) {
            srcLocalList[i] = srcLocal[width * i];
        }
        AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams);
        // // Call mode when the input is LocalTensor address value
        // uint64_t dstLocalList[16];
        // for (int i = 0; i < 16; i++) {
        //    dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr());
        // }
        // uint64_t srcLocalList[16];
        // for (int i = 0; i < 16; i++) {
        //    srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr());
        // }
        // AscendC::TransDataTo5HD<int32_t>(dstLocalList, srcLocalList, transDataParams);
        // // Call mode when the input is address LocalTensor
        // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>();
        // for (int i = 0; i < 16; i++) {
        //     dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr()));
        // }
        // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>();
        // for (int i = 0; i < 16; i++) {
        //     src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr()));
        // }
        // AscendC::TransDataTo5HD<int32_t>(dst, src, transDataParams);
        // workQueueSrc1.FreeTensor(dst);
        // workQueueSrc2.FreeTensor(src);
        outQueueDst.EnQue<int32_t>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<int32_t> srcGlobal, dstGlobal;
    int srcDataSize = 128;
    int dstDataSize = 128;
    int width = 8;
};

extern "C" __global__ __aicore__ void trans5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm)
{
    KernelTransDataTo5HD op;
    op.Init(src, dstGm);
    op.Process();
}

            

          

        
       

Input (src):
[  0   1   2   3   4   5   6   7  
   8   9  10  11  12  13  14  15  
  16  17  18  19  20  21  22  23
  24  25  26  27  28  29  30  31 
  32  33  34  35  36  37  38  39
  40  41  42  43  44  45  46  47  
  48  49  50  51  52  53  54  55
  56  57  58  59  60  61  62  63
  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  
  80  81  82  83  84  85  86  87
  88  89  90  91  92  93  94  95  
  96  97  98  99  100 101 102 103
  104 105 106 107 108 109 110 111 
  112 113 114 115 116 117 118 119
  120 121 122 123 124 125  126 127]
Output (dstGm):
 [0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120
  1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 
  2 10 18 26 34 42 50 58 66 74 82 90 98 106 114 122 
  3 11 19 27 35 43 51 59 67 75 83 91 99 107 115 123 
  4 12 20 28 36 44 52 60 68 76 84 92 100 108 116 124 
  5 13 21 29 37 45 53 61 69 77 85 93 101 109 117 125 
  6 14 22 30 38 46 54 62 70 78 86 94 102 110 118 126
  7 15 23 31 39 47 55 63 71 79 87 95 103 111 119 127]

Parent topic: Data Conversion