TransDataTo5HD
Function Usage
Converts the NCHW format to the NC1HWC0 format. It can also be used for transposing a two-dimensional matrix data block. Compared with Transpose that supports only 16 x 16 matrix transpose, this API can process 512-byte data (16 data blocks) in a single repeat, and supports multiple repeats and matrix transposing of different shapes depending on the data type. For example, when the data type is half, 16 x 16 matrix transpose can be completed in a single repeat.
The conversion rules in a single repeat are as follows:
- When the input data type is int16_t, uint16_t, or half, each data block contains 16 numbers. The instruction loops for 16 times. In each iteration, values are obtained from the corresponding locations in the specified 16 data blocks and a new data block is formed and stored in the destination address. As shown in the following figure, srcLocalList[0]-srcLocalList[15] indicates 16 data blocks of the source operand.
Figure 1 Conversion rule when the input data type is int16_t, uint16_t, or half
- When the data type is float, int32_t, or uint32_t, each data block contains eight data blocks. The instruction loops for eight times. In each iteration, values are obtained from the corresponding locations in the specified 16 data blocks and two new data blocks are formed and stored in the destination address. See the following figure.
Figure 2 Conversion rule when the input data type is float, int32_t, or uint32_t
- When the data type is int8_t or uint8_t, each data block contains 32 pieces of data. The instruction loops for 16 times. In each iteration, values are obtained from the corresponding locations in the specified 16 data blocks and a half data block is formed and stored in the destination address. The srcHighHalf and dstHighHalf parameters determine whether data is read and stored in the high half or low half of a data block. See the following figure.
Figure 3 Conversion rule when the input data type is int8_t or uint8_t
Based on the preceding conversion rules, this API is used to convert the NC1HWC0 format or transpose a matrix. The NC1HWC0 format conversion is complex. The following describes how to convert the NC1HWC0 format:
When the NCHW format is converted to the NC1HWC0 format, if the data type is float, int32_t, uint32_t, int16_t, unint16_t or half, C0 is 16; if the data type is uint8_t or int8_t, C0 is 32. The following figure uses C0 = 16 as an example.

Prototype
- An array consisting of dstLocalList and srcLocalList of type LocaTensor:
1 2 3
// The value of NCHW_CONV_ADDR_LIST_SIZE is 16. template <typename T> __aicore__ inline void TransDataTo5HD(const LocalTensor<T> (&dstLocalList)[NCHW_CONV_ADDR_LIST_SIZE], const LocalTensor<T> (&srcLocalList)[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)
- An array consisting of dstList and srcList of type uint64_t. The array elements correspond to the LocaTensor address values. This API has better performance. You can obtain the address value by calling GetPhyAddr of the LocalTensor.
1 2 3
// The value of NCHW_CONV_ADDR_LIST_SIZE is 16. template<typename T> __aicore__ inline void TransDataTo5HD(uint64_t dstList[NCHW_CONV_ADDR_LIST_SIZE], uint64_t srcList[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)
- dstLocal and srcLocal are LocalTensors of the uint64_t type. They continuously store the address values of the corresponding LocalTensors. You can obtain the address value by calling GetPhyAddr of the LocalTensor.
1 2
template <typename T> __aicore__ inline void TransDataTo5HD(const LocalTensor<uint64_t>& dstLocal, const LocalTensor<uint64_t>& srcLocal, const TransDataTo5HDParams& nchwconvParams)
Parameters
|
Parameter |
Description |
|---|---|
|
T |
Operand data type. For the |
|
Parameter |
Input/Output |
Meaning |
|---|---|---|
|
dstLocalList/dstList |
Output |
Address sequence of the destination operand. The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T. |
|
srcLocalList/srcList |
Input |
Address sequence of the source operand. The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T. The data type must be the same as that of dstLocalList/dstList. |
|
dstLocal |
Output |
Destination operand. The type is LocalTensor. The address value of the corresponding LocalTensor is stored continuously. The TPosition supported by the LocalTensor is VECIN/VECCALC/VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
|
srcLocal |
Input |
Source operand. The type is LocalTensor. The address value of the corresponding LocalTensor is stored continuously. The TPosition supported by the LocalTensor is VECIN/VECCALC/VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
|
nchwconvParams |
Input |
Data structure for controlling TransdataTo5HD. The structure contains parameters such as the control parameters of the read and write positions, number of iterations, and address stride between adjacent iterations. For details about the definition of this data structure, see Table 3. |
Availability
Precautions
- During the conversion from the NCHW format to the NC1HWC0 format, each element in srcLocalList/dstLocalList is configured as the start point of each HW plane.
- For better performance, it is recommended that dstHighHalf and srcHighHalf be fixed when the data type is int8_t or uint8_t, and be changed after the repeat in the HW direction.
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- For a single repeat (repeatTimes = 1), the source operand sequence and the target operand sequence must be completely the same. Partial overlapping is not supported. Instead, each block must be the same.
- For multiple repeats (repeatTimes > 1), if there is a dependency between the source operand sequence and the destination operand sequence, that is, the destination operand of the Nth iteration is the source operand of the (N + 1)th iteration, address overlapping is not allowed.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
- The addresses in dstLocal and srcLocal must be stored consecutively. For details, see the calling example.
Returns
None
Example
- The following is an example of converting the NCHW format to the NC1HWC0 format. The input data is of the half type, the NCHW format is (2, 32, 16, 16), and the target NC1HWC0 format is (2, 2, 16, 16, 16).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
#include "kernel_operator.h" class KernelTransDataTo5HD { public: __aicore__ inline KernelTransDataTo5HD() {} __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { srcGlobal.SetGlobalBuffer((__gm__ half *)src); dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm); pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half)); pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>(); AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>(); AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>(); AscendC::TransDataTo5HDParams transDataParams; transDataParams.dstHighHalf = false; transDataParams.srcHighHalf = false; transDataParams.repeatTimes = 16; transDataParams.dstRepStride = 16; transDataParams.srcRepStride = 1; for(int j = 0; j < 4; j++) { // // Call mode when the input is LocalTensor // AscendC::LocalTensor<half> dstLocalList[16]; // for (int i = 0; i < 16; i++) { // dstLocalList[i] = dstLocal[j * c0size * height * width + width * i]; // } // AscendC::LocalTensor<half> srcLocalList[16]; // for (int i = 0; i < 16; i++) { // srcLocalList[i] = srcLocal[j * c0size * height * width + height * width * i]; // } // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams); // (Recommended) Call mode when the input is LocalTensor address value uint64_t dstLocalList[16]; for (int i = 0; i < 16; i++) { dstLocalList[i] = (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr()); } uint64_t srcLocalList[16]; for (int i = 0; i < 16; i++) { srcLocalList[i] = (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr()); } AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is address LocalTensor // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // dst.SetValue(i, (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr())); // } // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // src.SetValue(i, (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr())); // } // AscendC::TransDataTo5HD<half>(dst, src, transDataParams); // workQueueSrc1.FreeTensor(dst); // workQueueSrc2.FreeTensor(src); } outQueueDst.EnQue<half>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc1; AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc2; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<half> srcGlobal, dstGlobal; int srcDataSize = 16384; int dstDataSize = 16384; int width = 16; // H int height = 16; // W int c0size = 16; // C0 }; extern "C" __global__ __aicore__ void vec_transdata5hd_b16_nchw2nc1hwc0(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { KernelTransDataTo5HD op; op.Init(src, dstGm); op.Process(); }
Input: [[[[ 0. 0. 0. ... 0. 0. 0.] [ 0. 0. 0. ... 0. 0. 0.] [ 0. 0. 0. ... 0. 0. 0.] ... [ 0. 0. 0. ... 0. 0. 0.] [ 0. 0. 0. ... 0. 0. 0.] [ 0. 0. 0. ... 0. 0. 0.]] [[ 1. 1. 1. ... 1. 1. 1.] [ 1. 1. 1. ... 1. 1. 1.] [ 1. 1. 1. ... 1. 1. 1.] ... [ 1. 1. 1. ... 1. 1. 1.] [ 1. 1. 1. ... 1. 1. 1.] [ 1. 1. 1. ... 1. 1. 1.]] [[ 2. 2. 2. ... 2. 2. 2.] [ 2. 2. 2. ... 2. 2. 2.] [ 2. 2. 2. ... 2. 2. 2.] ... [ 2. 2. 2. ... 2. 2. 2.] [ 2. 2. 2. ... 2. 2. 2.] [ 2. 2. 2. ... 2. 2. 2.]] ... [[29. 29. 29. ... 29. 29. 29.] [29. 29. 29. ... 29. 29. 29.] [29. 29. 29. ... 29. 29. 29.] ... [29. 29. 29. ... 29. 29. 29.] [29. 29. 29. ... 29. 29. 29.] [29. 29. 29. ... 29. 29. 29.]] [[30. 30. 30. ... 30. 30. 30.] [30. 30. 30. ... 30. 30. 30.] [30. 30. 30. ... 30. 30. 30.] ... [30. 30. 30. ... 30. 30. 30.] [30. 30. 30. ... 30. 30. 30.] [30. 30. 30. ... 30. 30. 30.]] [[31. 31. 31. ... 31. 31. 31.] [31. 31. 31. ... 31. 31. 31.] [31. 31. 31. ... 31. 31. 31.] ... [31. 31. 31. ... 31. 31. 31.] [31. 31. 31. ... 31. 31. 31.] [31. 31. 31. ... 31. 31. 31.]]] [[[32. 32. 32. ... 32. 32. 32.] [32. 32. 32. ... 32. 32. 32.] [32. 32. 32. ... 32. 32. 32.] ... [32. 32. 32. ... 32. 32. 32.] [32. 32. 32. ... 32. 32. 32.] [32. 32. 32. ... 32. 32. 32.]] [[33. 33. 33. ... 33. 33. 33.] [33. 33. 33. ... 33. 33. 33.] [33. 33. 33. ... 33. 33. 33.] ... [33. 33. 33. ... 33. 33. 33.] [33. 33. 33. ... 33. 33. 33.] [33. 33. 33. ... 33. 33. 33.]] [[34. 34. 34. ... 34. 34. 34.] [34. 34. 34. ... 34. 34. 34.] [34. 34. 34. ... 34. 34. 34.] ... [34. 34. 34. ... 34. 34. 34.] [34. 34. 34. ... 34. 34. 34.] [34. 34. 34. ... 34. 34. 34.]] ... [[61. 61. 61. ... 61. 61. 61.] [61. 61. 61. ... 61. 61. 61.] [61. 61. 61. ... 61. 61. 61.] ... [61. 61. 61. ... 61. 61. 61.] [61. 61. 61. ... 61. 61. 61.] [61. 61. 61. ... 61. 61. 61.]] [[62. 62. 62. ... 62. 62. 62.] [62. 62. 62. ... 62. 62. 62.] [62. 62. 62. ... 62. 62. 62.] ... [62. 62. 62. ... 62. 62. 62.] [62. 62. 62. ... 62. 62. 62.] [62. 62. 62. ... 62. 62. 62.]] [[63. 63. 63. ... 63. 63. 63.] [63. 63. 63. ... 63. 63. 63.] [63. 63. 63. ... 63. 63. 63.] ... [63. 63. 63. ... 63. 63. 63.] [63. 63. 63. ... 63. 63. 63.] [63. 63. 63. ... 63. 63. 63.]]]] Return: [[[[[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] ... [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]]] [[[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] ... [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]]]] [[[[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] ... [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]]] [[[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] ... [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]]]]]
- Example of int8_t (8-bit) call for transposing 2D matrix data blocks
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
#include "kernel_operator.h" class KernelTransDataTo5HD { public: __aicore__ inline KernelTransDataTo5HD() {} __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { srcGlobal.SetGlobalBuffer((__gm__ int8_t *)src); dstGlobal.SetGlobalBuffer((__gm__ int8_t *)dstGm); pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int8_t)); pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int8_t)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.AllocTensor<int8_t>(); AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.DeQue<int8_t>(); AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.AllocTensor<int8_t>(); for(int i = 0;i<dstDataSize; i++){ dstLocal.SetValue(i,0); } AscendC::TransDataTo5HDParams transDataParams; // Write data to the upper half of dstLocalList. transDataParams.dstHighHalf = 1; // Read data from the upper half of srcLocalList. transDataParams.srcHighHalf = 1; transDataParams.repeatTimes = 1; transDataParams.dstRepStride = 0; transDataParams.srcRepStride = 0; // Call mode when the input is LocalTensor AscendC::LocalTensor<int8_t> dstLocalList[16]; for (int i = 0; i < 16; i++) { dstLocalList[i] = dstLocal[width * i]; } AscendC::LocalTensor<int8_t> srcLocalList[16]; for (int i = 0; i < 16; i++) { srcLocalList[i] = srcLocal[width * i]; } AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is LocalTensor address value // uint64_t dstLocalList[16]; // for (int i = 0; i < 16; i++) { // dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr()); // } // uint64_t srcLocalList[16]; // for (int i = 0; i < 16; i++) { // srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr()); // } // AscendC::TransDataTo5HD<int8_t>(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is address LocalTensor // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr())); // } // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr())); // } // AscendC::TransDataTo5HD<int8_t>(dst, src, transDataParams); // workQueueSrc1.FreeTensor(dst); // workQueueSrc2.FreeTensor(src); outQueueDst.EnQue<int8_t>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.DeQue<int8_t>(); AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc1; AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc2; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<int8_t> srcGlobal, dstGlobal; int srcDataSize = 512; int dstDataSize = 512; int width = 32; }; extern "C" __global__ __aicore__ void transdata5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { KernelTransDataTo5HD op; op.Init(src, dstGm); op.Process(); }
Input: [[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31] [ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63] [ 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95] [ 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127] [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31] [ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63] [ 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95] [ 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127] [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31] [ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63] [ 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95] [ 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127] [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31] [ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63] [ 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95] [ 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127]] Output: // Read data from the upper half of the input data and write data to the upper half of the output data. [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 48 80 112 16 48 80 112 16 48 80 112 16 48 80 112 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 49 81 113 17 49 81 113 17 49 81 113 17 49 81 113 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 50 82 114 18 50 82 114 18 50 82 114 18 50 82 114 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 51 83 115 19 51 83 115 19 51 83 115 19 51 83 115 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 52 84 116 20 52 84 116 20 52 84 116 20 52 84 116 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 53 85 117 21 53 85 117 21 53 85 117 21 53 85 117 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 54 86 118 22 54 86 118 22 54 86 118 22 54 86 118 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 55 87 119 23 55 87 119 23 55 87 119 23 55 87 119 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 56 88 120 24 56 88 120 24 56 88 120 24 56 88 120 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 57 89 121 25 57 89 121 25 57 89 121 25 57 89 121 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 58 90 122 26 58 90 122 26 58 90 122 26 58 90 122 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 59 91 123 27 59 91 123 27 59 91 123 27 59 91 123 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 60 92 124 28 60 92 124 28 60 92 124 28 60 92 124 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29 61 93 125 29 61 93 125 29 61 93 125 29 61 93 125 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 62 94 126 30 62 94 126 30 62 94 126 30 62 94 126 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 31 63 95 127 31 63 95 127 31 63 95 127 31 63 95 127 ]]
- Example of half (16-bit) call for transposing 2D matrix data blocks
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
#include "kernel_operator.h" class KernelTransDataTo5HD { public: __aicore__ inline KernelTransDataTo5HD() {} __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { srcGlobal.SetGlobalBuffer((__gm__ half *)src); dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm); pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half)); pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>(); AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>(); AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>(); AscendC::TransDataTo5HDParams transDataParams; transDataParams.dstHighHalf = false; transDataParams.srcHighHalf = false; transDataParams.repeatTimes = 1; transDataParams.dstRepStride = 0; transDataParams.srcRepStride = 0; // Call mode when the input is LocalTensor AscendC::LocalTensor<half> dstLocalList[16]; for (int i = 0; i < 16; i++) { dstLocalList[i] = dstLocal[width * i]; } AscendC::LocalTensor<half> srcLocalList[16]; for (int i = 0; i < 16; i++) { srcLocalList[i] = srcLocal[width * i]; } AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is LocalTensor address value // uint64_t dstLocalList[16]; // for (int i = 0; i < 16; i++) { // dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr()); // } // uint64_t srcLocalList[16]; // for (int i = 0; i < 16; i++) { // srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr()); // } // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is address LocalTensor // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr())); // } // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr())); // } // AscendC::TransDataTo5HD<half>(dst, src, transDataParams); // workQueueSrc1.FreeTensor(dst); // workQueueSrc2.FreeTensor(src); outQueueDst.EnQue<half>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc1; AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc2; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<half> srcGlobal, dstGlobal; int srcDataSize = 256; int dstDataSize = 256; int width = 16; }; extern "C" __global__ __aicore__ void nchwconv_demo_first(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { KernelTransDataTo5HD op; op.Init(src, dstGm); op.Process(); }
Input (src): [[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15.] [ 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [ 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47.] [ 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63.] [ 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79.] [ 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95.] [ 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.] [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127.] [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141. 142. 143.] [144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157. 158. 159.] [160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. 174. 175.] [176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 190. 191.] [192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207.] [208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223.] [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237. 238. 239.] [240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253. 254. 255.]] Output (dstGm): [[ 0. 16. 32. 48. 64. 80. 96. 112. 128. 144. 160. 176. 192. 208. 224. 240.] [ 1. 17. 33. 49. 65. 81. 97. 113. 129. 145. 161. 177. 193. 209. 225. 241.] [ 2. 18. 34. 50. 66. 82. 98. 114. 130. 146. 162. 178. 194. 210. 226. 242.] [ 3. 19. 35. 51. 67. 83. 99. 115. 131. 147. 163. 179. 195. 211. 227. 243.] [ 4. 20. 36. 52. 68. 84. 100. 116. 132. 148. 164. 180. 196. 212. 228. 244.] [ 5. 21. 37. 53. 69. 85. 101. 117. 133. 149. 165. 181. 197. 213. 229. 245.] [ 6. 22. 38. 54. 70. 86. 102. 118. 134. 150. 166. 182. 198. 214. 230. 246.] [ 7. 23. 39. 55. 71. 87. 103. 119. 135. 151. 167. 183. 199. 215. 231. 247.] [ 8. 24. 40. 56. 72. 88. 104. 120. 136. 152. 168. 184. 200. 216. 232. 248.] [ 9. 25. 41. 57. 73. 89. 105. 121. 137. 153. 169. 185. 201. 217. 233. 249.] [ 10. 26. 42. 58. 74. 90. 106. 122. 138. 154. 170. 186. 202. 218. 234. 250.] [ 11. 27. 43. 59. 75. 91. 107. 123. 139. 155. 171. 187. 203. 219. 235. 251.] [ 12. 28. 44. 60. 76. 92. 108. 124. 140. 156. 172. 188. 204. 220. 236. 252.] [ 13. 29. 45. 61. 77. 93. 109. 125. 141. 157. 173. 189. 205. 221. 237. 253.] [ 14. 30. 46. 62. 78. 94. 110. 126. 142. 158. 174. 190. 206. 222. 238. 254.] [ 15. 31. 47. 63. 79. 95. 111. 127. 143. 159. 175. 191. 207. 223. 239. 255.]]
- Example of int32_t (32-bit) call for transposing 2D matrix data blocks
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
#include "kernel_operator.h" class KernelTransDataTo5HD { public: __aicore__ inline KernelTransDataTo5HD() {} __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { srcGlobal.SetGlobalBuffer((__gm__ int32_t *)src); dstGlobal.SetGlobalBuffer((__gm__ int32_t *)dstGm); pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int32_t)); pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int32_t)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.AllocTensor<int32_t>(); AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.DeQue<int32_t>(); AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>(); AscendC::TransDataTo5HDParams transDataParams; transDataParams.dstHighHalf = false; transDataParams.srcHighHalf = false; transDataParams.repeatTimes = 1; transDataParams.dstRepStride = 0; transDataParams.srcRepStride = 0; // Call mode when the input is LocalTensor AscendC::LocalTensor<int32_t> dstLocalList[16]; for (int i = 0; i < 16; i++) { dstLocalList[i] = dstLocal[width * i]; } AscendC::LocalTensor<int32_t> srcLocalList[16]; for (int i = 0; i < 16; i++) { srcLocalList[i] = srcLocal[width * i]; } AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is LocalTensor address value // uint64_t dstLocalList[16]; // for (int i = 0; i < 16; i++) { // dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr()); // } // uint64_t srcLocalList[16]; // for (int i = 0; i < 16; i++) { // srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr()); // } // AscendC::TransDataTo5HD<int32_t>(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is address LocalTensor // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr())); // } // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr())); // } // AscendC::TransDataTo5HD<int32_t>(dst, src, transDataParams); // workQueueSrc1.FreeTensor(dst); // workQueueSrc2.FreeTensor(src); outQueueDst.EnQue<int32_t>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>(); AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc1; AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueueSrc2; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<int32_t> srcGlobal, dstGlobal; int srcDataSize = 128; int dstDataSize = 128; int width = 8; }; extern "C" __global__ __aicore__ void trans5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { KernelTransDataTo5HD op; op.Init(src, dstGm); op.Process(); }
Input (src): [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127] Output (dstGm): [0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 2 10 18 26 34 42 50 58 66 74 82 90 98 106 114 122 3 11 19 27 35 43 51 59 67 75 83 91 99 107 115 123 4 12 20 28 36 44 52 60 68 76 84 92 100 108 116 124 5 13 21 29 37 45 53 61 69 77 85 93 101 109 117 125 6 14 22 30 38 46 54 62 70 78 86 94 102 110 118 126 7 15 23 31 39 47 55 63 71 79 87 95 103 111 119 127]