TransDataTo5HD
Applicability
|
Product |
Supported/Unsupported |
|---|---|
|
|
√ |
|
|
√ |
|
|
√ |
|
|
√ |
|
|
x |
|
|
√ |
Function Usage
Converts the NCHW format to the NC1HWC0 format. It can also be used for transposing a two-dimensional matrix data block. Compared with Transpose that supports only 16 x 16 matrix transpose, this API can process 512-byte data (16 data blocks) in a single repeat, and supports multiple repeats and matrix transposing of different shapes depending on the data type. For example, when the data type is half, 16 x 16 matrix transpose can be completed in a single repeat.
The conversion rules in a single repeat are as follows:
- When the input data type bit width is 16 bits, each data block contains 16 numbers. The instruction is looped 16 times. In each loop, values are obtained from the corresponding positions in the specified 16 data blocks, and a new data block is formed and stored in the destination address. As shown in the following figure, srcList[0] to srcList[15] indicate 16 data blocks of the source operand.
Figure 1 Conversion rules when the input data type bit width is 16 bits
- When the data type bit width is 32 bits, each data block contains eight numbers. The instruction is looped eight times. In each loop, values are obtained from the corresponding positions in the specified 16 data blocks, and two new data blocks are formed and stored in the destination address. See the following figure.
Figure 2 Conversion rules when the input data type bit width is 32 bits
- When the data type bit width is 8 bits, each data block contains 32 numbers. The instruction is looped 16 times. In each loop, values are obtained from the corresponding positions in the specified 16 data blocks, and half of a data block is formed and stored in the destination address. Whether the read and stored data is in the upper or lower half of the data block is determined by the srcHighHalf and dstHighHalf parameters. See the following figure.
Figure 3 Conversion rules when the input data type bit width is 8 bits
Based on the preceding conversion rules, this API is used to convert the NC1HWC0 format or transpose a matrix. The NC1HWC0 format conversion is complex. The following describes how to convert the NC1HWC0 format:
When the NCHW format is converted to the NC1HWC0 format, if the bit width of the data type is 32 bits or 16 bits, C0 is 16; if the bit width of the data type is 8 bits, C0 is 32. The following figure uses C0 = 16 as an example.

Prototype
- dstList and srcList are arrays of type LocalTensor.
1 2 3
// The value of NCHW_CONV_ADDR_LIST_SIZE is 16. template <typename T> __aicore__ inline void TransDataTo5HD(const LocalTensor<T> (&dstList)[NCHW_CONV_ADDR_LIST_SIZE], const LocalTensor<T> (&srcList)[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)
- An array consisting of dstList and srcList of type uint64_t. The array elements correspond to the LocaTensor address values. This API has better performance. You can obtain the address value by calling GetPhyAddr of the LocalTensor.
1 2 3
// The value of NCHW_CONV_ADDR_LIST_SIZE is 16. template<typename T> __aicore__ inline void TransDataTo5HD(uint64_t dstList[NCHW_CONV_ADDR_LIST_SIZE], uint64_t srcList[NCHW_CONV_ADDR_LIST_SIZE], const TransDataTo5HDParams& nchwconvParams)
- dst and src are LocalTensors of type uint64_t, and the address values of the corresponding LocalTensors are stored continuously. You can obtain the address value by calling GetPhyAddr of the LocalTensor.
1 2
template <typename T> __aicore__ inline void TransDataTo5HD(const LocalTensor<uint64_t>& dst, const LocalTensor<uint64_t>& src, const TransDataTo5HDParams& nchwconvParams)
Parameters
|
Parameter |
Description |
|---|---|
|
T |
Operand data type. For the For the For the For the For the |
|
Parameter |
Input/Output |
Meaning |
|---|---|---|
|
dstList |
Output |
Address sequence of the destination operand. The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T. |
|
srcList |
Input |
Address sequence of the source operand. The type is LocalTensor or the address value of the LocalTensor. The TPosition supported by the LocalTensor is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For details about the supported data types, see the description of the template parameter T. The data type must be the same as that of dstList. |
|
dst |
Output |
Destination operand. The type is LocalTensor. The address value of the corresponding LocalTensor is stored continuously. The TPosition supported by the LocalTensor is VECIN/VECCALC/VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
|
src |
Input |
Source operand. The type is LocalTensor. The address value of the corresponding LocalTensor is stored continuously. The TPosition supported by the LocalTensor is VECIN/VECCALC/VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
|
nchwconvParams |
Input |
Data structure for controlling TransdataTo5HD. The structure contains parameters such as the control parameters of the read and write positions, number of iterations, and address stride between adjacent iterations. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_transpose.h. Replace ${INSTALL_DIR} with the actual CANN component directory. For details about the parameter description, see Table 3. |
Restrictions
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- For details about the operand address overlapping restrictions, see General Address Overlap Restrictions.
- When converting the format from NCHW to NC1HWC0, each element in srcList or dstList is generally configured as the start point of each HW plane.
- To achieve better performance, you are advised to fix dstHighHalf and srcHighHalf first, and then change dstHighHalf and srcHighHalf after the repeat in the HW direction when the bit width of the data type is 8 bits.
- The addresses in dst and src must be stored continuously. For details, see the calling example.
Returns
None
Example
- The following is an example of converting the NCHW format to the NC1HWC0 format. The input data is of the half type, the NCHW format is (2, 32, 16, 16), and the target NC1HWC0 format is (2, 2, 16, 16, 16).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
#include "kernel_operator.h" class KernelTransDataTo5HD { public: __aicore__ inline KernelTransDataTo5HD() {} __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { srcGlobal.SetGlobalBuffer((__gm__ half *)src); dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm); pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half)); pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>(); AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>(); AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>(); AscendC::TransDataTo5HDParams transDataParams; transDataParams.dstHighHalf = false; transDataParams.srcHighHalf = false; transDataParams.repeatTimes = 16; transDataParams.dstRepStride = 16; transDataParams.srcRepStride = 1; for(int j = 0; j < 4; j++) { // // Call mode when the input is LocalTensor // AscendC::LocalTensor<half> dstLocalList[16]; // for (int i = 0; i < 16; i++) { // dstLocalList[i] = dstLocal[j * c0size * height * width + width * i]; // } // AscendC::LocalTensor<half> srcLocalList[16]; // for (int i = 0; i < 16; i++) { // srcLocalList[i] = srcLocal[j * c0size * height * width + height * width * i]; // } // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams); // (Recommended) Call mode when the input is LocalTensor address value uint64_t dstLocalList[16]; for (int i = 0; i < 16; i++) { dstLocalList[i] = (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr()); } uint64_t srcLocalList[16]; for (int i = 0; i < 16; i++) { srcLocalList[i] = (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr()); } AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is address LocalTensor // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // dst.SetValue(i, (uint64_t)(dstLocal[j * c0size * height * width + width * i].GetPhyAddr())); // } // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // src.SetValue(i, (uint64_t)(srcLocal[j * c0size * height * width + height * width * i].GetPhyAddr())); // } // AscendC::TransDataTo5HD<half>(dst, src, transDataParams); // workQueueSrc1.FreeTensor(dst); // workQueueSrc2.FreeTensor(src); } outQueueDst.EnQue<half>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1; AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<half> srcGlobal, dstGlobal; int srcDataSize = 16384; int dstDataSize = 16384; int width = 16; // H int height = 16; // W int c0size = 16; // C0 }; extern "C" __global__ __aicore__ void vec_transdata5hd_b16_nchw2nc1hwc0(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { KernelTransDataTo5HD op; op.Init(src, dstGm); op.Process(); }
Input: [[[[ 0. 0. 0. ... 0. 0. 0.] [ 0. 0. 0. ... 0. 0. 0.] [ 0. 0. 0. ... 0. 0. 0.] ... [ 0. 0. 0. ... 0. 0. 0.] [ 0. 0. 0. ... 0. 0. 0.] [ 0. 0. 0. ... 0. 0. 0.]] [[ 1. 1. 1. ... 1. 1. 1.] [ 1. 1. 1. ... 1. 1. 1.] [ 1. 1. 1. ... 1. 1. 1.] ... [ 1. 1. 1. ... 1. 1. 1.] [ 1. 1. 1. ... 1. 1. 1.] [ 1. 1. 1. ... 1. 1. 1.]] [[ 2. 2. 2. ... 2. 2. 2.] [ 2. 2. 2. ... 2. 2. 2.] [ 2. 2. 2. ... 2. 2. 2.] ... [ 2. 2. 2. ... 2. 2. 2.] [ 2. 2. 2. ... 2. 2. 2.] [ 2. 2. 2. ... 2. 2. 2.]] ... [[29. 29. 29. ... 29. 29. 29.] [29. 29. 29. ... 29. 29. 29.] [29. 29. 29. ... 29. 29. 29.] ... [29. 29. 29. ... 29. 29. 29.] [29. 29. 29. ... 29. 29. 29.] [29. 29. 29. ... 29. 29. 29.]] [[30. 30. 30. ... 30. 30. 30.] [30. 30. 30. ... 30. 30. 30.] [30. 30. 30. ... 30. 30. 30.] ... [30. 30. 30. ... 30. 30. 30.] [30. 30. 30. ... 30. 30. 30.] [30. 30. 30. ... 30. 30. 30.]] [[31. 31. 31. ... 31. 31. 31.] [31. 31. 31. ... 31. 31. 31.] [31. 31. 31. ... 31. 31. 31.] ... [31. 31. 31. ... 31. 31. 31.] [31. 31. 31. ... 31. 31. 31.] [31. 31. 31. ... 31. 31. 31.]]] [[[32. 32. 32. ... 32. 32. 32.] [32. 32. 32. ... 32. 32. 32.] [32. 32. 32. ... 32. 32. 32.] ... [32. 32. 32. ... 32. 32. 32.] [32. 32. 32. ... 32. 32. 32.] [32. 32. 32. ... 32. 32. 32.]] [[33. 33. 33. ... 33. 33. 33.] [33. 33. 33. ... 33. 33. 33.] [33. 33. 33. ... 33. 33. 33.] ... [33. 33. 33. ... 33. 33. 33.] [33. 33. 33. ... 33. 33. 33.] [33. 33. 33. ... 33. 33. 33.]] [[34. 34. 34. ... 34. 34. 34.] [34. 34. 34. ... 34. 34. 34.] [34. 34. 34. ... 34. 34. 34.] ... [34. 34. 34. ... 34. 34. 34.] [34. 34. 34. ... 34. 34. 34.] [34. 34. 34. ... 34. 34. 34.]] ... [[61. 61. 61. ... 61. 61. 61.] [61. 61. 61. ... 61. 61. 61.] [61. 61. 61. ... 61. 61. 61.] ... [61. 61. 61. ... 61. 61. 61.] [61. 61. 61. ... 61. 61. 61.] [61. 61. 61. ... 61. 61. 61.]] [[62. 62. 62. ... 62. 62. 62.] [62. 62. 62. ... 62. 62. 62.] [62. 62. 62. ... 62. 62. 62.] ... [62. 62. 62. ... 62. 62. 62.] [62. 62. 62. ... 62. 62. 62.] [62. 62. 62. ... 62. 62. 62.]] [[63. 63. 63. ... 63. 63. 63.] [63. 63. 63. ... 63. 63. 63.] [63. 63. 63. ... 63. 63. 63.] ... [63. 63. 63. ... 63. 63. 63.] [63. 63. 63. ... 63. 63. 63.] [63. 63. 63. ... 63. 63. 63.]]]] Return: [[[[[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] ... [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]] [[ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] ... [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.] [ 0. 1. 2. ... 13. 14. 15.]]] [[[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] ... [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]] [[16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] ... [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.] [16. 17. 18. ... 29. 30. 31.]]]] [[[[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] ... [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]] [[32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] ... [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.] [32. 33. 34. ... 45. 46. 47.]]] [[[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] ... [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]] [[48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] ... [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.] [48. 49. 50. ... 61. 62. 63.]]]]]
- Example of int8_t (8-bit) call for transposing 2D matrix data blocks
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
#include "kernel_operator.h" class KernelTransDataTo5HD { public: __aicore__ inline KernelTransDataTo5HD() {} __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { srcGlobal.SetGlobalBuffer((__gm__ int8_t *)src); dstGlobal.SetGlobalBuffer((__gm__ int8_t *)dstGm); pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int8_t)); pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int8_t)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.AllocTensor<int8_t>(); AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<int8_t> srcLocal = inQueueSrc.DeQue<int8_t>(); AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.AllocTensor<int8_t>(); for(int i = 0;i<dstDataSize; i++){ dstLocal.SetValue(i,0); } AscendC::TransDataTo5HDParams transDataParams; // Write data to the upper half of dstLocalList. transDataParams.dstHighHalf = 1; // Read data from the upper half of srcLocalList. transDataParams.srcHighHalf = 1; transDataParams.repeatTimes = 1; transDataParams.dstRepStride = 0; transDataParams.srcRepStride = 0; // Call mode when the input is LocalTensor AscendC::LocalTensor<int8_t> dstLocalList[16]; for (int i = 0; i < 16; i++) { dstLocalList[i] = dstLocal[width * i]; } AscendC::LocalTensor<int8_t> srcLocalList[16]; for (int i = 0; i < 16; i++) { srcLocalList[i] = srcLocal[width * i]; } AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is LocalTensor address value // uint64_t dstLocalList[16]; // for (int i = 0; i < 16; i++) { // dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr()); // } // uint64_t srcLocalList[16]; // for (int i = 0; i < 16; i++) { // srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr()); // } // AscendC::TransDataTo5HD<int8_t>(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is address LocalTensor // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr())); // } // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr())); // } // AscendC::TransDataTo5HD<int8_t>(dst, src, transDataParams); // workQueueSrc1.FreeTensor(dst); // workQueueSrc2.FreeTensor(src); outQueueDst.EnQue<int8_t>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<int8_t> dstLocal = outQueueDst.DeQue<int8_t>(); AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1; AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<int8_t> srcGlobal, dstGlobal; int srcDataSize = 512; int dstDataSize = 512; int width = 32; }; extern "C" __global__ __aicore__ void transdata5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { KernelTransDataTo5HD op; op.Init(src, dstGm); op.Process(); }
Input: [[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31] [ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63] [ 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95] [ 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127] [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31] [ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63] [ 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95] [ 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127] [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31] [ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63] [ 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95] [ 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127] [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31] [ 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63] [ 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95] [ 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127]] Output: // Read data from the upper half of the input data and write data to the upper half of the output data. [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 48 80 112 16 48 80 112 16 48 80 112 16 48 80 112 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 49 81 113 17 49 81 113 17 49 81 113 17 49 81 113 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 50 82 114 18 50 82 114 18 50 82 114 18 50 82 114 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19 51 83 115 19 51 83 115 19 51 83 115 19 51 83 115 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 52 84 116 20 52 84 116 20 52 84 116 20 52 84 116 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 53 85 117 21 53 85 117 21 53 85 117 21 53 85 117 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 54 86 118 22 54 86 118 22 54 86 118 22 54 86 118 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23 55 87 119 23 55 87 119 23 55 87 119 23 55 87 119 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 56 88 120 24 56 88 120 24 56 88 120 24 56 88 120 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 57 89 121 25 57 89 121 25 57 89 121 25 57 89 121 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 58 90 122 26 58 90 122 26 58 90 122 26 58 90 122 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 59 91 123 27 59 91 123 27 59 91 123 27 59 91 123 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 28 60 92 124 28 60 92 124 28 60 92 124 28 60 92 124 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 29 61 93 125 29 61 93 125 29 61 93 125 29 61 93 125 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 62 94 126 30 62 94 126 30 62 94 126 30 62 94 126 ] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 31 63 95 127 31 63 95 127 31 63 95 127 31 63 95 127 ]]
- Example of half (16-bit) call for transposing 2D matrix data blocks
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
#include "kernel_operator.h" class KernelTransDataTo5HD { public: __aicore__ inline KernelTransDataTo5HD() {} __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { srcGlobal.SetGlobalBuffer((__gm__ half *)src); dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm); pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half)); pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>(); AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>(); AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>(); AscendC::TransDataTo5HDParams transDataParams; transDataParams.dstHighHalf = false; transDataParams.srcHighHalf = false; transDataParams.repeatTimes = 1; transDataParams.dstRepStride = 0; transDataParams.srcRepStride = 0; // Call mode when the input is LocalTensor AscendC::LocalTensor<half> dstLocalList[16]; for (int i = 0; i < 16; i++) { dstLocalList[i] = dstLocal[width * i]; } AscendC::LocalTensor<half> srcLocalList[16]; for (int i = 0; i < 16; i++) { srcLocalList[i] = srcLocal[width * i]; } AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is LocalTensor address value // uint64_t dstLocalList[16]; // for (int i = 0; i < 16; i++) { // dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr()); // } // uint64_t srcLocalList[16]; // for (int i = 0; i < 16; i++) { // srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr()); // } // AscendC::TransDataTo5HD<half>(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is address LocalTensor // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr())); // } // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr())); // } // AscendC::TransDataTo5HD<half>(dst, src, transDataParams); // workQueueSrc1.FreeTensor(dst); // workQueueSrc2.FreeTensor(src); outQueueDst.EnQue<half>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1; AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<half> srcGlobal, dstGlobal; int srcDataSize = 256; int dstDataSize = 256; int width = 16; }; extern "C" __global__ __aicore__ void nchwconv_demo_first(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { KernelTransDataTo5HD op; op.Init(src, dstGm); op.Process(); }
Input (src): [[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15.] [ 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [ 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47.] [ 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63.] [ 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79.] [ 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95.] [ 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.] [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127.] [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141. 142. 143.] [144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157. 158. 159.] [160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173. 174. 175.] [176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189. 190. 191.] [192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207.] [208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223.] [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237. 238. 239.] [240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253. 254. 255.]] Output (dstGm): [[ 0. 16. 32. 48. 64. 80. 96. 112. 128. 144. 160. 176. 192. 208. 224. 240.] [ 1. 17. 33. 49. 65. 81. 97. 113. 129. 145. 161. 177. 193. 209. 225. 241.] [ 2. 18. 34. 50. 66. 82. 98. 114. 130. 146. 162. 178. 194. 210. 226. 242.] [ 3. 19. 35. 51. 67. 83. 99. 115. 131. 147. 163. 179. 195. 211. 227. 243.] [ 4. 20. 36. 52. 68. 84. 100. 116. 132. 148. 164. 180. 196. 212. 228. 244.] [ 5. 21. 37. 53. 69. 85. 101. 117. 133. 149. 165. 181. 197. 213. 229. 245.] [ 6. 22. 38. 54. 70. 86. 102. 118. 134. 150. 166. 182. 198. 214. 230. 246.] [ 7. 23. 39. 55. 71. 87. 103. 119. 135. 151. 167. 183. 199. 215. 231. 247.] [ 8. 24. 40. 56. 72. 88. 104. 120. 136. 152. 168. 184. 200. 216. 232. 248.] [ 9. 25. 41. 57. 73. 89. 105. 121. 137. 153. 169. 185. 201. 217. 233. 249.] [ 10. 26. 42. 58. 74. 90. 106. 122. 138. 154. 170. 186. 202. 218. 234. 250.] [ 11. 27. 43. 59. 75. 91. 107. 123. 139. 155. 171. 187. 203. 219. 235. 251.] [ 12. 28. 44. 60. 76. 92. 108. 124. 140. 156. 172. 188. 204. 220. 236. 252.] [ 13. 29. 45. 61. 77. 93. 109. 125. 141. 157. 173. 189. 205. 221. 237. 253.] [ 14. 30. 46. 62. 78. 94. 110. 126. 142. 158. 174. 190. 206. 222. 238. 254.] [ 15. 31. 47. 63. 79. 95. 111. 127. 143. 159. 175. 191. 207. 223. 239. 255.]]
- Example of int32_t (32-bit) call for transposing 2D matrix data blocks
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
#include "kernel_operator.h" class KernelTransDataTo5HD { public: __aicore__ inline KernelTransDataTo5HD() {} __aicore__ inline void Init(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { srcGlobal.SetGlobalBuffer((__gm__ int32_t *)src); dstGlobal.SetGlobalBuffer((__gm__ int32_t *)dstGm); pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(int32_t)); pipe.InitBuffer(workQueueSrc1, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(workQueueSrc2, 1, 16 * sizeof(uint64_t)); pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(int32_t)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.AllocTensor<int32_t>(); AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.DeQue<int32_t>(); AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>(); AscendC::TransDataTo5HDParams transDataParams; transDataParams.dstHighHalf = false; transDataParams.srcHighHalf = false; transDataParams.repeatTimes = 1; transDataParams.dstRepStride = 0; transDataParams.srcRepStride = 0; // Call mode when the input is LocalTensor AscendC::LocalTensor<int32_t> dstLocalList[16]; for (int i = 0; i < 16; i++) { dstLocalList[i] = dstLocal[width * i]; } AscendC::LocalTensor<int32_t> srcLocalList[16]; for (int i = 0; i < 16; i++) { srcLocalList[i] = srcLocal[width * i]; } AscendC::TransDataTo5HD(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is LocalTensor address value // uint64_t dstLocalList[16]; // for (int i = 0; i < 16; i++) { // dstLocalList[i] = (uint64_t)(dstLocal[width * i].GetPhyAddr()); // } // uint64_t srcLocalList[16]; // for (int i = 0; i < 16; i++) { // srcLocalList[i] = (uint64_t)(srcLocal[width * i].GetPhyAddr()); // } // AscendC::TransDataTo5HD<int32_t>(dstLocalList, srcLocalList, transDataParams); // // Call mode when the input is address LocalTensor // AscendC::LocalTensor<uint64_t> dst = workQueueSrc1.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // dst.SetValue(i, (uint64_t)(dstLocal[width * i].GetPhyAddr())); // } // AscendC::LocalTensor<uint64_t> src = workQueueSrc2.AllocTensor<uint64_t>(); // for (int i = 0; i < 16; i++) { // src.SetValue(i, (uint64_t)(srcLocal[width * i].GetPhyAddr())); // } // AscendC::TransDataTo5HD<int32_t>(dst, src, transDataParams); // workQueueSrc1.FreeTensor(dst); // workQueueSrc2.FreeTensor(src); outQueueDst.EnQue<int32_t>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>(); AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc1; AscendC::TQue<AscendC::TPosition::VECIN, 1> workQueueSrc2; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<int32_t> srcGlobal, dstGlobal; int srcDataSize = 128; int dstDataSize = 128; int width = 8; }; extern "C" __global__ __aicore__ void trans5hd_simple_kernel(__gm__ uint8_t *src, __gm__ uint8_t *dstGm) { KernelTransDataTo5HD op; op.Init(src, dstGm); op.Process(); }
Input (src): [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127] Output (dstGm): [0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 2 10 18 26 34 42 50 58 66 74 82 90 98 106 114 122 3 11 19 27 35 43 51 59 67 75 83 91 99 107 115 123 4 12 20 28 36 44 52 60 68 76 84 92 100 108 116 124 5 13 21 29 37 45 53 61 69 77 85 93 101 109 117 125 6 14 22 30 38 46 54 62 70 78 86 94 102 110 118 126 7 15 23 31 39 47 55 63 71 79 87 95 103 111 119 127]