LoadDataWithTranspose
Supported Products
Product |
Supported/Unsupported |
|---|---|
√ |
|
√ |
|
√ |
|
x |
|
x |
|
x |
Function Usage
Loads 2D data with transposing from A1/B1 to A2/B2.
The following uses an example to describe the API functions and key parameters. In the following figure, an N shape or a Z shape represents a fractal.
- For the uint8_t/int8_t type, 32 x 32 x 1 bytes data is processed in each iteration. Two fractals (512 bytes each) can be processed. In each iteration, two consecutive 16 x 32 fractals in the source operand are combined into one 32 x 32 square matrix. Transpose is performed based on the square matrix, and then the transposed matrix is split into two 16 x 32 fractals. There may be different arrangements according to parameters such as the fractal interval of the destination operand.The following figure shows an example.
- A total of 3072 bytes of data needs to be processed. 32 x 32 x 1 byte data needs to be processed in each iteration, and three iterations are required. repeatTime = 3;
- srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 32 x 32 x 1 bytes). The unit is the size of the square matrix after combination.
- dstGap = 1: The gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand is 1 (unit: 512 bytes).
- dstFracGap = 0: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 0 (unit: 512 bytes).

The following figure shows an example.
- The description of repeatTime and srcStride is the same as that in the preceding figure.
- dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
- dstFracGap = 2: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 2 (unit: 512 bytes).

- For the half/bfloat16_t type, 16 x 16 x 2 bytes data is processed in each iteration. One fractal (512 bytes each) can be processed. In each iteration, one 16 x 16 fractal in the source operand is transposed.
- A total of 1536 bytes of data needs to be processed. 16 x 16 x 2 bytes data needs to be processed in each iteration, and three iterations are required. repeatTime = 3.
- srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 16 x 16 x 2 bytes).
- dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
- In this scenario, the fractal is a square matrix. Each iteration processes a fractal, and there is no interval between fractals in the iteration. Therefore, this parameter is invalid.

- For the float/int32_t/uint32_t type, 16 x 16 x 4 bytes data is processed in each iteration. Two fractals (512 bytes each) can be processed. In each iteration, two consecutive 16 x 8 fractals in the source operand are combined into one 16 x 16 square matrix. Transpose is performed based on the square matrix, and then the transposed matrix is split into two 16 x 8 fractals. There may be different arrangements according to parameters such as the fractal interval of the destination operand.The following figure shows an example.
- A total of 3072 bytes of data needs to be processed. 16 x 16 x 4 bytes data needs to be processed in each iteration, and three iterations are required. repeatTime = 3;
- srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 16 x 16 x 4 bytes). The unit is the size of the square matrix after combination.
- dstGap = 1: The gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand is 1 (unit: 512 bytes).
- dstFracGap = 0: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 0 (unit: 512 bytes).

The following figure shows an example.
- The description of repeatTime and srcStride is the same as that in the preceding figure.
- dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
- dstFracGap = 2: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 2 (unit: 512 bytes).

- For the int4b_t type, 64 x 64 x 0.5 bytes data is processed in each iteration. Four fractals (512 bytes each) can be processed. In each iteration, four consecutive 16 x 64 fractals in the source operand are combined into one 64 x 64 square matrix. Transpose is performed based on the square matrix, and then the transposed matrix is split into four 16 x 64 fractals. There may be different arrangements according to parameters such as the fractal interval of the destination operand.
The int4b_t data type requires two numbers to be combined into an number of type int8_t or uint8_t. The combination rule is as follows.
The following figure shows an example.- A total of 6144 bytes of data needs to be processed. 64 x 64 x 0.5 bytes of data needs to be processed in each iteration, and three iterations are required. repeatTime = 3.
- srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 64 x 64 x 0.5 bytes). The unit is the size of the square matrix after combination.
- dstGap = 1: The gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand is 1 (unit: 512 bytes).
- dstFracGap = 0: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 0 (unit: 512 bytes).

The following figure shows an example.
- The description of repeatTime and srcStride is the same as that in the preceding figure.
- dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
- dstFracGap = 2: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 2 (unit: 512 bytes).

Prototype
1 2 | template <typename T> __aicore__ inline void LoadDataWithTranspose(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData2dTransposeParams& loadDataParams) |
Parameters
Parameter |
Description |
|---|---|
T |
The int4b_t type is supported only when TPosition of LocalTensor is B2. |
Parameter |
Input/Output |
Meaning |
|---|---|---|
dst |
Output |
Destination operand for the result matrix, which is of the LocalTensor type. For the For the For the The start address of LocalTensor must be 512-byte aligned. The data type must be the same as that of src. |
src |
Input |
Source operand of the LocalTensor type. For the For the For the The start address of LocalTensor must be 32-byte aligned. The data type must be the same as that of dst. |
loadDataParams |
Input |
Parameters related to LoadDataWithTranspose, of the LoadData2dTransposeParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_mm.h. Replace ${INSTALL_DIR} with the actual CANN component directory. For details about the parameter description, see Table 3. |
Parameter |
Input/Output |
Meaning |
|---|---|---|
startIndex |
Input |
Square matrix ID, indicating the sequence number of the square matrix of the source operand from which the movement starts (0 indicates the first square matrix of the source operand). Value range: startIndex ∈ [0, 65535]. The default value is 0. For example, if the source operand has 20 fractals whose size is 16 x 8 x 4 bytes (data type: float) and startIndex is 1, the transfer start position is the second square matrix, that is, the third and fourth fractals are transposed from the source operand to the destination operand (the first and second fractals form the first square matrix, and the third and fourth fractals form the second square matrix). |
repeatTimes |
Input |
Number of iterations. For the uint8_t/int8_t type, 32 x 32 x 1 bytes data is processed in each iteration. For the half/bfloat16_t type, 16 x 16 x 2 bytes data is processed in each iteration. For the float/int32_t/uint32_t type, 16 x 16 x 4 bytes data is processed in each iteration. For the int4b_t type, 16 x 64 x 0.5 bytes data is processed in each iteration. Value range: repeatTimes ∈ [0, 255]. The default value is 0. |
srcStride |
Input |
In adjacent iterations, the interval between the start addresses of fractals of the source operand. The unit here is actually the size of the combined square matrix. For the uint8_t/int8_t type, the unit is 32 x 32 x 1 bytes. For the half/bfloat16_t type, the unit is 16 x 16 x 2 bytes. For the float/int32_t/uint32_t type, the unit is 16 x 16 x 4 bytes. For the int4b_t type, 16 x 64 x 0.5 bytes data is processed in each iteration. Value range: srcStride ∈ [0, 65535]. The default value is 0. |
dstGap |
Input |
Gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand (unit: 512 bytes). Value range: dstGap ∈ [0, 65535]. The default value is 0. |
dstFracGap |
Input |
Gap between the previous fractal end address and the next fractal start address for transposing the destination operand in each iteration. The unit is 512 bytes. This parameter is valid only when the data type is float/int32_t/uint32_t/uint8_t/int8_t/int4b_t. Value range: dstFracGap ∈ [0, 65535]. The default value is 0. |
addrMode |
Input |
Reserved. It is reserved for subsequent functions. You can use the default value for now. |
Restrictions
- If repeat is set to 0, no movement operation is performed.
- Developers need to ensure that the fractals after the transposition of the target operand do not overlap.
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
Examples
- Example 1: In this example, the input matrix a is of the int8_t type and the shape is [16,32], the input matrix b is of the int8_t type and the shape is [32,64], and the output matrix c is of the int32_t type. Matrix a is not transposed from A1 to A2, matrix b is transposed from B1 to B2, and then Mmad and Fixpipe are performed.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152
#include "kernel_operator.h" template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelMatmul { public: __aicore__ inline KernelMatmul() { aSize = m * k; bSize = k * n; cSize = m * n; nBlocks = n / 16; } __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) { aGM.SetGlobalBuffer((__gm__ fmap_T *)a); bGM.SetGlobalBuffer((__gm__ weight_T *)b); cGM.SetGlobalBuffer((__gm__ dstCO1_T *)c); pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(fmap_T)); pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(fmap_T)); pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(weight_T)); pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(weight_T)); pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(dstCO1_T)); } __aicore__ inline void Process() { CopyIn(); SplitA(); SplitB(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.AllocTensor<fmap_T>(); AscendC::LocalTensor<weight_T> b1Local = inQueueB1.AllocTensor<weight_T>(); AscendC::Nd2NzParams dataCopyA1Params; dataCopyA1Params.ndNum = 1; dataCopyA1Params.nValue = m; dataCopyA1Params.dValue = k; dataCopyA1Params.srcNdMatrixStride = 0; dataCopyA1Params.srcDValue = k; dataCopyA1Params.dstNzC0Stride = m; dataCopyA1Params.dstNzNStride = 1; dataCopyA1Params.dstNzMatrixStride = 0; AscendC::DataCopy(a1Local, aGM, dataCopyA1Params); AscendC::Nd2NzParams dataCopyB1Params; dataCopyB1Params.ndNum = 1; dataCopyB1Params.nValue = k; dataCopyB1Params.dValue = n; dataCopyB1Params.srcNdMatrixStride = 0; dataCopyB1Params.srcDValue = n; dataCopyB1Params.dstNzC0Stride = k; dataCopyB1Params.dstNzNStride = 1; dataCopyB1Params.dstNzMatrixStride = 0; AscendC::DataCopy(b1Local, bGM, dataCopyB1Params); inQueueA1.EnQue(a1Local); inQueueB1.EnQue(b1Local); } __aicore__ inline void SplitA() { AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.DeQue<fmap_T>(); AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.AllocTensor<fmap_T>(); AscendC::LoadData2DParams loadL0AParams; loadL0AParams.repeatTimes = aSize * sizeof(fmap_T) / 512; loadL0AParams.srcStride = 1; loadL0AParams.ifTranspose = false; AscendC::LoadData(a2Local, a1Local, loadL0AParams); inQueueA2.EnQue<fmap_T>(a2Local); inQueueA1.FreeTensor(a1Local); } __aicore__ inline void SplitB() { AscendC::LocalTensor<weight_T> b1Local = inQueueB1.DeQue<weight_T>(); AscendC::LocalTensor<weight_T> b2Local = inQueueB2.AllocTensor<weight_T>(); AscendC::LoadData2dTransposeParams loadDataParams; loadDataParams.startIndex = 0; nBlockSize = 32; loadDataParams.repeatTimes = n / nBlockSize; loadDataParams.srcStride = 1; loadDataParams.dstGap = 1; loadDataParams.dstFracGap = 0; AscendC::LoadDataWithTranspose(b2Local, b1Local, loadDataParams); inQueueB1.FreeTensor(b1Local); inQueueB2.EnQue<weight_T>(b2Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.DeQue<fmap_T>(); AscendC::LocalTensor<weight_T> b2Local = inQueueB2.DeQue<weight_T>(); AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.AllocTensor<dstCO1_T>(); AscendC::MmadParams mmadParams; mmadParams.m = m; mmadParams.n = n; mmadParams.k = k; AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams); outQueueCO1.EnQue<dstCO1_T>(c1Local); inQueueA2.FreeTensor(a2Local); inQueueB2.FreeTensor(b2Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.DeQue<dstCO1_T>(); AscendC::FixpipeParamsV220 fixpipeParams; fixpipeParams.nSize = n; fixpipeParams.mSize = m; fixpipeParams.srcStride = m; fixpipeParams.dstStride = n; fixpipeParams.ndNum = 1; fixpipeParams.srcNdStride = 0; fixpipeParams.dstNdStride = 0; AscendC::Fixpipe(cGM, c1Local, fixpipeParams); outQueueCO1.FreeTensor(c1Local); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1; AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2; AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1; AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2; // dst queue AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1; AscendC::GlobalTensor<fmap_T> aGM; AscendC::GlobalTensor<weight_T> bGM; AscendC::GlobalTensor<dst_T> cGM; uint16_t m = 16, k = 32, n = 64; uint8_t nBlockSize = 16; uint16_t c0Size = 16; uint16_t aSize, bSize, cSize, nBlocks; }; extern "C" __global__ __aicore__ void cube_matmul_loaddata_operator_int8_t(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) { KernelMatmul<dst_type, fmap_type, weight_type, dstCO1_type> op; op.Init(a, b, c); op.Process(); }
- Example 2: In this example, the input matrix a is of the half type and the shape is [16,32], the input matrix b is of the half type and the shape is [32,32], and the output matrix c is of the float type. Matrix a is not transposed from A1 to A2, matrix b is transposed from B1 to B2, and then Mmad and Fixpipe are performed.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
#include "kernel_operator.h" template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelMatmul { public: __aicore__ inline KernelMatmul() { aSize = m * k; bSize = k * n; cSize = m * n; nBlocks = n / 16; } __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) { aGM.SetGlobalBuffer((__gm__ fmap_T *)a); bGM.SetGlobalBuffer((__gm__ weight_T *)b); cGM.SetGlobalBuffer((__gm__ dstCO1_T *)c); pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(fmap_T)); pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(fmap_T)); pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(weight_T)); pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(weight_T)); pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(dstCO1_T)); } __aicore__ inline void Process() { CopyIn(); SplitA(); SplitB(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.AllocTensor<fmap_T>(); AscendC::LocalTensor<weight_T> b1Local = inQueueB1.AllocTensor<weight_T>(); AscendC::Nd2NzParams dataCopyA1Params; dataCopyA1Params.ndNum = 1; dataCopyA1Params.nValue = m; dataCopyA1Params.dValue = k; dataCopyA1Params.srcNdMatrixStride = 0; dataCopyA1Params.srcDValue = k; dataCopyA1Params.dstNzC0Stride = m; dataCopyA1Params.dstNzNStride = 1; dataCopyA1Params.dstNzMatrixStride = 0; AscendC::DataCopy(a1Local, aGM, dataCopyA1Params); AscendC::Nd2NzParams dataCopyB1Params; dataCopyB1Params.ndNum = 1; dataCopyB1Params.nValue = k; dataCopyB1Params.dValue = n; dataCopyB1Params.srcNdMatrixStride = 0; dataCopyB1Params.srcDValue = n; dataCopyB1Params.dstNzC0Stride = k; dataCopyB1Params.dstNzNStride = 1; dataCopyB1Params.dstNzMatrixStride = 0; AscendC::DataCopy(b1Local, bGM, dataCopyB1Params); inQueueA1.EnQue(a1Local); inQueueB1.EnQue(b1Local); } __aicore__ inline void SplitA() { AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.DeQue<fmap_T>(); AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.AllocTensor<fmap_T>(); AscendC::LoadData2DParams loadL0AParams; loadL0AParams.repeatTimes = aSize * sizeof(fmap_T) / 512; loadL0AParams.srcStride = 1; loadL0AParams.ifTranspose = false; AscendC::LoadData(a2Local, a1Local, loadL0AParams); inQueueA2.EnQue<fmap_T>(a2Local); inQueueA1.FreeTensor(a1Local); } __aicore__ inline void SplitB() { AscendC::LocalTensor<weight_T> b1Local = inQueueB1.DeQue<weight_T>(); AscendC::LocalTensor<weight_T> b2Local = inQueueB2.AllocTensor<weight_T>(); AscendC::LoadData2dTransposeParams loadDataParams; loadDataParams.startIndex = 0; nBlockSize = 16; loadDataParams.repeatTimes = k / nBlockSize; loadDataParams.srcStride = 1; loadDataParams.dstGap = 1; for (int i = 0; i < (n / nBlockSize); ++i) { AscendC::LoadDataWithTranspose(b2Local[i * 16 * nBlockSize], b1Local[i * k * nBlockSize], loadDataParams); } inQueueB1.FreeTensor(b1Local); inQueueB2.EnQue<weight_T>(b2Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.DeQue<fmap_T>(); AscendC::LocalTensor<weight_T> b2Local = inQueueB2.DeQue<weight_T>(); AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.AllocTensor<dstCO1_T>(); AscendC::MmadParams mmadParams; mmadParams.m = m; mmadParams.n = n; mmadParams.k = k; AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams); outQueueCO1.EnQue<dstCO1_T>(c1Local); inQueueA2.FreeTensor(a2Local); inQueueB2.FreeTensor(b2Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.DeQue<dstCO1_T>(); AscendC::FixpipeParamsV220 fixpipeParams; fixpipeParams.nSize = n; fixpipeParams.mSize = m; fixpipeParams.srcStride = m; fixpipeParams.dstStride = n; fixpipeParams.ndNum = 1; fixpipeParams.srcNdStride = 0; fixpipeParams.dstNdStride = 0; AscendC::Fixpipe(cGM, c1Local, fixpipeParams); outQueueCO1.FreeTensor(c1Local); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1; AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2; AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1; AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2; // dst queue AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1; AscendC::GlobalTensor<fmap_T> aGM; AscendC::GlobalTensor<weight_T> bGM; AscendC::GlobalTensor<dst_T> cGM; uint16_t m = 16, k = 32, n = 32; uint8_t nBlockSize = 16; uint16_t c0Size = 16; uint16_t aSize, bSize, cSize, nBlocks; }; extern "C" __global__ __aicore__ void cube_matmul_loaddata_operator_half(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) { KernelMatmul<dst_type, fmap_type, weight_type, dstCO1_type> op; op.Init(a, b, c); op.Process(); }
- Example 3: In this example, the input matrix a is of the float type and the shape is [16,16], the input matrix b is of the float type and the shape is [16,32], and the output matrix c is of the float type. Matrix a is not transposed from A1 to A2, matrix b is transposed from B1 to B2, and then Mmad and Fixpipe are performed.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152
#include "kernel_operator.h" template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelMatmul { public: __aicore__ inline KernelMatmul() { aSize = m * k; bSize = k * n; cSize = m * n; nBlocks = n / 16; } __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) { aGM.SetGlobalBuffer((__gm__ fmap_T *)a); bGM.SetGlobalBuffer((__gm__ weight_T *)b); cGM.SetGlobalBuffer((__gm__ dstCO1_T *)c); pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(fmap_T)); pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(fmap_T)); pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(weight_T)); pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(weight_T)); pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(dstCO1_T)); } __aicore__ inline void Process() { CopyIn(); SplitA(); SplitB(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.AllocTensor<fmap_T>(); AscendC::LocalTensor<weight_T> b1Local = inQueueB1.AllocTensor<weight_T>(); AscendC::Nd2NzParams dataCopyA1Params; dataCopyA1Params.ndNum = 1; dataCopyA1Params.nValue = m; dataCopyA1Params.dValue = k; dataCopyA1Params.srcNdMatrixStride = 0; dataCopyA1Params.srcDValue = k; dataCopyA1Params.dstNzC0Stride = m; dataCopyA1Params.dstNzNStride = 1; dataCopyA1Params.dstNzMatrixStride = 0; AscendC::DataCopy(a1Local, aGM, dataCopyA1Params); AscendC::Nd2NzParams dataCopyB1Params; dataCopyB1Params.ndNum = 1; dataCopyB1Params.nValue = k; dataCopyB1Params.dValue = n; dataCopyB1Params.srcNdMatrixStride = 0; dataCopyB1Params.srcDValue = n; dataCopyB1Params.dstNzC0Stride = k; dataCopyB1Params.dstNzNStride = 1; dataCopyB1Params.dstNzMatrixStride = 0; AscendC::DataCopy(b1Local, bGM, dataCopyB1Params); inQueueA1.EnQue(a1Local); inQueueB1.EnQue(b1Local); } __aicore__ inline void SplitA() { AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.DeQue<fmap_T>(); AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.AllocTensor<fmap_T>(); AscendC::LoadData2DParams loadL0AParams; loadL0AParams.repeatTimes = aSize * sizeof(fmap_T) / 512; loadL0AParams.srcStride = 1; loadL0AParams.ifTranspose = false; AscendC::LoadData(a2Local, a1Local, loadL0AParams); inQueueA2.EnQue<fmap_T>(a2Local); inQueueA1.FreeTensor(a1Local); } __aicore__ inline void SplitB() { AscendC::LocalTensor<weight_T> b1Local = inQueueB1.DeQue<weight_T>(); AscendC::LocalTensor<weight_T> b2Local = inQueueB2.AllocTensor<weight_T>(); AscendC::LoadData2dTransposeParams loadDataParams; loadDataParams.startIndex = 0; nBlockSize = 16; loadDataParams.repeatTimes = n / nBlockSize; loadDataParams.srcStride = 1; loadDataParams.dstGap = 0; loadDataParams.dstFracGap = n / nBlockSize - 1; AscendC::LoadDataWithTranspose(b2Local, b1Local, loadDataParams); inQueueB1.FreeTensor(b1Local); inQueueB2.EnQue<weight_T>(b2Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.DeQue<fmap_T>(); AscendC::LocalTensor<weight_T> b2Local = inQueueB2.DeQue<weight_T>(); AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.AllocTensor<dstCO1_T>(); AscendC::MmadParams mmadParams; mmadParams.m = m; mmadParams.n = n; mmadParams.k = k; AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams); outQueueCO1.EnQue<dstCO1_T>(c1Local); inQueueA2.FreeTensor(a2Local); inQueueB2.FreeTensor(b2Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.DeQue<dstCO1_T>(); AscendC::FixpipeParamsV220 fixpipeParams; fixpipeParams.nSize = n; fixpipeParams.mSize = m; fixpipeParams.srcStride = m; fixpipeParams.dstStride = n; fixpipeParams.ndNum = 1; fixpipeParams.srcNdStride = 0; fixpipeParams.dstNdStride = 0; AscendC::Fixpipe(cGM, c1Local, fixpipeParams); outQueueCO1.FreeTensor(c1Local); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1; AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2; AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1; AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2; // dst queue AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1; AscendC::GlobalTensor<fmap_T> aGM; AscendC::GlobalTensor<weight_T> bGM; AscendC::GlobalTensor<dst_T> cGM; uint16_t m = 16, k = 16, n = 32; uint8_t nBlockSize = 16; uint16_t c0Size = 16; uint16_t aSize, bSize, cSize, nBlocks; }; extern "C" __global__ __aicore__ void cube_matmul_loaddata_operator_float(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c) { KernelMatmul<dst_type, fmap_type, weight_type, dstCO1_type> op; op.Init(a, b, c); op.Process(); }