MmadWithSparse
Supported Products
Product |
Supported (√/x) |
|---|---|
√ |
|
√ |
|
x |
|
x |
|
x |
|
x |
Function Usage
Performs matrix multiplication and addition operations. The passed left matrix A is a sparse matrix, and the passed right matrix B is a dense matrix. For matrix A, densification is completed during MmadWithSparse computation. For matrix B, densification is automatically completed during input data preparation before computation execution (densification is performed according to the densification algorithm described below). Therefore, matrix B passed to this API is a dense matrix. The dense matrix B needs to be loaded by calling LoadDataWithSparse, and the index matrix needs to be loaded at the same time. The index matrix is generated during the densification of matrix B and then used for the densification of matrix A.
Prototype
1 2 | template <typename T = int32_t, typename U = int8_t, typename Std::enable_if<Std::is_same<PrimT<T>, int32_t>::value, bool>::type = true, typename Std::enable_if<Std::is_same<PrimT<U>, int8_t>::value, bool>::type = true> __aicore__ inline void MmadWithSparse(const LocalTensor<T>& dst, const LocalTensor<U>& fm, const LocalTensor<U>& filter, const MmadParams& mmadParams) |
Parameters
Parameter |
Description |
|---|---|
T |
Data type of dst. |
U |
Data type of fm and filter.
The last two template parameters are used only for checking the preceding data types. |
Parameter |
Input/Output |
Meaning |
|---|---|---|
dst |
Output |
Destination operand; result matrix. Type: LocalTensor. Supported TPosition: CO1. The start address of LocalTensor must be 256-element (1024-byte) aligned. |
fm |
Input |
Source operand; left matrix A. Type: LocalTensor. Supported TPosition: A2. The start address of LocalTensor must be 512-byte aligned. |
filter |
Input |
Source operand; right matrix B. Type: LocalTensor. Supported TPosition: B2. The start address of LocalTensor must be 512-byte aligned. |
mmadParams |
Input |
Matrix multiplication parameters, of the MmadParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_mm.h. Replace ${INSTALL_DIR} with the actual CANN component directory. For details about the parameter description, see Table 3. |
Restrictions
- In the original sparse matrix B, there should be a maximum of two non-zero elements in every four elements. If there are three or more non-zero elements, only the first two non-zero elements are used.
- When any of M, K, N is 0, this instruction is not executed.
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
Dense Algorithm Description
It is assumed that there are at least two zeros in every four elements of the original sparse matrix B, and the matrix B after densification is a dense matrix in which two zeros are filtered out from every four elements. An index matrix is generated in a densification process of the matrix B. The process is as follows: For every four elements in the sparse matrix B, two 2-bit indexes are generated in the index matrix, and encoding is performed according to the following rule. The index must be in the range of {0, 1, 2}.
- The first index is used to indicate the relative location of the first non-zero element in the first three elements.
- The second index is used to indicate the relative location of the second non-zero element in the last three elements.
For details, see the following table. - indicates that the algorithm does not care about the value at the position because the value will be filtered.
Example |
ele0 |
ele1 |
ele2 |
ele3 |
Index_a[i] |
Index_b[i] |
|---|---|---|---|---|---|---|
Two non-zero elements |
0 |
0 |
X |
Y |
2'b10 |
2'b10 |
0 |
X |
0 |
Y |
2'b01 |
2'b10 |
|
X |
0 |
0 |
Y |
2'b00 |
2'b10 |
|
0 |
X |
Y |
- |
2'b01 |
2'b01 |
|
X |
0 |
Y |
- |
2'b00 |
2'b01 |
|
X |
Y |
- |
- |
2'b00 |
2'b00 |
|
One non-zero element |
0 |
0 |
0 |
X |
2'b00 |
2'b10 |
0 |
0 |
X |
0 |
2'b10 |
2'b00 |
|
0 |
X |
0 |
0 |
2'b01 |
2'b00 |
|
X |
0 |
0 |
0 |
2'b00 |
2'b00 |
|
All zeros |
0 |
0 |
0 |
0 |
2'b00 |
2'b00 |
The index matrix is used for densification of matrix A. Based on the index matrix, two elements are selected from the four elements in matrix A for computation, as shown in the following figure.

Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 | #include "kernel_operator.h" class KernelMatmul { public: __aicore__ inline KernelMatmul() {} __aicore__ inline void Init(__gm__ uint8_t* a, __gm__ uint8_t* b, __gm__ uint8_t* idx, __gm__ uint8_t* c, uint16_t m, uint16_t k, uint16_t n) { this->m = m; this->k = k; this->n = n; aSize = m * k; bSize = k / 2 * n; cSize = m * n; mBlocks = m / 16; nBlocks = n / 16; kBlocks = k / 32; aGM.SetGlobalBuffer((__gm__ int8_t*)a); bGM.SetGlobalBuffer((__gm__ int8_t*)b); idxGM.SetGlobalBuffer((__gm__ uint8_t*)idx); cGM.SetGlobalBuffer((__gm__ int32_t*)c); pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(int8_t)); pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(int8_t)); pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(int8_t)); pipe.InitBuffer(inQueueIdxB1, 1, (bSize / 4) * sizeof(int8_t)); pipe.InitBuffer(inQueueB2, 1, bSize * sizeof(int8_t)); pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(int32_t)); } __aicore__ inline void Process() { CopyIn(); SplitA(); AscendC::LocalTensor<int8_t> b1Local = inQueueB1.DeQue<int8_t>(); AscendC::LocalTensor<uint8_t> idexb1Local = inQueueIdxB1.DeQue<uint8_t>(); AscendC::LocalTensor<int8_t> a2Local = inQueueA2.DeQue<int8_t>(); SplitB(b1Local, idexb1Local); Compute(a2Local); inQueueB1.FreeTensor(b1Local); inQueueIdxB1.FreeTensor(idexb1Local); inQueueA2.FreeTensor(a2Local); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<int8_t> a1Local = inQueueA1.AllocTensor<int8_t>(); AscendC::LocalTensor<int8_t> b1Local = inQueueB1.AllocTensor<int8_t>(); AscendC::LocalTensor<uint8_t> idxb1Local = inQueueIdxB1.AllocTensor<uint8_t>(); AscendC::DataCopy(a1Local, aGM, { 1, static_cast<uint16_t>(aSize * sizeof(int8_t) / 32), 0, 0 }); AscendC::DataCopy(b1Local, bGM, { 1, static_cast<uint16_t>(bSize * sizeof(int8_t) / 32), 0, 0 }); AscendC::DataCopy(idxb1Local, idxGM, { 1, static_cast<uint16_t>(bSize / 4 * sizeof(int8_t) / 32), 0, 0 }); inQueueA1.EnQue(a1Local); inQueueB1.EnQue(b1Local); inQueueIdxB1.EnQue(idxb1Local); } __aicore__ inline void SplitA() { int srcOffset = 0; int dstOffset = 0; AscendC::LocalTensor<int8_t> a1Local = inQueueA1.DeQue<int8_t>(); AscendC::LocalTensor<int8_t> a2Local = inQueueA2.AllocTensor<int8_t>(); AscendC::LoadData2DParams loadDataParams; loadDataParams.repeatTimes = kBlocks * mBlocks; loadDataParams.srcStride = 1; loadDataParams.ifTranspose = false; AscendC::LoadData(a2Local, a1Local, loadDataParams); inQueueA2.EnQue<int8_t>(a2Local); inQueueA1.FreeTensor(a1Local); } __aicore__ inline void SplitB(AscendC::LocalTensor<int8_t>& b1Local, AscendC::LocalTensor<uint8_t>& idxb1Local) { AscendC::LocalTensor<int8_t> b2Local = inQueueB2.AllocTensor<int8_t>(); // transform nz to zn AscendC::LoadData2DParams loadDataParams; loadDataParams.repeatTimes = kBlocks * nBlocks / 2; loadDataParams.srcStride = 0; loadDataParams.ifTranspose = false; AscendC::LoadDataWithSparse(b2Local, b1Local, idxb1Local, loadDataParams); inQueueB2.EnQue<int8_t>(b2Local); } __aicore__ inline void Compute(const AscendC::LocalTensor<int8_t>& a2Local) { AscendC::LocalTensor<int8_t> b2Local = inQueueB2.DeQue<int8_t>(); AscendC::LocalTensor<int32_t> c1Local = outQueueCO1.AllocTensor<int32_t>(); AscendC::MmadWithSparse(c1Local, a2Local, b2Local, { m, n, k, false, 0, false, false, false }); outQueueCO1.EnQue<int32_t>(c1Local); inQueueB2.FreeTensor(b2Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<int32_t> c1Local = outQueueCO1.DeQue<int32_t>(); AscendC::FixpipeParamsV220 fixpipeParams; fixpipeParams.nSize = n; fixpipeParams.mSize = m; fixpipeParams.srcStride = m; fixpipeParams.dstStride = n; fixpipeParams.ndNum = 1; fixpipeParams.srcNdStride = 0; fixpipeParams.dstNdStride = 0; AscendC::Fixpipe(cGM, c1Local, fixpipeParams); outQueueCO1.FreeTensor(c1Local); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1; AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2; AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1; AscendC::TQue<AscendC::TPosition::B1, 1> inQueueIdxB1; AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2; // dst queue AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1; AscendC::GlobalTensor<int8_t> aGM, bGM; AscendC::GlobalTensor<uint8_t> idxGM; AscendC::GlobalTensor<int32_t> cGM; uint16_t m; uint16_t n; uint16_t k; uint16_t aSize, bSize, cSize, mBlocks, nBlocks, kBlocks; }; #define KERNEL_MMAD_WITH_SPARSE_OPERATOR_TEST(m, k, n) \ extern "C" __global__ __aicore__ void kernel_mmad_with_sparse_operator##_##m##_##k##_##n( \ GM_ADDR a, GM_ADDR b, GM_ADDR idx, GM_ADDR c) \ { \ KernelMatmul op; \ op.Init(a, b, idx, c, m, k, n); \ op.Process(); \ } KERNEL_MMAD_WITH_SPARSE_OPERATOR_TEST(16, 64, 16) |