MmadWithSparse

Product Support

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	x
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Performs matrix multiplication and addition operations. The passed left matrix A is a sparse matrix, and the passed right matrix B is a dense matrix. For matrix A, densification is completed during MmadWithSparse computation. For matrix B, densification is automatically completed during input data preparation before computation execution (densification is performed according to the densification algorithm described below). Therefore, matrix B passed to this API is a dense matrix. The dense matrix B needs to be loaded by calling LoadDataWithSparse, and the index matrix needs to be loaded at the same time. The index matrix is generated during the densification of matrix B and then used for the densification of matrix A.

Prototype

template <typename T = int32_t, typename U = int8_t, typename Std::enable_if<Std::is_same<PrimT<T>, int32_t>::value, bool>::type = true, typename Std::enable_if<Std::is_same<PrimT<U>, int8_t>::value, bool>::type = true>
__aicore__ inline void MmadWithSparse(const LocalTensor<T>& dst, const LocalTensor<U>& fm, const LocalTensor<U>& filter, const MmadParams& mmadParams)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of dst.
U	Data types of fm and filter. When dst, fm, and filter are of basic data types, T must be of the int32_t type, and U must be of the int8_t type. Otherwise, the compilation fails. When dst, fm, and filter are of the TensorTrait type, the LiteType of T must be the int32_t type and the LiteType of U must be the int8_t type. Otherwise, the compilation fails. The last two template parameters are used only for checking the preceding data types.

**Table 2** Parameters
Parameter	Input/Output	Meaning
dst	Output	Destination operand; result matrix. Type: LocalTensor. Supported TPosition: CO1. The start address of LocalTensor must be 256-element (1024-byte) aligned.
fm	Input	Source operand; left matrix A. Type: LocalTensor. Supported TPosition: A2. The start address of LocalTensor must be 512-byte aligned.
filter	Input	Source operand; right matrix B. Type: LocalTensor. Supported TPosition: B2. The start address of LocalTensor must be 512-byte aligned.
mmadParams	Input	Matrix multiplication parameters, of the MmadParams type. For details about the definition, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_mm.h. Replace *${INSTALL_DIR}* with the file storage path after the CANN software is installed. For details about the parameter description, see Table 3.

Restrictions

In the original sparse matrix B, there should be a maximum of two non-zero elements in every four elements. If there are three or more non-zero elements, only the first two non-zero elements are used.
If any of M, K, and N is 0, the instruction is not executed.

For details about the operand address alignment requirements, see General Address Alignment Restrictions.

Dense Algorithm Description

It is assumed that there are at least two zeros in every four elements of the original sparse matrix B, and the matrix B after densification is a dense matrix in which two zeros are filtered out from every four elements. An index matrix is generated in a densification process of the matrix B. The process is as follows: For every four elements in the sparse matrix B, two 2-bit indexes are generated in the index matrix, and encoding is performed according to the following rule. The index must be in the range of {0, 1, 2}.

The first index is used to indicate the relative location of the first non-zero element in the first three elements.
The second index is used to indicate the relative location of the second non-zero element in the last three elements.

For details, see the following table. - indicates that the algorithm does not care about the value at the position because the value will be filtered.

Example	ele0	ele1	ele2	ele3	Index_a[i]	Index_b[i]
Two non-zero elements	0	0	X	Y	2'b10	2'b10
	0	X	0	Y	2'b01	2'b10
	X	0	0	Y	2'b00	2'b10
	0	X	Y	-	2'b01	2'b01
	X	0	Y	-	2'b00	2'b01
	X	Y	-	-	2'b00	2'b00
One non-zero element	0	0	0	X	2'b00	2'b10
	0	0	X	0	2'b10	2'b00
	0	X	0	0	2'b01	2'b00
	X	0	0	0	2'b00	2'b00
All zeros	0	0	0	0	2'b00	2'b00

The index matrix is used for densification of matrix A. Based on the index matrix, two elements are selected from the four elements in matrix A for computation, as shown in the following figure.

Example

#include "kernel_operator.h"

class KernelMatmul {
public:
    __aicore__ inline KernelMatmul() {}
    __aicore__ inline void Init(__gm__ uint8_t* a, __gm__ uint8_t* b, __gm__ uint8_t* idx, __gm__ uint8_t* c, uint16_t m, uint16_t k, uint16_t n)
    {
        this->m = m;
        this->k = k;
        this->n = n;

        aSize = m * k;
        bSize = k / 2 * n;
        cSize = m * n;
        mBlocks = m / 16;
        nBlocks = n / 16;
        kBlocks = k / 32;

        aGM.SetGlobalBuffer((__gm__ int8_t*)a);
        bGM.SetGlobalBuffer((__gm__ int8_t*)b);
        idxGM.SetGlobalBuffer((__gm__ uint8_t*)idx);
        cGM.SetGlobalBuffer((__gm__ int32_t*)c);
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(int8_t));
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(int8_t));
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(int8_t));
        pipe.InitBuffer(inQueueIdxB1, 1, (bSize / 4) * sizeof(int8_t));
        pipe.InitBuffer(inQueueB2, 1, bSize * sizeof(int8_t));
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(int32_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        SplitA();

        AscendC::LocalTensor<int8_t> b1Local = inQueueB1.DeQue<int8_t>();
        AscendC::LocalTensor<uint8_t> idexb1Local = inQueueIdxB1.DeQue<uint8_t>();
        AscendC::LocalTensor<int8_t> a2Local = inQueueA2.DeQue<int8_t>();
        SplitB(b1Local, idexb1Local);
        Compute(a2Local);
        inQueueB1.FreeTensor(b1Local);
        inQueueIdxB1.FreeTensor(idexb1Local);
        inQueueA2.FreeTensor(a2Local);

        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<int8_t> a1Local = inQueueA1.AllocTensor<int8_t>();
        AscendC::LocalTensor<int8_t> b1Local = inQueueB1.AllocTensor<int8_t>();
        AscendC::LocalTensor<uint8_t> idxb1Local = inQueueIdxB1.AllocTensor<uint8_t>();
        AscendC::DataCopy(a1Local, aGM, { 1, static_cast<uint16_t>(aSize * sizeof(int8_t) / 32), 0, 0 });
        AscendC::DataCopy(b1Local, bGM, { 1, static_cast<uint16_t>(bSize * sizeof(int8_t) / 32), 0, 0 });
        AscendC::DataCopy(idxb1Local, idxGM, { 1, static_cast<uint16_t>(bSize / 4 * sizeof(int8_t) / 32), 0, 0 });

        inQueueA1.EnQue(a1Local);
        inQueueB1.EnQue(b1Local);
        inQueueIdxB1.EnQue(idxb1Local);
    }
    __aicore__ inline void SplitA()
    {
        int srcOffset = 0;
        int dstOffset = 0;
        AscendC::LocalTensor<int8_t> a1Local = inQueueA1.DeQue<int8_t>();
        AscendC::LocalTensor<int8_t> a2Local = inQueueA2.AllocTensor<int8_t>();

        AscendC::LoadData2DParams loadDataParams;
        loadDataParams.repeatTimes = kBlocks * mBlocks;
        loadDataParams.srcStride = 1;
        loadDataParams.ifTranspose = false;

        AscendC::LoadData(a2Local, a1Local, loadDataParams);

        inQueueA2.EnQue<int8_t>(a2Local);
        inQueueA1.FreeTensor(a1Local);
    }
    __aicore__ inline void SplitB(AscendC::LocalTensor<int8_t>& b1Local, AscendC::LocalTensor<uint8_t>& idxb1Local)
    {
        AscendC::LocalTensor<int8_t> b2Local = inQueueB2.AllocTensor<int8_t>();

        // transform nz to zn
        AscendC::LoadData2DParams loadDataParams;
        loadDataParams.repeatTimes = kBlocks * nBlocks / 2;
        loadDataParams.srcStride = 0;
        loadDataParams.ifTranspose = false;

        AscendC::LoadDataWithSparse(b2Local, b1Local, idxb1Local, loadDataParams);

        inQueueB2.EnQue<int8_t>(b2Local);
    }
    __aicore__ inline void Compute(const AscendC::LocalTensor<int8_t>& a2Local)
    {
        AscendC::LocalTensor<int8_t> b2Local = inQueueB2.DeQue<int8_t>();
        AscendC::LocalTensor<int32_t> c1Local = outQueueCO1.AllocTensor<int32_t>();

        AscendC::MmadWithSparse(c1Local, a2Local, b2Local, { m, n, k, false, 0, false, false, false });

        outQueueCO1.EnQue<int32_t>(c1Local);
        inQueueB2.FreeTensor(b2Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<int32_t> c1Local = outQueueCO1.DeQue<int32_t>();

        AscendC::FixpipeParamsV220 fixpipeParams;
        fixpipeParams.nSize = n;
        fixpipeParams.mSize = m;
        fixpipeParams.srcStride = m;
        fixpipeParams.dstStride = n;

        fixpipeParams.ndNum = 1;
        fixpipeParams.srcNdStride = 0;
        fixpipeParams.dstNdStride = 0;

        AscendC::Fixpipe(cGM, c1Local, fixpipeParams);

        outQueueCO1.FreeTensor(c1Local);
    }

private:
    AscendC::TPipe pipe;

    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;
    AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueIdxB1;
    AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2;
    // dst queue
    AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;

    AscendC::GlobalTensor<int8_t> aGM, bGM;
    AscendC::GlobalTensor<uint8_t> idxGM;
    AscendC::GlobalTensor<int32_t> cGM;

    uint16_t m;
    uint16_t n;
    uint16_t k;

    uint16_t aSize, bSize, cSize, mBlocks, nBlocks, kBlocks;
};

#define KERNEL_MMAD_WITH_SPARSE_OPERATOR_TEST(m, k, n)                                        \
    extern "C" __global__ __aicore__ void kernel_mmad_with_sparse_operator##_##m##_##k##_##n( \
        GM_ADDR a, GM_ADDR b, GM_ADDR idx, GM_ADDR c)                                         \
    {                                                                                         \
        KernelMatmul op;                                                                      \
        op.Init(a, b, idx, c, m, k, n);                                                       \
        op.Process();                                                                         \
    }

KERNEL_MMAD_WITH_SPARSE_OPERATOR_TEST(16, 64, 16)

Parent topic: Cube Computation