LoadDataWithTranspose

Product Support

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product's AI Core	x
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Loads 2D data with transposing from A1/B1 to A2/B2.

The following uses an example to describe the API functions and key parameters. In the following figure, an N shape or a Z shape represents a fractal.

For the uint8_t and int8_t data types, each iteration processes 32 × 32 × 1 bytes of data, which can be divided into 2 tiles (each tile is 512 bytes). In each iteration, two consecutive 16 × 32 tiles in the source operand are combined into one 32 × 32 square matrix. A transpose operation is then performed on this square matrix. After transposition, the matrix is split back into two 16 × 32 tiles. These tiles can be arranged in different layouts according to parameters such as the tile spacing in the destination operand.
The following figure shows an example.
- A total of 3072 bytes of data needs to be processed. Each iteration processes 32 × 32 × 1 bytes of data, so 3 iterations are required to complete the processing. Therefore, repeatTime = 3.
- srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 32 x 32 x 1 bytes). The unit is the size of the square matrix after combination.
- dstGap = 1: The gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand is 1 (unit: 512 bytes).
- dstFracGap = 0: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 0 (unit: 512 bytes).
The following figure shows an example.
- The description of repeatTime and srcStride is the same as that in the preceding figure.
- dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
- dstFracGap = 2: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 2 (unit: 512 bytes).

For the half and bfloat16_t data types, each iteration processes 16 × 16 × 2 bytes of data, which corresponds to one tile (each tile is 512 bytes). In each iteration, a single 16 × 16 tile in the source operand is transposed.
- A total of 1536 bytes of data needs to be processed. Each iteration processes 16 × 16 × 2 bytes of data, so 3 iterations are required to complete the processing. Therefore, repeatTime = 3.
- srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 16 x 16 x 2 bytes).
- dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
- In this scenario, the fractal is a square matrix. Each iteration processes a fractal, and there is no interval between fractals in the iteration. Therefore, this parameter is invalid.

For the float, int32_t, and uint32_t data types, each iteration processes 16 × 16 × 4 bytes of data, which can be divided into 2 tiles (each tile is 512 bytes). In each iteration, two consecutive 16 × 8 tiles in the source operand are combined into one 16 × 16 square matrix. A transpose operation is then performed on this square matrix. After transposition, the matrix is split back into two 16 × 8 tiles. These tiles can be arranged in different layouts according to parameters such as the tile spacing in the destination operand.
The following figure shows an example.
- A total of 3072 bytes of data needs to be processed. Each iteration processes 16 × 16 × 4 bytes of data, so 3 iterations are required to complete the processing. Therefore, repeatTime = 3.
- srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 16 x 16 x 4 bytes). The unit is the size of the square matrix after combination.
- dstGap = 1: The gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand is 1 (unit: 512 bytes).
- dstFracGap = 0: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 0 (unit: 512 bytes).
The following figure shows an example.
- The description of repeatTime and srcStride is the same as that in the preceding figure.
- dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
- dstFracGap = 2: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 2 (unit: 512 bytes).
For the int4b_t data type, each iteration processes 64 × 64 × 0.5 bytes of data, which can be divided into 4 tiles (each tile is 512 bytes). In each iteration, four consecutive 16 × 64 tiles in the source operand are combined into one 64 × 64 square matrix. A transpose operation is then performed on this square matrix. After transposition, the matrix is split back into four 16 × 64 tiles. These tiles can be arranged in different layouts according to parameters such as the tile spacing in the destination operand.
The int4b_t data type requires two numbers to be combined into an number of type int8_t or uint8_t. The combination rule is as follows.
The following figure shows an example.
- A total of 6144 bytes of data needs to be processed. Each iteration processes 64 × 64 × 0.5 bytes of data, so 3 iterations are required to complete the processing. Therefore, repeatTime = 3.
- srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 64 x 64 x 0.5 bytes). The unit is the size of the square matrix after combination.
- dstGap = 1: The gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand is 1 (unit: 512 bytes).
- dstFracGap = 0: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 0 (unit: 512 bytes).
The following figure shows an example.
- The description of repeatTime and srcStride is the same as that in the preceding figure.
- dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
- dstFracGap = 2: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 2 (unit: 512 bytes).

Prototype

template <typename T>
__aicore__ inline void LoadDataWithTranspose(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData2dTransposeParams& loadDataParams)

Parameters

**Table 1** Template parameters
Parameter	Description
T	For Atlas A2 training products/Atlas A2 inference products, the supported data types are int4b_t, int8_t, uint8_t, half, bfloat16_t, float, int32_t, and uint32_t. For Atlas A3 training products/Atlas A3 inference products, the supported data types are int4b_t, int8_t, uint8_t, half, bfloat16_t, float, int32_t, and uint32_t. For the Atlas 200I/500 A2 inference products, the supported data types are int4b_t, uint8_t, int8_t, uint16_t, int16_t, half, bfloat16_t, uint32_t, int32_t, and float. The int4b_t type is supported only when TPosition of the LocalTensor is B2.

**Table 2** Parameters
Parameter	Input/Output	Meaning
dst	Output	Destination operand for the result matrix, which is of the LocalTensor type. For Atlas A2 training products/Atlas A2 inference products, TPosition can only be A2 or B2. For Atlas A3 training products/Atlas A3 inference products, TPosition can only be A2 or B2. For the Atlas 200I/500 A2 inference products, TPosition can only be A2 or B2. The start address of LocalTensor must be 512-byte aligned. The data type must be the same as that of src.
src	Input	Source operand, which is of the LocalTensor type. For Atlas A2 training products/Atlas A2 inference products, TPosition can only be A1 or B1. For Atlas A3 training products/Atlas A3 inference products, TPosition can only be A1 or B1. For the Atlas 200I/500 A2 inference products, TPosition can only be A1 or B1. The start address of LocalTensor must be 32-byte aligned. The data type must be the same as that of dst.
loadDataParams	Input	Parameters related to LoadDataWithTranspose, which is of the LoadData2dTransposeParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_mm.h. Replace *${INSTALL_DIR}* with the CANN installation path. For details about the parameter description, see Table 3.

**Table 3** Parameters in the LoadData2dTransposeParams structure
Parameter	Input/Output	Meaning
startIndex	Input	Tile matrix ID, which specifies the starting tile matrix of the source operand for data transfer, where 0 represents the first tile matrix. Value range: startIndex ∈ [0, 65535]. The default value is 0. For example, if the source operand has 20 fractals whose size is 16 x 8 x 4 bytes (data type: float) and startIndex is 1, the transfer start position is the second square matrix, that is, the third and fourth fractals are transposed from the source operand to the destination operand (the first and second fractals form the first square matrix, and the third and fourth fractals form the second square matrix).
repeatTimes	Input	Number of iterations. For the uint8_t/int8_t type, 32 x 32 x 1 bytes data is processed in each iteration. For the half/bfloat16_t type, 16 x 16 x 2 bytes data is processed in each iteration. For the float/int32_t/uint32_t type, 16 x 16 x 4 bytes data is processed in each iteration. For the int4b_t type, 16 x 64 x 0.5 bytes data is processed in each iteration. Value range: repeatTimes ∈ [0, 255]. The default value is 0.
srcStride	Input	In adjacent iterations, the interval between the start addresses of fractals of the source operand. The unit here is actually the size of the combined square matrix. For the uint8_t/int8_t type, the unit is 32 x 32 x 1 bytes. For the half/bfloat16_t type, the unit is 16 x 16 x 2 bytes. For the float/int32_t/uint32_t type, the unit is 16 x 16 x 4 bytes. For the int4b_t type, 16 x 64 x 0.5 bytes data is processed in each iteration. Value range: srcStride ∈ [0, 65535]. The default value is 0.
dstGap	Input	Gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand (unit: 512 bytes). Value range: dstGap ∈ [0, 65535]. The default value is 0.
dstFracGap	Input	Gap between the previous fractal end address and the next fractal start address for transposing the destination operand in each iteration. The unit is 512 bytes. This parameter is valid only when the data type is float/int32_t/uint32_t/uint8_t/int8_t/int4b_t. Value range: dstFracGap ∈ [0, 65535]. The default value is 0.
addrMode	Input	Reserved parameter for future use. You can use the default value.

Restrictions

If repeat is set to 0, no movement operation is performed.
Developers need to ensure that the fractals after the transposition of the target operand do not overlap.
For details about the operand address alignment requirements, see General Address Alignment Restrictions.

Example

Example 1: In this example, the input matrix a is of the int8_t type and the shape is [16,32], the input matrix b is of the int8_t type and the shape is [32,64], and the output matrix c is of the int32_t type. Matrix a is not transposed from A1 to A2, matrix b is transposed from B1 to B2, and then Mmad and Fixpipe are performed.

#include "kernel_operator.h"

template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelMatmul {
public:
    __aicore__ inline KernelMatmul()
    {
        aSize = m * k;
        bSize = k * n;
        cSize = m * n;
        nBlocks = n / 16;
    }
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
    {
        aGM.SetGlobalBuffer((__gm__ fmap_T *)a);
        bGM.SetGlobalBuffer((__gm__ weight_T *)b);
        cGM.SetGlobalBuffer((__gm__ dstCO1_T *)c);
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(fmap_T));
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(fmap_T));
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(weight_T));
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(weight_T));
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(dstCO1_T));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        SplitA();
        SplitB();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.AllocTensor<fmap_T>();
        AscendC::LocalTensor<weight_T> b1Local = inQueueB1.AllocTensor<weight_T>();

        AscendC::Nd2NzParams dataCopyA1Params;
        dataCopyA1Params.ndNum = 1;
        dataCopyA1Params.nValue = m;
        dataCopyA1Params.dValue = k;
        dataCopyA1Params.srcNdMatrixStride = 0;
        dataCopyA1Params.srcDValue = k;
        dataCopyA1Params.dstNzC0Stride = m;
        dataCopyA1Params.dstNzNStride = 1;
        dataCopyA1Params.dstNzMatrixStride = 0;
        AscendC::DataCopy(a1Local, aGM, dataCopyA1Params);

        AscendC::Nd2NzParams dataCopyB1Params;
        dataCopyB1Params.ndNum = 1;
        dataCopyB1Params.nValue = k;
        dataCopyB1Params.dValue = n;
        dataCopyB1Params.srcNdMatrixStride = 0;
        dataCopyB1Params.srcDValue = n;
        dataCopyB1Params.dstNzC0Stride = k;
        dataCopyB1Params.dstNzNStride = 1;
        dataCopyB1Params.dstNzMatrixStride = 0;
        AscendC::DataCopy(b1Local, bGM, dataCopyB1Params);

        inQueueA1.EnQue(a1Local);
        inQueueB1.EnQue(b1Local);
    }
    __aicore__ inline void SplitA()
    {
        AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.DeQue<fmap_T>();
        AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.AllocTensor<fmap_T>();

        AscendC::LoadData2DParams loadL0AParams;
        loadL0AParams.repeatTimes = aSize * sizeof(fmap_T) / 512;
        loadL0AParams.srcStride = 1;
        loadL0AParams.ifTranspose = false;
        AscendC::LoadData(a2Local, a1Local, loadL0AParams);

        inQueueA2.EnQue<fmap_T>(a2Local);
        inQueueA1.FreeTensor(a1Local);
    }
    __aicore__ inline void SplitB()
    {
        AscendC::LocalTensor<weight_T> b1Local = inQueueB1.DeQue<weight_T>();
        AscendC::LocalTensor<weight_T> b2Local = inQueueB2.AllocTensor<weight_T>();

        AscendC::LoadData2dTransposeParams loadDataParams;
        loadDataParams.startIndex = 0;
        nBlockSize = 32;
        loadDataParams.repeatTimes = n / nBlockSize;
        loadDataParams.srcStride = 1;
        loadDataParams.dstGap = 1;
        loadDataParams.dstFracGap = 0;
        AscendC::LoadDataWithTranspose(b2Local, b1Local, loadDataParams);

        inQueueB1.FreeTensor(b1Local);
        inQueueB2.EnQue<weight_T>(b2Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.DeQue<fmap_T>();
        AscendC::LocalTensor<weight_T> b2Local = inQueueB2.DeQue<weight_T>();
        AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.AllocTensor<dstCO1_T>();

        AscendC::MmadParams mmadParams;
        mmadParams.m = m;
        mmadParams.n = n;
        mmadParams.k = k;
        AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams);

        outQueueCO1.EnQue<dstCO1_T>(c1Local);
        inQueueA2.FreeTensor(a2Local);
        inQueueB2.FreeTensor(b2Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.DeQue<dstCO1_T>();
        AscendC::FixpipeParamsV220 fixpipeParams;
        fixpipeParams.nSize = n;
        fixpipeParams.mSize = m;
        fixpipeParams.srcStride = m;
        fixpipeParams.dstStride = n;

        fixpipeParams.ndNum = 1;
        fixpipeParams.srcNdStride = 0;
        fixpipeParams.dstNdStride = 0;
        AscendC::Fixpipe(cGM, c1Local, fixpipeParams);
        outQueueCO1.FreeTensor(c1Local);
    }

private:
    AscendC::TPipe pipe;

    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;
    AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;
    AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2;
    // dst queue
    AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;

    AscendC::GlobalTensor<fmap_T> aGM;
    AscendC::GlobalTensor<weight_T> bGM;
    AscendC::GlobalTensor<dst_T> cGM;

    uint16_t m = 16, k = 32, n = 64;
    uint8_t nBlockSize = 16;
    uint16_t c0Size = 16;
    uint16_t aSize, bSize, cSize, nBlocks;
};

extern "C" __global__ __aicore__ void cube_matmul_loaddata_operator_int8_t(__gm__ uint8_t *a, __gm__ uint8_t *b,
    __gm__ uint8_t *c)
{
    KernelMatmul<dst_type, fmap_type, weight_type, dstCO1_type> op;
    op.Init(a, b, c);
    op.Process();
}

Example 2: In this example, the input matrix a is of the half type and the shape is [16,32], the input matrix b is of the half type and the shape is [32,32], and the output matrix c is of the float type. Matrix a is not transposed from A1 to A2, matrix b is transposed from B1 to B2, and then Mmad and Fixpipe are performed.

#include "kernel_operator.h"

template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelMatmul {
public:
    __aicore__ inline KernelMatmul()
    {
        aSize = m * k;
        bSize = k * n;
        cSize = m * n;
        nBlocks = n / 16;
    }
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
    {
        aGM.SetGlobalBuffer((__gm__ fmap_T *)a);
        bGM.SetGlobalBuffer((__gm__ weight_T *)b);
        cGM.SetGlobalBuffer((__gm__ dstCO1_T *)c);
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(fmap_T));
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(fmap_T));
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(weight_T));
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(weight_T));
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(dstCO1_T));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        SplitA();
        SplitB();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.AllocTensor<fmap_T>();
        AscendC::LocalTensor<weight_T> b1Local = inQueueB1.AllocTensor<weight_T>();

        AscendC::Nd2NzParams dataCopyA1Params;
        dataCopyA1Params.ndNum = 1;
        dataCopyA1Params.nValue = m;
        dataCopyA1Params.dValue = k;
        dataCopyA1Params.srcNdMatrixStride = 0;
        dataCopyA1Params.srcDValue = k;
        dataCopyA1Params.dstNzC0Stride = m;
        dataCopyA1Params.dstNzNStride = 1;
        dataCopyA1Params.dstNzMatrixStride = 0;
        AscendC::DataCopy(a1Local, aGM, dataCopyA1Params);

        AscendC::Nd2NzParams dataCopyB1Params;
        dataCopyB1Params.ndNum = 1;
        dataCopyB1Params.nValue = k;
        dataCopyB1Params.dValue = n;
        dataCopyB1Params.srcNdMatrixStride = 0;
        dataCopyB1Params.srcDValue = n;
        dataCopyB1Params.dstNzC0Stride = k;
        dataCopyB1Params.dstNzNStride = 1;
        dataCopyB1Params.dstNzMatrixStride = 0;
        AscendC::DataCopy(b1Local, bGM, dataCopyB1Params);

        inQueueA1.EnQue(a1Local);
        inQueueB1.EnQue(b1Local);
    }
    __aicore__ inline void SplitA()
    {
        AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.DeQue<fmap_T>();
        AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.AllocTensor<fmap_T>();

        AscendC::LoadData2DParams loadL0AParams;
        loadL0AParams.repeatTimes = aSize * sizeof(fmap_T) / 512;
        loadL0AParams.srcStride = 1;
        loadL0AParams.ifTranspose = false;
        AscendC::LoadData(a2Local, a1Local, loadL0AParams);

        inQueueA2.EnQue<fmap_T>(a2Local);
        inQueueA1.FreeTensor(a1Local);
    }
    __aicore__ inline void SplitB()
    {
        AscendC::LocalTensor<weight_T> b1Local = inQueueB1.DeQue<weight_T>();
        AscendC::LocalTensor<weight_T> b2Local = inQueueB2.AllocTensor<weight_T>();

        AscendC::LoadData2dTransposeParams loadDataParams;
        loadDataParams.startIndex = 0;
        nBlockSize = 16;
        loadDataParams.repeatTimes = k / nBlockSize;
        loadDataParams.srcStride = 1;
        loadDataParams.dstGap = 1;
        for (int i = 0; i < (n / nBlockSize); ++i) {
            AscendC::LoadDataWithTranspose(b2Local[i * 16 * nBlockSize], b1Local[i * k * nBlockSize], loadDataParams);
        }

        inQueueB1.FreeTensor(b1Local);
        inQueueB2.EnQue<weight_T>(b2Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.DeQue<fmap_T>();
        AscendC::LocalTensor<weight_T> b2Local = inQueueB2.DeQue<weight_T>();
        AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.AllocTensor<dstCO1_T>();

        AscendC::MmadParams mmadParams;
        mmadParams.m = m;
        mmadParams.n = n;
        mmadParams.k = k;
        AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams);

        outQueueCO1.EnQue<dstCO1_T>(c1Local);
        inQueueA2.FreeTensor(a2Local);
        inQueueB2.FreeTensor(b2Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.DeQue<dstCO1_T>();
        AscendC::FixpipeParamsV220 fixpipeParams;
        fixpipeParams.nSize = n;
        fixpipeParams.mSize = m;
        fixpipeParams.srcStride = m;
        fixpipeParams.dstStride = n;

        fixpipeParams.ndNum = 1;
        fixpipeParams.srcNdStride = 0;
        fixpipeParams.dstNdStride = 0;
        AscendC::Fixpipe(cGM, c1Local, fixpipeParams);
        outQueueCO1.FreeTensor(c1Local);
    }

private:
    AscendC::TPipe pipe;

    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;
    AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;
    AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2;
    // dst queue
    AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;

    AscendC::GlobalTensor<fmap_T> aGM;
    AscendC::GlobalTensor<weight_T> bGM;
    AscendC::GlobalTensor<dst_T> cGM;

    uint16_t m = 16, k = 32, n = 32;
    uint8_t nBlockSize = 16;
    uint16_t c0Size = 16;
    uint16_t aSize, bSize, cSize, nBlocks;
};

extern "C" __global__ __aicore__ void cube_matmul_loaddata_operator_half(__gm__ uint8_t *a, __gm__ uint8_t *b,
    __gm__ uint8_t *c)
{
    KernelMatmul<dst_type, fmap_type, weight_type, dstCO1_type> op;
    op.Init(a, b, c);
    op.Process();
}

Example 3: In this example, the input matrix a is of the float type and the shape is [16,16], the input matrix b is of the float type and the shape is [16,32], and the output matrix c is of the float type. Matrix a is not transposed from A1 to A2, matrix b is transposed from B1 to B2, and then Mmad and Fixpipe are performed.

#include "kernel_operator.h"

template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelMatmul {
public:
    __aicore__ inline KernelMatmul()
    {
        aSize = m * k;
        bSize = k * n;
        cSize = m * n;
        nBlocks = n / 16;
    }
    __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
    {
        aGM.SetGlobalBuffer((__gm__ fmap_T *)a);
        bGM.SetGlobalBuffer((__gm__ weight_T *)b);
        cGM.SetGlobalBuffer((__gm__ dstCO1_T *)c);
        pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(fmap_T));
        pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(fmap_T));
        pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(weight_T));
        pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(weight_T));
        pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(dstCO1_T));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        SplitA();
        SplitB();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.AllocTensor<fmap_T>();
        AscendC::LocalTensor<weight_T> b1Local = inQueueB1.AllocTensor<weight_T>();

        AscendC::Nd2NzParams dataCopyA1Params;
        dataCopyA1Params.ndNum = 1;
        dataCopyA1Params.nValue = m;
        dataCopyA1Params.dValue = k;
        dataCopyA1Params.srcNdMatrixStride = 0;
        dataCopyA1Params.srcDValue = k;
        dataCopyA1Params.dstNzC0Stride = m;
        dataCopyA1Params.dstNzNStride = 1;
        dataCopyA1Params.dstNzMatrixStride = 0;
        AscendC::DataCopy(a1Local, aGM, dataCopyA1Params);

        AscendC::Nd2NzParams dataCopyB1Params;
        dataCopyB1Params.ndNum = 1;
        dataCopyB1Params.nValue = k;
        dataCopyB1Params.dValue = n;
        dataCopyB1Params.srcNdMatrixStride = 0;
        dataCopyB1Params.srcDValue = n;
        dataCopyB1Params.dstNzC0Stride = k;
        dataCopyB1Params.dstNzNStride = 1;
        dataCopyB1Params.dstNzMatrixStride = 0;
        AscendC::DataCopy(b1Local, bGM, dataCopyB1Params);

        inQueueA1.EnQue(a1Local);
        inQueueB1.EnQue(b1Local);
    }
    __aicore__ inline void SplitA()
    {
        AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.DeQue<fmap_T>();
        AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.AllocTensor<fmap_T>();

        AscendC::LoadData2DParams loadL0AParams;
        loadL0AParams.repeatTimes = aSize * sizeof(fmap_T) / 512;
        loadL0AParams.srcStride = 1;
        loadL0AParams.ifTranspose = false;
        AscendC::LoadData(a2Local, a1Local, loadL0AParams);

        inQueueA2.EnQue<fmap_T>(a2Local);
        inQueueA1.FreeTensor(a1Local);
    }
    __aicore__ inline void SplitB()
    {
        AscendC::LocalTensor<weight_T> b1Local = inQueueB1.DeQue<weight_T>();
        AscendC::LocalTensor<weight_T> b2Local = inQueueB2.AllocTensor<weight_T>();

        AscendC::LoadData2dTransposeParams loadDataParams;
        loadDataParams.startIndex = 0;
        nBlockSize = 16;
        loadDataParams.repeatTimes = n / nBlockSize;
        loadDataParams.srcStride = 1;
        loadDataParams.dstGap = 0;
        loadDataParams.dstFracGap = n / nBlockSize - 1;
        AscendC::LoadDataWithTranspose(b2Local, b1Local, loadDataParams);

        inQueueB1.FreeTensor(b1Local);
        inQueueB2.EnQue<weight_T>(b2Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.DeQue<fmap_T>();
        AscendC::LocalTensor<weight_T> b2Local = inQueueB2.DeQue<weight_T>();
        AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.AllocTensor<dstCO1_T>();

        AscendC::MmadParams mmadParams;
        mmadParams.m = m;
        mmadParams.n = n;
        mmadParams.k = k;
        AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams);

        outQueueCO1.EnQue<dstCO1_T>(c1Local);
        inQueueA2.FreeTensor(a2Local);
        inQueueB2.FreeTensor(b2Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.DeQue<dstCO1_T>();
        AscendC::FixpipeParamsV220 fixpipeParams;
        fixpipeParams.nSize = n;
        fixpipeParams.mSize = m;
        fixpipeParams.srcStride = m;
        fixpipeParams.dstStride = n;

        fixpipeParams.ndNum = 1;
        fixpipeParams.srcNdStride = 0;
        fixpipeParams.dstNdStride = 0;
        AscendC::Fixpipe(cGM, c1Local, fixpipeParams);
        outQueueCO1.FreeTensor(c1Local);
    }

private:
    AscendC::TPipe pipe;

    AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;
    AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;
    AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;
    AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2;
    // dst queue
    AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;

    AscendC::GlobalTensor<fmap_T> aGM;
    AscendC::GlobalTensor<weight_T> bGM;
    AscendC::GlobalTensor<dst_T> cGM;

    uint16_t m = 16, k = 16, n = 32;
    uint8_t nBlockSize = 16;
    uint16_t c0Size = 16;
    uint16_t aSize, bSize, cSize, nBlocks;
};

extern "C" __global__ __aicore__ void cube_matmul_loaddata_operator_float(__gm__ uint8_t *a, __gm__ uint8_t *b,
    __gm__ uint8_t *c)
{
    KernelMatmul<dst_type, fmap_type, weight_type, dstCO1_type> op;
    op.Init(a, b, c);
    op.Process();
}

Parent topic: Data Movement