LoadDataWithTranspose

Supported Products

Product

Supported/Unsupported

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

Atlas inference product's AI Core

x

Atlas inference product's Vector Core

x

Atlas training products

x

Function Usage

Loads 2D data with transposing from A1/B1 to A2/B2.

The following uses an example to describe the API functions and key parameters. In the following figure, an N shape or a Z shape represents a fractal.

  • For the uint8_t/int8_t type, 32 x 32 x 1 bytes data is processed in each iteration. Two fractals (512 bytes each) can be processed. In each iteration, two consecutive 16 x 32 fractals in the source operand are combined into one 32 x 32 square matrix. Transpose is performed based on the square matrix, and then the transposed matrix is split into two 16 x 32 fractals. There may be different arrangements according to parameters such as the fractal interval of the destination operand.
    The following figure shows an example.
    • A total of 3072 bytes of data needs to be processed. 32 x 32 x 1 byte data needs to be processed in each iteration, and three iterations are required. repeatTime = 3;
    • srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 32 x 32 x 1 bytes). The unit is the size of the square matrix after combination.
    • dstGap = 1: The gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand is 1 (unit: 512 bytes).
    • dstFracGap = 0: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 0 (unit: 512 bytes).

    The following figure shows an example.

    • The description of repeatTime and srcStride is the same as that in the preceding figure.
    • dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
    • dstFracGap = 2: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 2 (unit: 512 bytes).

  • For the half/bfloat16_t type, 16 x 16 x 2 bytes data is processed in each iteration. One fractal (512 bytes each) can be processed. In each iteration, one 16 x 16 fractal in the source operand is transposed.
    • A total of 1536 bytes of data needs to be processed. 16 x 16 x 2 bytes data needs to be processed in each iteration, and three iterations are required. repeatTime = 3.
    • srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 16 x 16 x 2 bytes).
    • dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
    • In this scenario, the fractal is a square matrix. Each iteration processes a fractal, and there is no interval between fractals in the iteration. Therefore, this parameter is invalid.

  • For the float/int32_t/uint32_t type, 16 x 16 x 4 bytes data is processed in each iteration. Two fractals (512 bytes each) can be processed. In each iteration, two consecutive 16 x 8 fractals in the source operand are combined into one 16 x 16 square matrix. Transpose is performed based on the square matrix, and then the transposed matrix is split into two 16 x 8 fractals. There may be different arrangements according to parameters such as the fractal interval of the destination operand.
    The following figure shows an example.
    • A total of 3072 bytes of data needs to be processed. 16 x 16 x 4 bytes data needs to be processed in each iteration, and three iterations are required. repeatTime = 3;
    • srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 16 x 16 x 4 bytes). The unit is the size of the square matrix after combination.
    • dstGap = 1: The gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand is 1 (unit: 512 bytes).
    • dstFracGap = 0: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 0 (unit: 512 bytes).

    The following figure shows an example.

    • The description of repeatTime and srcStride is the same as that in the preceding figure.
    • dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
    • dstFracGap = 2: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 2 (unit: 512 bytes).

  • For the int4b_t type, 64 x 64 x 0.5 bytes data is processed in each iteration. Four fractals (512 bytes each) can be processed. In each iteration, four consecutive 16 x 64 fractals in the source operand are combined into one 64 x 64 square matrix. Transpose is performed based on the square matrix, and then the transposed matrix is split into four 16 x 64 fractals. There may be different arrangements according to parameters such as the fractal interval of the destination operand.

    The int4b_t data type requires two numbers to be combined into an number of type int8_t or uint8_t. The combination rule is as follows.

    The following figure shows an example.
    • A total of 6144 bytes of data needs to be processed. 64 x 64 x 0.5 bytes of data needs to be processed in each iteration, and three iterations are required. repeatTime = 3.
    • srcStride = 1: The stride between the start addresses of two adjacent square matrices of the source operand in adjacent iterations is 1 (unit: 64 x 64 x 0.5 bytes). The unit is the size of the square matrix after combination.
    • dstGap = 1: The gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand is 1 (unit: 512 bytes).
    • dstFracGap = 0: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 0 (unit: 512 bytes).

    The following figure shows an example.

    • The description of repeatTime and srcStride is the same as that in the preceding figure.
    • dstGap = 0: There is no gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand.
    • dstFracGap = 2: The gap between the end address of the previous fractal and the start address of the next fractal of the destination operand in each iteration is 2 (unit: 512 bytes).

Prototype

1
2
template <typename T>
__aicore__ inline void LoadDataWithTranspose(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LoadData2dTransposeParams& loadDataParams)

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Atlas A2 training products/Atlas A2 inference products: The supported data type is int4b_t/int8_t/uint8_t/half/bfloat16_t/float/int32_t/uint32_t.

Atlas A3 training products/Atlas A3 inference products: The supported data types are int4b_t, int8_t, uint8_t, half, bfloat16_t, float, int32_t, and uint32_t.

Atlas 200I/500 A2 inference products: The supported data types are int4b_t, uint8_t, int8_t, uint16_t, int16_t, half, bfloat16_t, uint32_t, int32_t, and float.

The int4b_t type is supported only when TPosition of LocalTensor is B2.

Table 2 Parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand for the result matrix, which is of the LocalTensor type.

For the Atlas A2 training products/Atlas A2 inference products, the supported TPosition is A2 or B2.

For the Atlas A3 training products/Atlas A3 inference products, the supported TPosition is A2 or B2.

For the Atlas 200I/500 A2 inference products, the supported TPosition is A2 or B2.

The start address of LocalTensor must be 512-byte aligned.

The data type must be the same as that of src.

src

Input

Source operand of the LocalTensor type.

For the Atlas A2 training products/Atlas A2 inference products, the supported TPosition is A1 or B1.

For the Atlas A3 training products/Atlas A3 inference products, the supported TPosition is A1 or B1.

For the Atlas 200I/500 A2 inference products, the supported TPosition is A1 or B1.

The start address of LocalTensor must be 32-byte aligned.

The data type must be the same as that of dst.

loadDataParams

Input

Parameters related to LoadDataWithTranspose, of the LoadData2dTransposeParams type.

For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_mm.h. Replace ${INSTALL_DIR} with the actual CANN component directory.

For details about the parameter description, see Table 3.

Table 3 Parameters in the LoadData2dTransposeParams structure

Parameter

Input/Output

Meaning

startIndex

Input

Square matrix ID, indicating the sequence number of the square matrix of the source operand from which the movement starts (0 indicates the first square matrix of the source operand). Value range: startIndex ∈ [0, 65535]. The default value is 0.

For example, if the source operand has 20 fractals whose size is 16 x 8 x 4 bytes (data type: float) and startIndex is 1, the transfer start position is the second square matrix, that is, the third and fourth fractals are transposed from the source operand to the destination operand (the first and second fractals form the first square matrix, and the third and fourth fractals form the second square matrix).

repeatTimes

Input

Number of iterations.

For the uint8_t/int8_t type, 32 x 32 x 1 bytes data is processed in each iteration.

For the half/bfloat16_t type, 16 x 16 x 2 bytes data is processed in each iteration.

For the float/int32_t/uint32_t type, 16 x 16 x 4 bytes data is processed in each iteration.

For the int4b_t type, 16 x 64 x 0.5 bytes data is processed in each iteration.

Value range: repeatTimes ∈ [0, 255]. The default value is 0.

srcStride

Input

In adjacent iterations, the interval between the start addresses of fractals of the source operand. The unit here is actually the size of the combined square matrix.

For the uint8_t/int8_t type, the unit is 32 x 32 x 1 bytes.

For the half/bfloat16_t type, the unit is 16 x 16 x 2 bytes.

For the float/int32_t/uint32_t type, the unit is 16 x 16 x 4 bytes.

For the int4b_t type, 16 x 64 x 0.5 bytes data is processed in each iteration.

Value range: srcStride ∈ [0, 65535]. The default value is 0.

dstGap

Input

Gap between the end address of the first fractal in the previous iteration and the start address of the first fractal in the next iteration of the destination operand (unit: 512 bytes). Value range: dstGap ∈ [0, 65535]. The default value is 0.

dstFracGap

Input

Gap between the previous fractal end address and the next fractal start address for transposing the destination operand in each iteration. The unit is 512 bytes. This parameter is valid only when the data type is float/int32_t/uint32_t/uint8_t/int8_t/int4b_t. Value range: dstFracGap ∈ [0, 65535]. The default value is 0.

addrMode

Input

Reserved. It is reserved for subsequent functions. You can use the default value for now.

Restrictions

  • If repeat is set to 0, no movement operation is performed.
  • Developers need to ensure that the fractals after the transposition of the target operand do not overlap.
  • For details about the operand address alignment requirements, see General Address Alignment Restrictions.

Examples

  • Example 1: In this example, the input matrix a is of the int8_t type and the shape is [16,32], the input matrix b is of the int8_t type and the shape is [32,64], and the output matrix c is of the int32_t type. Matrix a is not transposed from A1 to A2, matrix b is transposed from B1 to B2, and then Mmad and Fixpipe are performed.
      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
     11
     12
     13
     14
     15
     16
     17
     18
     19
     20
     21
     22
     23
     24
     25
     26
     27
     28
     29
     30
     31
     32
     33
     34
     35
     36
     37
     38
     39
     40
     41
     42
     43
     44
     45
     46
     47
     48
     49
     50
     51
     52
     53
     54
     55
     56
     57
     58
     59
     60
     61
     62
     63
     64
     65
     66
     67
     68
     69
     70
     71
     72
     73
     74
     75
     76
     77
     78
     79
     80
     81
     82
     83
     84
     85
     86
     87
     88
     89
     90
     91
     92
     93
     94
     95
     96
     97
     98
     99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125
    126
    127
    128
    129
    130
    131
    132
    133
    134
    135
    136
    137
    138
    139
    140
    141
    142
    143
    144
    145
    146
    147
    148
    149
    150
    151
    152
    #include "kernel_operator.h"
    
    template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelMatmul {
    public:
        __aicore__ inline KernelMatmul()
        {
            aSize = m * k;
            bSize = k * n;
            cSize = m * n;
            nBlocks = n / 16;
        }
        __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
        {
            aGM.SetGlobalBuffer((__gm__ fmap_T *)a);
            bGM.SetGlobalBuffer((__gm__ weight_T *)b);
            cGM.SetGlobalBuffer((__gm__ dstCO1_T *)c);
            pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(fmap_T));
            pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(fmap_T));
            pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(weight_T));
            pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(weight_T));
            pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(dstCO1_T));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            SplitA();
            SplitB();
            Compute();
            CopyOut();
        }
    
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.AllocTensor<fmap_T>();
            AscendC::LocalTensor<weight_T> b1Local = inQueueB1.AllocTensor<weight_T>();
    
            AscendC::Nd2NzParams dataCopyA1Params;
            dataCopyA1Params.ndNum = 1;
            dataCopyA1Params.nValue = m;
            dataCopyA1Params.dValue = k;
            dataCopyA1Params.srcNdMatrixStride = 0;
            dataCopyA1Params.srcDValue = k;
            dataCopyA1Params.dstNzC0Stride = m;
            dataCopyA1Params.dstNzNStride = 1;
            dataCopyA1Params.dstNzMatrixStride = 0;
            AscendC::DataCopy(a1Local, aGM, dataCopyA1Params);
    
            AscendC::Nd2NzParams dataCopyB1Params;
            dataCopyB1Params.ndNum = 1;
            dataCopyB1Params.nValue = k;
            dataCopyB1Params.dValue = n;
            dataCopyB1Params.srcNdMatrixStride = 0;
            dataCopyB1Params.srcDValue = n;
            dataCopyB1Params.dstNzC0Stride = k;
            dataCopyB1Params.dstNzNStride = 1;
            dataCopyB1Params.dstNzMatrixStride = 0;
            AscendC::DataCopy(b1Local, bGM, dataCopyB1Params);
    
            inQueueA1.EnQue(a1Local);
            inQueueB1.EnQue(b1Local);
        }
        __aicore__ inline void SplitA()
        {
            AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.DeQue<fmap_T>();
            AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.AllocTensor<fmap_T>();
    
            AscendC::LoadData2DParams loadL0AParams;
            loadL0AParams.repeatTimes = aSize * sizeof(fmap_T) / 512;
            loadL0AParams.srcStride = 1;
            loadL0AParams.ifTranspose = false;
            AscendC::LoadData(a2Local, a1Local, loadL0AParams);
    
            inQueueA2.EnQue<fmap_T>(a2Local);
            inQueueA1.FreeTensor(a1Local);
        }
        __aicore__ inline void SplitB()
        {
            AscendC::LocalTensor<weight_T> b1Local = inQueueB1.DeQue<weight_T>();
            AscendC::LocalTensor<weight_T> b2Local = inQueueB2.AllocTensor<weight_T>();
    
            AscendC::LoadData2dTransposeParams loadDataParams;
            loadDataParams.startIndex = 0;
            nBlockSize = 32;
            loadDataParams.repeatTimes = n / nBlockSize;
            loadDataParams.srcStride = 1;
            loadDataParams.dstGap = 1;
            loadDataParams.dstFracGap = 0;
            AscendC::LoadDataWithTranspose(b2Local, b1Local, loadDataParams);
    
            inQueueB1.FreeTensor(b1Local);
            inQueueB2.EnQue<weight_T>(b2Local);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.DeQue<fmap_T>();
            AscendC::LocalTensor<weight_T> b2Local = inQueueB2.DeQue<weight_T>();
            AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.AllocTensor<dstCO1_T>();
    
            AscendC::MmadParams mmadParams;
            mmadParams.m = m;
            mmadParams.n = n;
            mmadParams.k = k;
            AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams);
    
            outQueueCO1.EnQue<dstCO1_T>(c1Local);
            inQueueA2.FreeTensor(a2Local);
            inQueueB2.FreeTensor(b2Local);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.DeQue<dstCO1_T>();
            AscendC::FixpipeParamsV220 fixpipeParams;
            fixpipeParams.nSize = n;
            fixpipeParams.mSize = m;
            fixpipeParams.srcStride = m;
            fixpipeParams.dstStride = n;
    
            fixpipeParams.ndNum = 1;
            fixpipeParams.srcNdStride = 0;
            fixpipeParams.dstNdStride = 0;
            AscendC::Fixpipe(cGM, c1Local, fixpipeParams);
            outQueueCO1.FreeTensor(c1Local);
        }
    
    private:
        AscendC::TPipe pipe;
    
        AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;
        AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;
        AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;
        AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2;
        // dst queue
        AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
    
        AscendC::GlobalTensor<fmap_T> aGM;
        AscendC::GlobalTensor<weight_T> bGM;
        AscendC::GlobalTensor<dst_T> cGM;
    
        uint16_t m = 16, k = 32, n = 64;
        uint8_t nBlockSize = 16;
        uint16_t c0Size = 16;
        uint16_t aSize, bSize, cSize, nBlocks;
    };
    
    extern "C" __global__ __aicore__ void cube_matmul_loaddata_operator_int8_t(__gm__ uint8_t *a, __gm__ uint8_t *b,
        __gm__ uint8_t *c)
    {
        KernelMatmul<dst_type, fmap_type, weight_type, dstCO1_type> op;
        op.Init(a, b, c);
        op.Process();
    }
    
  • Example 2: In this example, the input matrix a is of the half type and the shape is [16,32], the input matrix b is of the half type and the shape is [32,32], and the output matrix c is of the float type. Matrix a is not transposed from A1 to A2, matrix b is transposed from B1 to B2, and then Mmad and Fixpipe are performed.
      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
     11
     12
     13
     14
     15
     16
     17
     18
     19
     20
     21
     22
     23
     24
     25
     26
     27
     28
     29
     30
     31
     32
     33
     34
     35
     36
     37
     38
     39
     40
     41
     42
     43
     44
     45
     46
     47
     48
     49
     50
     51
     52
     53
     54
     55
     56
     57
     58
     59
     60
     61
     62
     63
     64
     65
     66
     67
     68
     69
     70
     71
     72
     73
     74
     75
     76
     77
     78
     79
     80
     81
     82
     83
     84
     85
     86
     87
     88
     89
     90
     91
     92
     93
     94
     95
     96
     97
     98
     99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125
    126
    127
    128
    129
    130
    131
    132
    133
    134
    135
    136
    137
    138
    139
    140
    141
    142
    143
    144
    145
    146
    147
    148
    149
    150
    151
    152
    153
    #include "kernel_operator.h"
    
    template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelMatmul {
    public:
        __aicore__ inline KernelMatmul()
        {
            aSize = m * k;
            bSize = k * n;
            cSize = m * n;
            nBlocks = n / 16;
        }
        __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
        {
            aGM.SetGlobalBuffer((__gm__ fmap_T *)a);
            bGM.SetGlobalBuffer((__gm__ weight_T *)b);
            cGM.SetGlobalBuffer((__gm__ dstCO1_T *)c);
            pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(fmap_T));
            pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(fmap_T));
            pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(weight_T));
            pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(weight_T));
            pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(dstCO1_T));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            SplitA();
            SplitB();
            Compute();
            CopyOut();
        }
    
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.AllocTensor<fmap_T>();
            AscendC::LocalTensor<weight_T> b1Local = inQueueB1.AllocTensor<weight_T>();
    
            AscendC::Nd2NzParams dataCopyA1Params;
            dataCopyA1Params.ndNum = 1;
            dataCopyA1Params.nValue = m;
            dataCopyA1Params.dValue = k;
            dataCopyA1Params.srcNdMatrixStride = 0;
            dataCopyA1Params.srcDValue = k;
            dataCopyA1Params.dstNzC0Stride = m;
            dataCopyA1Params.dstNzNStride = 1;
            dataCopyA1Params.dstNzMatrixStride = 0;
            AscendC::DataCopy(a1Local, aGM, dataCopyA1Params);
    
            AscendC::Nd2NzParams dataCopyB1Params;
            dataCopyB1Params.ndNum = 1;
            dataCopyB1Params.nValue = k;
            dataCopyB1Params.dValue = n;
            dataCopyB1Params.srcNdMatrixStride = 0;
            dataCopyB1Params.srcDValue = n;
            dataCopyB1Params.dstNzC0Stride = k;
            dataCopyB1Params.dstNzNStride = 1;
            dataCopyB1Params.dstNzMatrixStride = 0;
            AscendC::DataCopy(b1Local, bGM, dataCopyB1Params);
    
            inQueueA1.EnQue(a1Local);
            inQueueB1.EnQue(b1Local);
        }
        __aicore__ inline void SplitA()
        {
            AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.DeQue<fmap_T>();
            AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.AllocTensor<fmap_T>();
    
            AscendC::LoadData2DParams loadL0AParams;
            loadL0AParams.repeatTimes = aSize * sizeof(fmap_T) / 512;
            loadL0AParams.srcStride = 1;
            loadL0AParams.ifTranspose = false;
            AscendC::LoadData(a2Local, a1Local, loadL0AParams);
    
            inQueueA2.EnQue<fmap_T>(a2Local);
            inQueueA1.FreeTensor(a1Local);
        }
        __aicore__ inline void SplitB()
        {
            AscendC::LocalTensor<weight_T> b1Local = inQueueB1.DeQue<weight_T>();
            AscendC::LocalTensor<weight_T> b2Local = inQueueB2.AllocTensor<weight_T>();
    
            AscendC::LoadData2dTransposeParams loadDataParams;
            loadDataParams.startIndex = 0;
            nBlockSize = 16;
            loadDataParams.repeatTimes = k / nBlockSize;
            loadDataParams.srcStride = 1;
            loadDataParams.dstGap = 1;
            for (int i = 0; i < (n / nBlockSize); ++i) {
                AscendC::LoadDataWithTranspose(b2Local[i * 16 * nBlockSize], b1Local[i * k * nBlockSize], loadDataParams);
            }
    
            inQueueB1.FreeTensor(b1Local);
            inQueueB2.EnQue<weight_T>(b2Local);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.DeQue<fmap_T>();
            AscendC::LocalTensor<weight_T> b2Local = inQueueB2.DeQue<weight_T>();
            AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.AllocTensor<dstCO1_T>();
    
            AscendC::MmadParams mmadParams;
            mmadParams.m = m;
            mmadParams.n = n;
            mmadParams.k = k;
            AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams);
    
            outQueueCO1.EnQue<dstCO1_T>(c1Local);
            inQueueA2.FreeTensor(a2Local);
            inQueueB2.FreeTensor(b2Local);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.DeQue<dstCO1_T>();
            AscendC::FixpipeParamsV220 fixpipeParams;
            fixpipeParams.nSize = n;
            fixpipeParams.mSize = m;
            fixpipeParams.srcStride = m;
            fixpipeParams.dstStride = n;
    
            fixpipeParams.ndNum = 1;
            fixpipeParams.srcNdStride = 0;
            fixpipeParams.dstNdStride = 0;
            AscendC::Fixpipe(cGM, c1Local, fixpipeParams);
            outQueueCO1.FreeTensor(c1Local);
        }
    
    private:
        AscendC::TPipe pipe;
    
        AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;
        AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;
        AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;
        AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2;
        // dst queue
        AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
    
        AscendC::GlobalTensor<fmap_T> aGM;
        AscendC::GlobalTensor<weight_T> bGM;
        AscendC::GlobalTensor<dst_T> cGM;
    
        uint16_t m = 16, k = 32, n = 32;
        uint8_t nBlockSize = 16;
        uint16_t c0Size = 16;
        uint16_t aSize, bSize, cSize, nBlocks;
    };
    
    extern "C" __global__ __aicore__ void cube_matmul_loaddata_operator_half(__gm__ uint8_t *a, __gm__ uint8_t *b,
        __gm__ uint8_t *c)
    {
        KernelMatmul<dst_type, fmap_type, weight_type, dstCO1_type> op;
        op.Init(a, b, c);
        op.Process();
    }
    
  • Example 3: In this example, the input matrix a is of the float type and the shape is [16,16], the input matrix b is of the float type and the shape is [16,32], and the output matrix c is of the float type. Matrix a is not transposed from A1 to A2, matrix b is transposed from B1 to B2, and then Mmad and Fixpipe are performed.
      1
      2
      3
      4
      5
      6
      7
      8
      9
     10
     11
     12
     13
     14
     15
     16
     17
     18
     19
     20
     21
     22
     23
     24
     25
     26
     27
     28
     29
     30
     31
     32
     33
     34
     35
     36
     37
     38
     39
     40
     41
     42
     43
     44
     45
     46
     47
     48
     49
     50
     51
     52
     53
     54
     55
     56
     57
     58
     59
     60
     61
     62
     63
     64
     65
     66
     67
     68
     69
     70
     71
     72
     73
     74
     75
     76
     77
     78
     79
     80
     81
     82
     83
     84
     85
     86
     87
     88
     89
     90
     91
     92
     93
     94
     95
     96
     97
     98
     99
    100
    101
    102
    103
    104
    105
    106
    107
    108
    109
    110
    111
    112
    113
    114
    115
    116
    117
    118
    119
    120
    121
    122
    123
    124
    125
    126
    127
    128
    129
    130
    131
    132
    133
    134
    135
    136
    137
    138
    139
    140
    141
    142
    143
    144
    145
    146
    147
    148
    149
    150
    151
    152
    #include "kernel_operator.h"
    
    template <typename dst_T, typename fmap_T, typename weight_T, typename dstCO1_T> class KernelMatmul {
    public:
        __aicore__ inline KernelMatmul()
        {
            aSize = m * k;
            bSize = k * n;
            cSize = m * n;
            nBlocks = n / 16;
        }
        __aicore__ inline void Init(__gm__ uint8_t *a, __gm__ uint8_t *b, __gm__ uint8_t *c)
        {
            aGM.SetGlobalBuffer((__gm__ fmap_T *)a);
            bGM.SetGlobalBuffer((__gm__ weight_T *)b);
            cGM.SetGlobalBuffer((__gm__ dstCO1_T *)c);
            pipe.InitBuffer(inQueueA1, 1, aSize * sizeof(fmap_T));
            pipe.InitBuffer(inQueueA2, 1, aSize * sizeof(fmap_T));
            pipe.InitBuffer(inQueueB1, 1, bSize * sizeof(weight_T));
            pipe.InitBuffer(inQueueB2, 2, bSize * sizeof(weight_T));
            pipe.InitBuffer(outQueueCO1, 1, cSize * sizeof(dstCO1_T));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            SplitA();
            SplitB();
            Compute();
            CopyOut();
        }
    
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.AllocTensor<fmap_T>();
            AscendC::LocalTensor<weight_T> b1Local = inQueueB1.AllocTensor<weight_T>();
    
            AscendC::Nd2NzParams dataCopyA1Params;
            dataCopyA1Params.ndNum = 1;
            dataCopyA1Params.nValue = m;
            dataCopyA1Params.dValue = k;
            dataCopyA1Params.srcNdMatrixStride = 0;
            dataCopyA1Params.srcDValue = k;
            dataCopyA1Params.dstNzC0Stride = m;
            dataCopyA1Params.dstNzNStride = 1;
            dataCopyA1Params.dstNzMatrixStride = 0;
            AscendC::DataCopy(a1Local, aGM, dataCopyA1Params);
    
            AscendC::Nd2NzParams dataCopyB1Params;
            dataCopyB1Params.ndNum = 1;
            dataCopyB1Params.nValue = k;
            dataCopyB1Params.dValue = n;
            dataCopyB1Params.srcNdMatrixStride = 0;
            dataCopyB1Params.srcDValue = n;
            dataCopyB1Params.dstNzC0Stride = k;
            dataCopyB1Params.dstNzNStride = 1;
            dataCopyB1Params.dstNzMatrixStride = 0;
            AscendC::DataCopy(b1Local, bGM, dataCopyB1Params);
    
            inQueueA1.EnQue(a1Local);
            inQueueB1.EnQue(b1Local);
        }
        __aicore__ inline void SplitA()
        {
            AscendC::LocalTensor<fmap_T> a1Local = inQueueA1.DeQue<fmap_T>();
            AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.AllocTensor<fmap_T>();
    
            AscendC::LoadData2DParams loadL0AParams;
            loadL0AParams.repeatTimes = aSize * sizeof(fmap_T) / 512;
            loadL0AParams.srcStride = 1;
            loadL0AParams.ifTranspose = false;
            AscendC::LoadData(a2Local, a1Local, loadL0AParams);
    
            inQueueA2.EnQue<fmap_T>(a2Local);
            inQueueA1.FreeTensor(a1Local);
        }
        __aicore__ inline void SplitB()
        {
            AscendC::LocalTensor<weight_T> b1Local = inQueueB1.DeQue<weight_T>();
            AscendC::LocalTensor<weight_T> b2Local = inQueueB2.AllocTensor<weight_T>();
    
            AscendC::LoadData2dTransposeParams loadDataParams;
            loadDataParams.startIndex = 0;
            nBlockSize = 16;
            loadDataParams.repeatTimes = n / nBlockSize;
            loadDataParams.srcStride = 1;
            loadDataParams.dstGap = 0;
            loadDataParams.dstFracGap = n / nBlockSize - 1;
            AscendC::LoadDataWithTranspose(b2Local, b1Local, loadDataParams);
    
            inQueueB1.FreeTensor(b1Local);
            inQueueB2.EnQue<weight_T>(b2Local);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<fmap_T> a2Local = inQueueA2.DeQue<fmap_T>();
            AscendC::LocalTensor<weight_T> b2Local = inQueueB2.DeQue<weight_T>();
            AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.AllocTensor<dstCO1_T>();
    
            AscendC::MmadParams mmadParams;
            mmadParams.m = m;
            mmadParams.n = n;
            mmadParams.k = k;
            AscendC::Mmad(c1Local, a2Local, b2Local, mmadParams);
    
            outQueueCO1.EnQue<dstCO1_T>(c1Local);
            inQueueA2.FreeTensor(a2Local);
            inQueueB2.FreeTensor(b2Local);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<dstCO1_T> c1Local = outQueueCO1.DeQue<dstCO1_T>();
            AscendC::FixpipeParamsV220 fixpipeParams;
            fixpipeParams.nSize = n;
            fixpipeParams.mSize = m;
            fixpipeParams.srcStride = m;
            fixpipeParams.dstStride = n;
    
            fixpipeParams.ndNum = 1;
            fixpipeParams.srcNdStride = 0;
            fixpipeParams.dstNdStride = 0;
            AscendC::Fixpipe(cGM, c1Local, fixpipeParams);
            outQueueCO1.FreeTensor(c1Local);
        }
    
    private:
        AscendC::TPipe pipe;
    
        AscendC::TQue<AscendC::TPosition::A1, 1> inQueueA1;
        AscendC::TQue<AscendC::TPosition::A2, 1> inQueueA2;
        AscendC::TQue<AscendC::TPosition::B1, 1> inQueueB1;
        AscendC::TQue<AscendC::TPosition::B2, 1> inQueueB2;
        // dst queue
        AscendC::TQue<AscendC::TPosition::CO1, 1> outQueueCO1;
    
        AscendC::GlobalTensor<fmap_T> aGM;
        AscendC::GlobalTensor<weight_T> bGM;
        AscendC::GlobalTensor<dst_T> cGM;
    
        uint16_t m = 16, k = 16, n = 32;
        uint8_t nBlockSize = 16;
        uint16_t c0Size = 16;
        uint16_t aSize, bSize, cSize, nBlocks;
    };
    
    extern "C" __global__ __aicore__ void cube_matmul_loaddata_operator_float(__gm__ uint8_t *a, __gm__ uint8_t *b,
        __gm__ uint8_t *c)
    {
        KernelMatmul<dst_type, fmap_type, weight_type, dstCO1_type> op;
        op.Init(a, b, c);
        op.Process();
    }