Matmul Instructions

Ascend C provides a set of high-level Matmul APIs for quickly implementing matrix multiplication.

Matmul formula: C = A × B + Bias.

A and B are the source operands. A is a left matrix with shape [M, K], and B is a right matrix with shape [K, N].
C is the destination operand, which is a matrix that stores the matrix multiplication result. Its shape is [M, N].
Bias is the matrix multiplication bias, whose shape is [1, N]. The bias is added to every row of the A × B result matrix.

Figure 1 Matmul matrix multiplication

In the following description, the M-axis direction is the vertical direction of matrix A, the K-axis direction is the horizontal direction of matrix A or the vertical direction of matrix B, and the N-axis direction is the horizontal direction of matrix B. The last axis refers to the last dimension of a matrix.
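The formula and axis conventions above can be checked against a small host-side reference. The following is a minimal sketch in plain C++, not Ascend C kernel code; the function name MatmulReference and the row-major std::vector storage are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for C = A x B + Bias (plain C++, not Ascend C).
// A is row-major [M, K], B is row-major [K, N], bias holds N values ([1, N]).
std::vector<float> MatmulReference(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   const std::vector<float>& bias,
                                   std::size_t m, std::size_t k, std::size_t n)
{
    assert(a.size() == m * k && b.size() == k * n && bias.size() == n);
    std::vector<float> c(m * n);
    for (std::size_t row = 0; row < m; ++row) {        // M axis
        for (std::size_t col = 0; col < n; ++col) {    // N axis
            float acc = bias[col];                     // the same bias row is added to every row of A x B
            for (std::size_t i = 0; i < k; ++i) {      // K axis
                acc += a[row * k + i] * b[i * n + col];
            }
            c[row * n + col] = acc;
        }
    }
    return c;
}
```

For A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]], and Bias = [10, 20], the product A × B is [[19, 22], [43, 50]], so the function returns [[29, 42], [53, 70]].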

The procedure for implementing Matmul matrix multiplication on the kernel side is as follows:

Create a Matmul object.
Perform the initialization operation.
Set the left matrix A, right matrix B, and bias.
Execute the matrix multiplication operation.
End the matrix multiplication operation.

The following describes each step of using the Matmul APIs in detail:

Create a Matmul object.

The following is an example of creating a Matmul object:

By default, the MIX mode is used (including cube computation and vector computation). In this scenario, the ASCENDC_CUBE_ONLY macro is not set. If the ASCENDC_CUBE_ONLY macro is used in the program, the ASCEND_IS_AIC and ASCEND_IS_AIV macros must be used to isolate Cube computation from Vector computation.
In CUBE_ONLY mode (cube computation only), define the ASCENDC_CUBE_ONLY macro in the code to avoid extra performance overhead.

// In CUBE_ONLY mode, define this macro before #include "lib/matmul_intf.h".
// #define ASCENDC_CUBE_ONLY
#include "lib/matmul_intf.h"

typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half> aType; 
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, half> bType; 
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> cType; 
typedef AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, float> biasType; 
AscendC::Matmul<aType, bType, cType, biasType> mm;

When creating the object, pass the type information of matrix A, matrix B, matrix C, and the bias as template arguments. Each type is defined by MatmulType, which carries the logical memory location, data format, and data type.

template <AscendC::TPosition POSITION, CubeFormat FORMAT, typename TYPE, bool ISTRANS = false,
          LayoutMode LAYOUT = LayoutMode::NONE, bool IBSHARE = false>
struct MatmulType {
    constexpr static AscendC::TPosition pos = POSITION;
    constexpr static CubeFormat format = FORMAT;
    using T = TYPE;
    constexpr static bool isTrans = ISTRANS;
    constexpr static LayoutMode layout = LAYOUT;
    constexpr static bool ibShare = IBSHARE;
};
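To see how MatmulType carries its settings as compile-time constants, the struct can be reproduced outside the Ascend C toolchain. The following is a minimal sketch in standard C++; the mock namespace and its enum definitions are hypothetical stand-ins (the real enums come from the Ascend C headers and may contain other values), and the aliases CType and BTransType are illustrative names only.

```cpp
#include <cassert>
#include <type_traits>

// Standalone stand-ins for the Ascend C enums (hypothetical, for illustration only).
namespace mock {
enum class TPosition { GM, VECIN, VECOUT, TSCM, CO1 };
enum class CubeFormat { ND, NZ, ND_ALIGN, VECTOR };
enum class LayoutMode { NONE, NORMAL, BSNGD, SBNGD, BNGS1S2 };

// Same shape as the MatmulType definition in this section.
template <TPosition POSITION, CubeFormat FORMAT, typename TYPE, bool ISTRANS = false,
          LayoutMode LAYOUT = LayoutMode::NONE, bool IBSHARE = false>
struct MatmulType {
    constexpr static TPosition pos = POSITION;
    constexpr static CubeFormat format = FORMAT;
    using T = TYPE;
    constexpr static bool isTrans = ISTRANS;
    constexpr static LayoutMode layout = LAYOUT;
    constexpr static bool ibShare = IBSHARE;
};
}  // namespace mock

// Matrix C on GM in ND format with float data; the last three parameters use their defaults.
using CType = mock::MatmulType<mock::TPosition::GM, mock::CubeFormat::ND, float>;
// Matrix B on GM in NZ format with the transpose function enabled (ISTRANS = true).
using BTransType = mock::MatmulType<mock::TPosition::GM, mock::CubeFormat::NZ, float, true>;

static_assert(CType::pos == mock::TPosition::GM, "position is recorded in pos");
static_assert(std::is_same<CType::T, float>::value, "data type is recorded in T");
static_assert(!CType::isTrans && CType::layout == mock::LayoutMode::NONE && !CType::ibShare,
              "defaults: no transpose, LayoutMode::NONE, IBShare disabled");
static_assert(BTransType::isTrans, "ISTRANS = true is recorded in isTrans");
```

Because every setting is a static constant or a type alias, the choice of position, format, and data type costs nothing at run time; it only selects which code the Matmul template instantiates.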

**Table 1** **MatmulType** parameters
Parameter	Description
POSITION	Logical memory location.
Atlas A3 training products / Atlas A3 inference products: For matrix A, this parameter can be set to TPosition::GM, TPosition::VECOUT, or TPosition::TSCM. For matrix B, this parameter can be set to TPosition::GM, TPosition::VECOUT, or TPosition::TSCM. For the bias, this parameter can be set to TPosition::GM or TPosition::VECOUT. For matrix C, this parameter can be set to TPosition::GM, TPosition::VECIN, or TPosition::CO1.
Atlas A2 training products / Atlas A2 inference products: For matrix A, this parameter can be set to TPosition::GM, TPosition::VECOUT, or TPosition::TSCM. For matrix B, this parameter can be set to TPosition::GM, TPosition::VECOUT, or TPosition::TSCM. For the bias, this parameter can be set to TPosition::GM or TPosition::VECOUT. For matrix C, this parameter can be set to TPosition::GM, TPosition::VECIN, or TPosition::CO1.
Note: When matrix C is set to TPosition::CO1, the data format of matrix C can only be CubeFormat::NZ, and the data type of matrix C can only be float or int32_t.
Atlas inference product's AI Core: For matrix A, this parameter can be set to TPosition::GM or TPosition::VECOUT. For matrix B, this parameter can be set to TPosition::GM or TPosition::VECOUT. For the bias, this parameter can be set to TPosition::GM or TPosition::VECOUT. For matrix C, this parameter can be set to TPosition::GM or TPosition::VECIN.
Atlas 200I/500 A2 inference products: For matrix A, matrix B, the bias, and matrix C, this parameter can be set to TPosition::GM.
FORMAT	Physical layout format of data. For details, see Data Format.
Atlas A3 training products / Atlas A3 inference products: For matrix A, this parameter can be set to CubeFormat::ND, CubeFormat::NZ, or CubeFormat::VECTOR. For matrix B, this parameter can be set to CubeFormat::ND or CubeFormat::NZ. For the bias, this parameter can be set to CubeFormat::ND. For matrix C, this parameter can be set to CubeFormat::ND, CubeFormat::NZ, or CubeFormat::ND_ALIGN.
Atlas A2 training products / Atlas A2 inference products: For matrix A, this parameter can be set to CubeFormat::ND, CubeFormat::NZ, or CubeFormat::VECTOR. For matrix B, this parameter can be set to CubeFormat::ND or CubeFormat::NZ. For the bias, this parameter can be set to CubeFormat::ND. For matrix C, this parameter can be set to CubeFormat::ND, CubeFormat::NZ, or CubeFormat::ND_ALIGN.
Atlas inference product's AI Core: For matrix A, this parameter can be set to CubeFormat::ND or CubeFormat::NZ. For matrix B, this parameter can be set to CubeFormat::ND or CubeFormat::NZ. For the bias, this parameter can be set to CubeFormat::ND. For matrix C, this parameter can be set to CubeFormat::ND, CubeFormat::NZ, or CubeFormat::ND_ALIGN. Note: When this parameter is set to CubeFormat::ND for matrix C, the last axis must be 32-byte aligned. For example, if the data type is half, N must be a multiple of 16.
Atlas 200I/500 A2 inference products: For matrix A, this parameter can be set to CubeFormat::ND or CubeFormat::NZ. For matrix B, this parameter can be set to CubeFormat::ND or CubeFormat::NZ. For the bias, this parameter can be set to CubeFormat::ND. For matrix C, this parameter can be set to CubeFormat::ND or CubeFormat::NZ. Note: When POSITION is set to TPosition::VECIN or TPosition::TSCM and FORMAT is set to CubeFormat::ND for matrix C, the last axis must be 32-byte aligned. For example, if the data type is half, N must be a multiple of 16. When POSITION is set to TPosition::VECIN or TPosition::TSCM and FORMAT is set to CubeFormat::NZ for matrix C, N must be a multiple of 16.
For details about the alignment restrictions on matrix A, matrix B, and matrix C in CubeFormat::NZ format, see Table 3.
TYPE	Data type.
Atlas A3 training products / Atlas A3 inference products: For matrix A, this parameter can be set to half, float, bfloat16_t, int8_t, or int4b_t. For matrix B, this parameter can be set to half, float, bfloat16_t, int8_t, or int4b_t. For the bias, this parameter can be set to half, float, or int32_t. For matrix C, this parameter can be set to half, float, bfloat16_t, int32_t, or int8_t.
Atlas A2 training products / Atlas A2 inference products: For matrix A, this parameter can be set to half, float, bfloat16_t, int8_t, or int4b_t. For matrix B, this parameter can be set to half, float, bfloat16_t, int8_t, or int4b_t. For the bias, this parameter can be set to half, float, or int32_t. For matrix C, this parameter can be set to half, float, bfloat16_t, int32_t, or int8_t.
Atlas inference product's AI Core: For matrix A, this parameter can be set to half or int8_t. For matrix B, this parameter can be set to half or int8_t. For the bias, this parameter can be set to float or int32_t. For matrix C, this parameter can be set to half, float, int8_t, or int32_t.
Atlas 200I/500 A2 inference products: For matrix A, this parameter can be set to half, float, bfloat16_t, or int8_t. For matrix B, this parameter can be set to half, float, bfloat16_t, or int8_t. For the bias, this parameter can be set to half, float, or int32_t. For matrix C, this parameter can be set to half, float, bfloat16_t, or int32_t.
Note: The data types of matrices A and B must be the same, except that matrix B may be int8_t when matrix A is of another type. For details about the data type combinations, see Table 2.
When the data types of matrices A and B are int4b_t, the number of elements along the inner axis of the matrix must be even. For example, if matrix A is of the int4b_t data type and is not transposed, singleCoreK must be an even number. For a usage example of the int4b_t data type, see sample of the Matmul operator with int4 inputs.
ISTRANS	Whether to enable the matrix transpose function.
true: The matrix transpose function is enabled. In this case, isTransposeA and isTransposeB in SetTensorA and SetTensorB are used to set whether to transpose matrix A and matrix B, respectively. If matrix A and matrix B are transposed, Matmul considers that the shape of matrix A is [K, M] and that of matrix B is [N, K].
false (default value): The matrix transpose function is disabled. In this case, SetTensorA and SetTensorB cannot be used to transpose matrix A and matrix B, and Matmul considers that the shape of matrix A is [M, K] and that of matrix B is [K, N].
Note: The matrix data on L1 Buffer must be fractal-aligned, so the L1 memory usage varies depending on whether matrix A and matrix B are transposed. When the matrix transpose function is enabled, ensure that the L1 memory requested according to the Matmul Tiling parameters does not exceed the capacity of L1 Buffer. The required L1 memory can be estimated using the following condition:
(depthA1 × Ceil(baseM/c0Size) × baseK + depthB1 × Ceil(baseN/c0Size) × baseK) × db × sizeof(dtype) < L1Size
where db indicates whether double buffer is enabled for L1. The value can be 1 (double buffer disabled) or 2 (double buffer enabled). For details about other parameters, see Table 1.
LAYOUT	Data layout format.
NONE (default): BatchMatmul is not used. Other options indicate that BatchMatmul is used.
NORMAL: BMNK data layout format. For details, see the description of data layout in IterateBatch.
BSNGD: data layout after reshaping is performed on the original BSH shape. For details, see the description of data layout in IterateBatch.
SBNGD: data layout after reshaping is performed on the original SBH shape. For details, see the description of data layout in IterateBatch.
BNGS1S2: matrix multiplication output of the first two data layouts. S1S2 data is stored contiguously, and each S1S2 block holds the data computed for one batch. For details, see the description of data layouts in IterateBatch.
IBSHARE	Whether to enable IBShare (IntraBlock Share). IBShare allows the same matrix A or matrix B data on L1 Buffer to be reused. The reused matrix must be fully loaded on L1 Buffer.
IBShare can be enabled for either matrix A or matrix B and is used together with the IBShare template. For details about parameter settings, see Table 2.
If IBShare is enabled for both matrix A and matrix B, matrix A and matrix B on L1 Buffer are reused at the same time. The following conditions must be met:
IBShare must also be enabled for matrix A and matrix B of other Matmul objects in the same operator.
Atlas A2 training products / Atlas A2 inference products and Atlas A3 training products / Atlas A3 inference products: Only the IterateAll API can be called to obtain the matrix calculation result, and the result can be output only to GlobalTensor. That is, the calculation result is stored at an address in global memory.
Product support: This parameter is supported for Atlas A3 training products / Atlas A3 inference products and Atlas A2 training products / Atlas A2 inference products. It is not supported for the Atlas inference product's AI Core or Atlas 200I/500 A2 inference products.
For details about how to use this parameter, see MatmulABshare sample, sample of enabling IBShare for matrices A and B, and sample of enabling IBShare for matrix B only.
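The L1 memory condition in the ISTRANS row can be evaluated on the host before choosing Tiling parameters. The following is a minimal sketch in plain C++ that implements the expression exactly as it is written in the table; the struct L1Params, the function names, and the example values are hypothetical, and the actual L1 Buffer size must be taken from the specifications of the target product.

```cpp
#include <cassert>
#include <cstdint>

// Ceil(x / y) for positive integers.
constexpr std::uint64_t CeilDiv(std::uint64_t x, std::uint64_t y)
{
    return (x + y - 1) / y;
}

// Inputs of the estimate (hypothetical container; the field names follow the formula).
struct L1Params {
    std::uint64_t depthA1;
    std::uint64_t depthB1;
    std::uint64_t baseM;
    std::uint64_t baseN;
    std::uint64_t baseK;
    std::uint64_t c0Size;
    std::uint64_t db;          // 1: double buffer disabled, 2: double buffer enabled
    std::uint64_t dtypeBytes;  // sizeof(dtype)
};

// (depthA1 * Ceil(baseM/c0Size) * baseK + depthB1 * Ceil(baseN/c0Size) * baseK) * db * sizeof(dtype)
constexpr std::uint64_t EstimateL1Bytes(const L1Params& p)
{
    return (p.depthA1 * CeilDiv(p.baseM, p.c0Size) * p.baseK +
            p.depthB1 * CeilDiv(p.baseN, p.c0Size) * p.baseK) * p.db * p.dtypeBytes;
}

// The condition from the table: the estimate must be strictly less than L1Size.
constexpr bool FitsInL1(const L1Params& p, std::uint64_t l1Size)
{
    return EstimateL1Bytes(p) < l1Size;
}

// Illustrative values only: half data (2 bytes, c0Size 16) with double buffer enabled.
constexpr L1Params kExample{4, 4, 128, 256, 64, 16, 2, 2};
static_assert(EstimateL1Bytes(kExample) == 24576, "(4*8*64 + 4*16*64) * 2 * 2");
```

The sketch takes sizeof(dtype) as a whole number of bytes, so it does not cover int4b_t.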

**Table 2** Supported Matmul input and output data types
Matrix A	Matrix B	Bias	Matrix C	Supported Platform
float	float	float/half	float	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products; Atlas 200I/500 A2 inference products
half	half	float	float	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products; Atlas inference product's AI Core; Atlas 200I/500 A2 inference products
half	half	half	float	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products; Atlas 200I/500 A2 inference products
int8_t	int8_t	int32_t	int32_t/half	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products; Atlas inference product's AI Core; Atlas 200I/500 A2 inference products
int4b_t	int4b_t	int32_t	int32_t/half	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products
bfloat16_t	bfloat16_t	float	float	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products; Atlas 200I/500 A2 inference products
bfloat16_t	bfloat16_t	half	float	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products
half	half	float	int8_t	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products
bfloat16_t	bfloat16_t	float	int8_t	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products
int8_t	int8_t	int32_t	int8_t	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products; Atlas inference product's AI Core
half	half	float	half	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products; Atlas inference product's AI Core; Atlas 200I/500 A2 inference products
half	half	half	half	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products; Atlas 200I/500 A2 inference products
bfloat16_t	bfloat16_t	float	bfloat16_t	Atlas A3 training products / Atlas A3 inference products; Atlas A2 training products / Atlas A2 inference products; Atlas 200I/500 A2 inference products
half	int8_t	float	float	Atlas inference product's AI Core
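A combination from Table 2 can be validated on the host before instantiating a Matmul object. The following is a minimal sketch in plain C++ that transcribes only the rows listing Atlas A2 / Atlas A3 products, with one entry per bias or matrix C alternative; the enum DType, the struct Combo, and the function IsSupportedOnA2A3 are hypothetical names and not part of the Ascend C API.

```cpp
#include <cassert>

// Hypothetical tags for the data types that appear in Table 2.
enum class DType { Float, Half, BFloat16, Int8, Int4, Int32 };

struct Combo {
    DType a;
    DType b;
    DType bias;
    DType c;
};

// Rows of Table 2 that list Atlas A2 / Atlas A3 products.
constexpr Combo kA2A3Combos[] = {
    {DType::Float, DType::Float, DType::Float, DType::Float},
    {DType::Float, DType::Float, DType::Half, DType::Float},
    {DType::Half, DType::Half, DType::Float, DType::Float},
    {DType::Half, DType::Half, DType::Half, DType::Float},
    {DType::Int8, DType::Int8, DType::Int32, DType::Int32},
    {DType::Int8, DType::Int8, DType::Int32, DType::Half},
    {DType::Int4, DType::Int4, DType::Int32, DType::Int32},
    {DType::Int4, DType::Int4, DType::Int32, DType::Half},
    {DType::BFloat16, DType::BFloat16, DType::Float, DType::Float},
    {DType::BFloat16, DType::BFloat16, DType::Half, DType::Float},
    {DType::Half, DType::Half, DType::Float, DType::Int8},
    {DType::BFloat16, DType::BFloat16, DType::Float, DType::Int8},
    {DType::Int8, DType::Int8, DType::Int32, DType::Int8},
    {DType::Half, DType::Half, DType::Float, DType::Half},
    {DType::Half, DType::Half, DType::Half, DType::Half},
    {DType::BFloat16, DType::BFloat16, DType::Float, DType::BFloat16},
};

// Returns true if (A, B, bias, C) matches one of the transcribed rows.
bool IsSupportedOnA2A3(DType a, DType b, DType bias, DType c)
{
    for (const Combo& row : kA2A3Combos) {
        if (row.a == a && row.b == b && row.bias == bias && row.c == c) {
            return true;
        }
    }
    return false;
}
```

The half / int8_t / float / float row is deliberately absent because Table 2 lists it only for the Atlas inference product's AI Core.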

Perform the initialization operation.

// Initialize the Matmul object. For details about the parameters, see section REGIST_MATMUL_OBJ.
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), mm, &tiling);

Set the left matrix A, right matrix B, and bias.

mm.SetTensorA(gm_a);    // Set the left matrix A.
mm.SetTensorB(gm_b);    // Set the right matrix B.
mm.SetBias(gm_bias);    // Set the bias.

// For the Atlas inference product's AI Core, call the SetLocalWorkspace API to set the UB space required for calculation.
mm.SetLocalWorkspace(usedUbBufLen);

Execute the matrix multiplication operation.

You can select one of the following three calling methods:

Call Iterate to perform one iteration of the computation, and use a while loop to compute all the data on a single core. Iterate gives you flexible control over how many iterations run, and therefore over how much data is computed.

// The API internally determines the loop end conditions.
while (mm.Iterate()) {
    mm.GetTensorC(gm_c);
}

Call IterateAll to compute all data on a single core. IterateAll requires no loop and is simpler to use.

mm.IterateAll(gm_c);

Allocate a logical buffer in CO1 yourself to hold the matrix multiplication results, and call Iterate one or more times to perform one or more rounds of computation. When the results need to be written out, call the Fixpipe API to move them out of CO1. After the transfer is completed, the CO1 memory can be released. This method lets you control when computation and data movement happen: each result can be moved out immediately after it is computed, or several results can be accumulated in CO1 and moved out together.

In this calling mode, when creating a Matmul object, you must define the logical memory location of matrix C as TPosition::CO1, the data layout format as CubeFormat::NZ, and the data type as float or int32_t.

Atlas inference product's AI Core does not support this mode.
Atlas 200I/500 A2 inference products do not support this mode.

// Define the type information of matrix C.
typedef AscendC::MatmulType<AscendC::TPosition::CO1, CubeFormat::NZ, float> cType;
// Create a Matmul object.
AscendC::Matmul<aType, bType, cType, biasType> mm; 

// The user pre-allocates CO1 memory as an l0cTensor.
TQue<TPosition::CO1, 1> CO1_;
// 128 × 1024 is the size of the allocated CO1 memory.
GetTPipePtr()->InitBuffer(CO1_, 1, 128 * 1024);
// L0cT is the data type of matrix C.
// If the data type of matrix A is int8_t or int4b_t, the data type of matrix C is int32_t.
// If the data type of matrix A is half, float, or bfloat16_t, the data type of matrix C is float.
LocalTensor<L0cT> l0cTensor = CO1_.template AllocTensor<L0cT>();

// Pass l0cTensor as the input parameter to Iterate, and output the matrix multiplication result to l0cTensor allocated by the user.
mm.Iterate(false, l0cTensor);

// Call the Fixpipe API to transfer the computation result from CO1 to GM.
FixpipeParamsV220 params;
params.nSize = nSize;
params.mSize = mSize;
params.srcStride = srcStride;
params.dstStride = dstStride;
CO1_.EnQue(l0cTensor);
CO1_.template DeQue<L0cT>();
Fixpipe<cType, L0cT, CFG_ROW_MAJOR>(gm[dstOffset], l0cTensor, params);

// Release the CO1 memory.
CO1_.FreeTensor(l0cTensor);

End the matrix multiplication operation.

mm.End();
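The relationship between the Iterate loop and IterateAll in the execution step can be pictured with a host-side mock. The following is a minimal sketch in plain C++, not the Ascend C implementation: the class MockMatmul, its constructor parameters, the helper CountIterations, and the row-by-row block order are all assumptions made for illustration. Each Iterate call selects the next baseM × baseN output block, and GetTensorC marks the cells of that block as written.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mock that walks the output matrix one base block per Iterate call.
class MockMatmul {
public:
    MockMatmul(std::size_t m, std::size_t n, std::size_t baseM, std::size_t baseN)
        : m_(m), n_(n), baseM_(baseM), baseN_(baseN),
          blocksN_((n + baseN - 1) / baseN),
          total_(((m + baseM - 1) / baseM) * blocksN_) {}

    // Selects the next block; returns false once every block has been visited.
    bool Iterate()
    {
        if (next_ >= total_) {
            return false;
        }
        current_ = next_++;
        return true;
    }

    // "Writes" the current block by incrementing each covered cell of c (row-major [m, n]).
    void GetTensorC(std::vector<int>& c) const
    {
        const std::size_t row0 = (current_ / blocksN_) * baseM_;
        const std::size_t col0 = (current_ % blocksN_) * baseN_;
        for (std::size_t r = row0; r < row0 + baseM_ && r < m_; ++r) {
            for (std::size_t col = col0; col < col0 + baseN_ && col < n_; ++col) {
                ++c[r * n_ + col];
            }
        }
    }

    // Equivalent to running the Iterate loop until it ends.
    void IterateAll(std::vector<int>& c)
    {
        while (Iterate()) {
            GetTensorC(c);
        }
    }

private:
    std::size_t m_, n_, baseM_, baseN_, blocksN_, total_;
    std::size_t next_ = 0;
    std::size_t current_ = 0;
};

// Runs the explicit loop and reports how many iterations it took.
std::size_t CountIterations(MockMatmul& mm, std::vector<int>& c)
{
    std::size_t count = 0;
    while (mm.Iterate()) {
        mm.GetTensorC(c);
        ++count;
    }
    return count;
}
```

With a 5 × 7 output and 2 × 4 base blocks, the loop runs Ceil(5/2) × Ceil(7/4) = 6 times and every output cell is written exactly once, which is the same result that IterateAll produces in a single call.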

**Table 3** Alignment requirements for matrices in CubeFormat::NZ format
Source/Destination Operand	Outer Axis	Inner Axis
Matrix A/Matrix B	Multiple of 16	Multiple of C0_size
Matrix C	Multiple of 16	Multiple of 16
Matrix C (channel_split enabled)	Multiple of 16	Multiple of C0_size
Matrix C (channel_split disabled)	Multiple of 16	float/int32_t: multiple of 16; half/bfloat16_t/int8_t: multiple of C0_size
Note 1: The value of C0_size depends on the data type. For float or int32_t, C0_size is 8. For half or bfloat16_t, C0_size is 16. For int8_t, C0_size is 32. For int4b_t, C0_size is 64.
Note 2: The channel_split function is configured using the isEnableChannelSplit parameter in MatmulConfig. For details, see MatmulConfig.
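The C0_size values in Note 1 and the matrix A / matrix B row of Table 3 can be turned into a small host-side check. The following is a minimal sketch in plain C++; the enum ElemType and the functions C0Size, IsNzAlignedAB, and AlignUp are hypothetical helper names, and the matrix C rows are not covered because they additionally depend on the channel_split setting.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tags for the data types listed in Note 1.
enum class ElemType { Float, Int32, Half, BFloat16, Int8, Int4 };

// C0_size per data type, as listed in Note 1 of Table 3.
constexpr std::size_t C0Size(ElemType t)
{
    return (t == ElemType::Float || t == ElemType::Int32) ? 8
         : (t == ElemType::Half || t == ElemType::BFloat16) ? 16
         : (t == ElemType::Int8) ? 32
         : 64;  // ElemType::Int4
}

// Matrix A / matrix B in CubeFormat::NZ: outer axis multiple of 16, inner axis multiple of C0_size.
constexpr bool IsNzAlignedAB(std::size_t outer, std::size_t inner, ElemType t)
{
    return outer % 16 == 0 && inner % C0Size(t) == 0;
}

// Rounds value up to the next multiple of align, for example to pad an axis.
constexpr std::size_t AlignUp(std::size_t value, std::size_t align)
{
    return (value + align - 1) / align * align;
}

static_assert(C0Size(ElemType::Half) == 16, "half: C0_size is 16");
static_assert(AlignUp(24, C0Size(ElemType::Half)) == 32, "24 half elements pad to 32");
```

For example, an inner axis of 24 elements satisfies the requirement for float (24 is a multiple of 8) but not for half, where it would need to be padded to 32.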

Header File to Be Included

#include "lib/matmul/matmul_intf.h"

Implementation Principle

The following uses input matrix A (GM, ND, half), input matrix B (GM, ND, half), and output matrix C (GM, ND, float), without bias, as an example. (GM, ND, half) indicates that data is stored on GM, the data format is ND, and the data type is half. The following figure shows the internal algorithm of the high-level Matmul APIs.

Figure 2 Matmul algorithm diagram

The computation process is as follows:

Move data from GM to A1: DataCopy moves a (stepM × baseM) × (stepKa × baseK) matrix block a1 from matrix A each time until all of matrix A has been moved. Then, move data from GM to B1: DataCopy moves a (stepKb × baseK) × (stepN × baseN) matrix block b1 from matrix B each time until all of matrix B has been moved.
Move data from A1 to A2: LoadData moves a baseM × baseK matrix block a0 from a1 each time. Move data from B1 to B2 with transposition: LoadData moves a baseK × baseN matrix block from b1 each time and transposes it into a baseN × baseK matrix block b0.
Perform matrix multiplication: Each a0 × b0 block computation produces a baseM × baseN matrix block co1.
Migrate data from matrix block co1 to co2: DataCopy migrates a baseM × baseN matrix block co1 to a singleCoreM × singleCoreN matrix block co2 each time.
Repeat steps 2 to 4 to compute matrix block a1 × b1.
Migrate data from matrix block co2 to matrix block C: DataCopy migrates a singleCoreM × singleCoreN matrix block co2 to matrix block C each time.
Repeat steps 1 to 6 to complete the computation: matrix A × B = C.

Note: For the meanings of parameters such as stepM and baseM, see Tiling parameters.
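The blocking idea behind the steps above can be reproduced on the host. The following is a minimal sketch in plain C++ under simplifying assumptions: it splits the computation into baseM × baseN output blocks and baseK-sized slices of the K axis, accumulates every slice of one output block in a local co1 buffer, and then copies that block to C. It does not model the a1/b1 staging controlled by stepM, stepN, stepKa, and stepKb, the transposition of b0, or the separate co2 buffer, and the function name BlockedMatmul is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side blocked C = A x B (no bias). A is row-major [m, k], B is row-major [k, n].
std::vector<float> BlockedMatmul(const std::vector<float>& a, const std::vector<float>& b,
                                 std::size_t m, std::size_t k, std::size_t n,
                                 std::size_t baseM, std::size_t baseN, std::size_t baseK)
{
    assert(a.size() == m * k && b.size() == k * n);
    assert(baseM > 0 && baseN > 0 && baseK > 0);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t m0 = 0; m0 < m; m0 += baseM) {
        const std::size_t mEnd = std::min(m0 + baseM, m);
        for (std::size_t n0 = 0; n0 < n; n0 += baseN) {
            const std::size_t nEnd = std::min(n0 + baseN, n);
            const std::size_t blockN = nEnd - n0;
            // Accumulator for one output block (plays the role of co1 in this sketch).
            std::vector<float> co1((mEnd - m0) * blockN, 0.0f);
            for (std::size_t k0 = 0; k0 < k; k0 += baseK) {
                const std::size_t kEnd = std::min(k0 + baseK, k);
                // One a0 x b0 product: rows [m0, mEnd), columns [n0, nEnd), K slice [k0, kEnd).
                for (std::size_t i = m0; i < mEnd; ++i) {
                    for (std::size_t j = n0; j < nEnd; ++j) {
                        for (std::size_t p = k0; p < kEnd; ++p) {
                            co1[(i - m0) * blockN + (j - n0)] += a[i * k + p] * b[p * n + j];
                        }
                    }
                }
            }
            // Copy the finished block into its place in C.
            for (std::size_t i = m0; i < mEnd; ++i) {
                for (std::size_t j = n0; j < nEnd; ++j) {
                    c[i * n + j] = co1[(i - m0) * blockN + (j - n0)];
                }
            }
        }
    }
    return c;
}
```

The result does not depend on the block sizes: for A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], both 1 × 1 × 1 blocks and a single 2 × 2 × 2 block give [[19, 22], [43, 50]].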
