Matmul High-level API Enabling MTE2 Preload

Case Study

This case shows how enabling MTE2 Preload in the high-level Matmul API improves the performance of a matrix multiplication operator. The doMTE2Preload parameter in MatmulConfig enables preloading in the M or N direction of the matrix, which means that the data of matrix A or B needed for the next computation is loaded in advance during the gaps in the MTE2 pipeline. Enabling preload shortens these MTE2 gaps and thereby improves operator performance. For details about the doMTE2Preload parameter, see MatmulConfig.

Applicable scenarios of enabling MTE2 Preload
The MTE2 pipeline gap is large, and the value of M or N is large.

Restrictions on enabling MTE2 Preload
- MTE2 Preload is valid only when the MDL template or SpecialMDL template is used.
- When preload is enabled in the M or N direction, all data in the K direction must be fully loaded and DoubleBuffer must be enabled in the corresponding direction:
  - K is fully loaded when singleK <= baseK x stepK.
  - DoubleBuffer is enabled in the M direction when depthA1 = stepM x stepK x 2.
  - DoubleBuffer is enabled in the N direction when depthB1 = stepN x stepK x 2.
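
The restrictions above can be written as compile-time checks. The following is a minimal sketch in plain C++, not part of the Ascend C API: the PreloadTiling struct and the helper names are hypothetical, and it assumes that stepK in the conditions means stepKa for matrix A and stepKb for matrix B. The sample values are the tiling parameters of this case.

```cpp
#include <cstdint>

// Hypothetical container for the tiling fields named in the restrictions.
// It is an illustration only and does not mirror any Ascend C structure.
struct PreloadTiling {
    std::uint32_t singleK;
    std::uint32_t baseK;
    std::uint32_t stepM;
    std::uint32_t stepN;
    std::uint32_t stepKa;   // assumed: stepK used for matrix A
    std::uint32_t stepKb;   // assumed: stepK used for matrix B
    std::uint32_t depthA1;
    std::uint32_t depthB1;
};

// K is fully loaded when singleK <= baseK x stepK (checked for both A and B).
constexpr bool IsKFullyLoaded(const PreloadTiling& t)
{
    return t.singleK <= t.baseK * t.stepKa && t.singleK <= t.baseK * t.stepKb;
}

// DoubleBuffer in the M direction: depthA1 = stepM x stepK x 2.
constexpr bool IsDoubleBufferM(const PreloadTiling& t)
{
    return t.depthA1 == t.stepM * t.stepKa * 2;
}

// DoubleBuffer in the N direction: depthB1 = stepN x stepK x 2.
constexpr bool IsDoubleBufferN(const PreloadTiling& t)
{
    return t.depthB1 == t.stepN * t.stepKb * 2;
}

constexpr bool CanPreloadM(const PreloadTiling& t)
{
    return IsKFullyLoaded(t) && IsDoubleBufferM(t);
}

constexpr bool CanPreloadN(const PreloadTiling& t)
{
    return IsKFullyLoaded(t) && IsDoubleBufferN(t);
}

// Tiling of this case: singleK = 512, baseK = 64, stepM = stepN = 1,
// stepKa = stepKb = 8, depthA1 = 8, depthB1 = 16.
constexpr PreloadTiling kCaseTiling{512, 64, 1, 1, 8, 8, 8, 16};

static_assert(IsKFullyLoaded(kCaseTiling), "512 <= 64 x 8");
static_assert(!CanPreloadM(kCaseTiling), "depthA1 = 8, but 1 x 8 x 2 = 16");
static_assert(CanPreloadN(kCaseTiling), "depthB1 = 16 = 1 x 8 x 2");
```

With these values only the N direction satisfies the restrictions, which is consistent with this case enabling preload in the N direction.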

The operator specifications are as follows.

**Table 1** Operator specifications
| Input | Shape | Data Type | Format |
| --- | --- | --- | --- |
| a | 128, 512 | float16 | ND |
| b | 512, 24576 | float16 | ND |

The AI processor used in this case has 24 cores. The operator uses the high-level Matmul API in CUBE_ONLY mode with the MDL template. The tiling parameters are as follows:

Original shape: M = 128, N = 24576, K = 512.
Single-core shape: singleCoreM = 128, singleCoreN = 1024, singleCoreK = 512.
Base block shape: baseM = 128, baseN = 128, baseK = 64.
Tiling parameters related to the L1 cache: stepM = 1, stepN = 1, stepKa = 8, stepKb = 8, depthA1 = 8, depthB1 = 16.
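
As a worked check of these numbers, the sketch below derives a few quantities from the tiling in plain C++. The constant names and the CeilDiv helper are hypothetical. The L1 byte counts rest on the assumption that depthA1 and depthB1 count base blocks (baseM x baseK and baseK x baseN) cached in L1, with 2 bytes per float16 element.

```cpp
#include <cstdint>

// Values copied from the tiling parameters of this case.
constexpr std::uint32_t kM = 128, kN = 24576;
constexpr std::uint32_t kSingleCoreM = 128, kSingleCoreN = 1024, kSingleCoreK = 512;
constexpr std::uint32_t kBaseM = 128, kBaseN = 128, kBaseK = 64;
constexpr std::uint32_t kStepKa = 8;
constexpr std::uint32_t kDepthA1 = 8, kDepthB1 = 16;
constexpr std::uint32_t kBytesPerElement = 2;  // float16

// Hypothetical helper: integer division rounded up.
constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b)
{
    return (a + b - 1) / b;
}

// The output is split only along N: 1 x 24 single-core blocks, one per core.
constexpr std::uint32_t kCoresUsed =
    CeilDiv(kM, kSingleCoreM) * CeilDiv(kN, kSingleCoreN);

// Each core walks 1 base block in M and 8 base blocks in N.
constexpr std::uint32_t kMBlocksPerCore = CeilDiv(kSingleCoreM, kBaseM);
constexpr std::uint32_t kNBlocksPerCore = CeilDiv(kSingleCoreN, kBaseN);

// K needs 8 base blocks, which equals stepKa, so one load covers all of K.
constexpr std::uint32_t kKBlocks = CeilDiv(kSingleCoreK, kBaseK);

// Assumed L1 footprint: depth x base block elements x bytes per element.
constexpr std::uint32_t kL1BytesA = kDepthA1 * kBaseM * kBaseK * kBytesPerElement;
constexpr std::uint32_t kL1BytesB = kDepthB1 * kBaseK * kBaseN * kBytesPerElement;

static_assert(kCoresUsed == 24, "all 24 cores are used");
static_assert(kMBlocksPerCore == 1 && kNBlocksPerCore == 8, "1 x 8 base blocks per core");
static_assert(kKBlocks == kStepKa, "K is covered by a single step group");
static_assert(kL1BytesA == 131072, "128 KiB for matrix A");
static_assert(kL1BytesB == 262144, "256 KiB for matrix B");
```

Under these assumptions each core loads its part of matrix A once and loads matrix B in 8 successive pieces, one per base block along N, so the MTE2 pipeline of each core consists of repeated B transfers.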

Obtaining Profile Data

Use the msProf tool to obtain the operator simulation pipeline diagram and the on-board profiling data, and focus on the Cube and Fixpipe pipelines.

Analyzing Main Bottlenecks

The following figure shows the pipeline before optimization. Because the M and K directions are fully loaded, matrix A is transferred only once. Because the value of N is large, matrix B is transferred multiple times, and there are visible gaps between successive MTE2 transfers.
The profile data before optimization is as follows. The average aic_time is 30.88 μs.

Optimization Solution

Enable MTE2 Preload by setting doMTE2Preload when creating the Matmul object. For a complete sample, see Matmul operator sample for preloading in the M and N directions. The procedure is as follows:

Set the doMTE2Preload parameter in the MDL template to 2 to enable the Preload function in the N direction.

        
    // preloadMode = 2
    static constexpr MatmulConfig MM_CFG = GetMDLConfig(false, false, preloadMode);

Create a Matmul object based on the customized MatmulConfig template parameters.

        
    AscendC::Matmul<AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, cType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, biasType>, MM_CFG> matmulObj;
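
The bare value 2 passed as doMTE2Preload can be given a name to make the intent clearer. The following plain C++ sketch is one way to do this. The PreloadMode enum, ToConfigValue, and ChoosePreloadMode are hypothetical helpers, not part of the Ascend C API. This case only confirms that 2 selects the N direction; the sketch assumes, following the MatmulConfig description, that 0 disables preload and 1 selects the M direction. Preferring N over M when both directions qualify is an arbitrary choice made for illustration.

```cpp
#include <cstdint>

// Hypothetical names for the doMTE2Preload values.
// Only kNDirection = 2 is confirmed by this case; 0 and 1 are assumptions.
enum class PreloadMode : std::uint32_t {
    kDisabled = 0,
    kMDirection = 1,
    kNDirection = 2
};

// Converts the named mode to the integer expected by the doMTE2Preload argument.
constexpr std::uint32_t ToConfigValue(PreloadMode mode)
{
    return static_cast<std::uint32_t>(mode);
}

// Picks a mode from the restriction checks: K must be fully loaded, and
// DoubleBuffer must be enabled in the chosen direction.
constexpr PreloadMode ChoosePreloadMode(bool kFullyLoaded, bool doubleBufferM, bool doubleBufferN)
{
    return !kFullyLoaded ? PreloadMode::kDisabled
         : doubleBufferN ? PreloadMode::kNDirection
         : doubleBufferM ? PreloadMode::kMDirection
                         : PreloadMode::kDisabled;
}

// This case: K fully loaded, DoubleBuffer only in the N direction.
constexpr std::uint32_t kPreloadMode = ToConfigValue(ChoosePreloadMode(true, false, true));
static_assert(kPreloadMode == 2, "matches the value used in this case");
```

In an operator, a value such as kPreloadMode would take the place of preloadMode in the GetMDLConfig call shown earlier.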

Verifying Optimization Benefits

The following figure shows the pipeline after optimization. The tiling parameters remain unchanged. It can be seen that the matrix B data used for the next computation is loaded in advance, and the MTE2 pipeline gap is shortened.
The following figure shows the profile data after optimization. The average aic_time is 28.50 μs, down from 30.88 μs before optimization, a reduction of about 7.7%.

Summary

When the MTE2 pipeline gap is large and the value of M or N is large, you can enable MTE2 Preload to load the data of matrix A or B in advance.
