Enabling MTE2 Preload Using the Matmul High-Level API

Case Study

This case shows how to enable MTE2 preload with the Matmul high-level API to improve the performance of a matrix multiplication operator. The doMTE2Preload parameter in MatmulConfig enables preloading in the M or N direction: the data of matrix A or B is loaded in advance during gaps in the MTE2 pipeline. With preloading enabled, the MTE2 gaps shrink and operator performance improves. For details about the doMTE2Preload parameter, see MatmulConfig.

  • Application scenarios of enabling MTE2 preload

    The MTE2 pipeline gap is large, and the value of M or N is large.

  • Restrictions on enabling MTE2 preload
    • MTE2 preload is valid only when the MDL and SpecialMDL templates are used.
    • When the preloading function is enabled in the M or N direction, ensure that all data in the K direction is fully loaded and DoubleBuffer is enabled in the M or N direction.
    • The condition for fully loading data in the K direction is that singleK <= baseK x stepK.
    • The condition for enabling DoubleBuffer in the M direction is that depthA1 = stepM x stepK x 2.
    • The condition for enabling DoubleBuffer in the N direction is that depthB1 = stepN x stepK x 2.
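
The restrictions above can be expressed as compile-time checks. The following is a minimal sketch and not part of the Matmul API: the PreloadTilingSketch struct and the helper functions are hypothetical names introduced for illustration. It assumes that singleK refers to singleCoreK and that stepK refers to stepKa for matrix A and stepKb for matrix B. The sample values are the tiling parameters used in this case.

```cpp
#include <cstdint>

// Hypothetical holder for the tiling fields that the preload restrictions mention.
struct PreloadTilingSketch {
    uint32_t singleCoreK;
    uint32_t baseK;
    uint32_t stepM;
    uint32_t stepN;
    uint32_t stepKa;
    uint32_t stepKb;
    uint32_t depthA1;
    uint32_t depthB1;
};

// K is fully loaded when singleK <= baseK x stepK.
constexpr bool IsKFullyLoaded(const PreloadTilingSketch& t, uint32_t stepK)
{
    return t.singleCoreK <= t.baseK * stepK;
}

// M direction: K fully loaded for A, and depthA1 = stepM x stepK x 2.
constexpr bool CanPreloadM(const PreloadTilingSketch& t)
{
    return IsKFullyLoaded(t, t.stepKa) && t.depthA1 == t.stepM * t.stepKa * 2;
}

// N direction: K fully loaded for B, and depthB1 = stepN x stepK x 2.
constexpr bool CanPreloadN(const PreloadTilingSketch& t)
{
    return IsKFullyLoaded(t, t.stepKb) && t.depthB1 == t.stepN * t.stepKb * 2;
}

// Tiling of this case, in field order: singleCoreK, baseK, stepM, stepN,
// stepKa, stepKb, depthA1, depthB1.
constexpr PreloadTilingSketch kCaseTiling = {512, 64, 1, 1, 8, 8, 8, 16};

static_assert(CanPreloadN(kCaseTiling), "N-direction conditions hold for this tiling");
static_assert(!CanPreloadM(kCaseTiling), "depthA1 is not doubled, so M does not qualify");
```

With the values in this case, only the N-direction check passes, which is consistent with the N-direction preload used in the optimization.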

The operator specifications are as follows.

Table 1 Operator specifications

Input    Shape         Data type    Format
a        128, 512      float16      ND
b        512, 24576    float16      ND

The AI processor used in this case has 24 cores, and the pure Cube mode of the Matmul high-level API is enabled in the operator. The MDL template is used. The tiling parameters are as follows:

  • Original shape: M = 128, N = 24576, K = 512.
  • Single-core shape: singleCoreM = 128, singleCoreN = 1024, singleCoreK = 512.
  • Basic block shape: baseM = 128, baseN = 128, baseK = 64.
  • Tiling parameters related to the L1 cache: stepM = 1, stepN = 1, stepKa = 8, stepKb = 8, depthA1 = 8, depthB1 = 16.
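
The block counts implied by these parameters can be worked out directly. The following is an illustrative calculation only: the constant and function names are hypothetical, and it assumes that one L1 load of A covers (baseM x stepM) by (baseK x stepKa) elements, that one L1 load of B covers (baseK x stepKb) by (baseN x stepN) elements, and that float16 occupies 2 bytes.

```cpp
#include <cstdint>

// Tiling values copied from the lists above.
constexpr uint32_t kN = 24576;
constexpr uint32_t kSingleCoreM = 128, kSingleCoreN = 1024, kSingleCoreK = 512;
constexpr uint32_t kBaseM = 128, kBaseN = 128, kBaseK = 64;
constexpr uint32_t kStepM = 1, kStepN = 1, kStepKa = 8, kStepKb = 8;
constexpr uint32_t kBytesPerHalf = 2;  // float16

// Integer division rounded up.
constexpr uint32_t CeilDiv(uint32_t a, uint32_t b)
{
    return (a + b - 1) / b;
}

// Cores needed to cover N when each core handles singleCoreN columns.
constexpr uint32_t kCoresUsed = CeilDiv(kN, kSingleCoreN);

// L1 loads of A per core under the assumption stated in the lead-in.
constexpr uint32_t kALoadsPerCore =
    CeilDiv(kSingleCoreM, kBaseM * kStepM) * CeilDiv(kSingleCoreK, kBaseK * kStepKa);

// L1 loads of B per core under the same assumption.
constexpr uint32_t kBLoadsPerCore =
    CeilDiv(kSingleCoreN, kBaseN * kStepN) * CeilDiv(kSingleCoreK, kBaseK * kStepKb);

// Bytes moved by one L1 load of B.
constexpr uint32_t kBytesPerBLoad =
    kBaseK * kStepKb * kBaseN * kStepN * kBytesPerHalf;

static_assert(kCoresUsed == 24, "all 24 cores are used");
static_assert(kALoadsPerCore == 1, "A is loaded once per core");
static_assert(kBLoadsPerCore == 8, "B is loaded eight times per core");
static_assert(kBytesPerBLoad == 128 * 1024, "each B load moves 128 KiB");
```

Under these assumptions, each core loads A once and B eight times, so the B transfers are the ones that preloading in the N direction can overlap.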

Obtaining Profile Data

Use the msProf tool to obtain the operator simulation pipeline diagram and on-board profiling data, and analyze the pipeline status of Cube and Fixpipe.

Analyzing Main Bottlenecks

  • The following figure shows the pipeline before optimization. Because the M and K directions are fully loaded, matrix A is moved only once. Because N is large, matrix B is moved multiple times, and there are visible gaps between consecutive MTE2 transfers.
  • The following figure shows the profiling data before optimization. The average aic_time is 30.88 μs.

Optimization Solution

Enable MTE2 preload by setting the doMTE2Preload parameter when creating the Matmul object. For a complete example, see Matmul operator sample for preloading in the M and N directions. Perform the following steps:

  1. Set the doMTE2Preload parameter in the MDL template to 2 to enable the preload function in the N direction.
    // doMTE2Preload = 2 enables preload in the N direction.
    constexpr uint32_t preloadMode = 2;
    static constexpr MatmulConfig MM_CFG = GetMDLConfig(false, false, preloadMode);
    
  2. Create a Matmul object based on the customized MatmulConfig template parameters.
    AscendC::Matmul<AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, cType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, biasType>, MM_CFG> matmulObj;
    

Verifying Optimization Benefits

  • The following figure shows the pipeline after optimization. The tiling parameters are unchanged. The B matrix data used in the next computation is now loaded in advance, and the gaps between consecutive MTE2 transfers are shorter.
  • The following figure shows the profiling data after optimization. The average aic_time is 28.50 μs, down from 30.88 μs before optimization, a reduction of about 7.7%.

Summary

If the gaps in the MTE2 pipeline are large and the value of M or N is large, enable MTE2 preload to load the data of matrix A or B in advance.