Matmul High-level API Enabling Full Static Tiling

Case Study

This case shows how to improve operator performance by enabling full static tiling when the Matmul high-level API is used for matrix multiplication. The Matmul API performs a large amount of scalar computation during initialization and between iterations. Scalar computation during initialization increases the instruction header overhead, and scalar computation between iterations may block the MTE2 pipeline. With static tiling, the MatmulApiStaticTiling parameter replaces the TCubeTiling variable parameter when the Matmul API is called, so the scalar computation is performed in advance during compilation. This reduces the scalar computation overhead at runtime and improves operator performance.

Matmul static tiling is applicable to the following scenarios:
- A large number of Scalar computations are performed during Matmul initialization, affecting the instruction header overhead.
- A large number of Scalar computations are performed between Matmul iterations, blocking the MTE2 pipeline.
To use Matmul tiling constants, some tiling parameters must be determined at compile time. Depending on which parameters can be determined, either full constants or partial constants are used. One of the following two conditions must be met:
- Full constants: Both the single-core shape (singleCoreM/singleCoreN/singleCoreK) and the base shape (basicM/basicN/basicK, or baseM/baseN/baseK) can be determined at compile time.
- Partial constants: Only the base shape (basicM/basicN/basicK, or baseM/baseN/baseK) can be determined at compile time.
Full constants reduce the scalar computation overhead more than partial constants.
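The difference between variable tiling and constant tiling can be illustrated outside the Ascend C API with a minimal, generic C++ sketch. The names RuntimeTiling, ConstTiling, CeilDiv, and IterCount are hypothetical and are not part of the Matmul API. The sketch only shows that, when the single-core shape and base shape are compile-time constants, a derived quantity such as the number of base blocks can be folded by the compiler instead of being computed by scalar instructions at runtime. The shape values are the ones used in this case.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for tiling that arrives from the host as variables.
struct RuntimeTiling {
    int32_t singleCoreM, singleCoreN, singleCoreK;
    int32_t baseM, baseN, baseK;
};

constexpr int32_t CeilDiv(int32_t a, int32_t b) { return (a + b - 1) / b; }

// Variable tiling: the divisions are evaluated when the kernel runs.
inline int32_t IterCount(const RuntimeTiling& t) {
    return CeilDiv(t.singleCoreM, t.baseM) * CeilDiv(t.singleCoreN, t.baseN);
}

// Hypothetical stand-in for full constants: every shape is a template
// parameter, so the block count is a compile-time constant.
template <int32_t SM, int32_t SN, int32_t SK, int32_t BM, int32_t BN, int32_t BK>
struct ConstTiling {
    static constexpr int32_t iterCount = CeilDiv(SM, BM) * CeilDiv(SN, BN);
};

// Shapes of this case: single-core 128 x 1280 x 64, base 128 x 256 x 64.
using CaseTiling = ConstTiling<128, 1280, 64, 128, 256, 64>;
static_assert(CaseTiling::iterCount == 5, "evaluated entirely at compile time");
```

With partial constants, only the base shape would be a template parameter in this model, so the divisions would still depend on runtime single-core values.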

The operator specifications are as follows.

**Table 1** Operator specifications
Input	Shape	Data Type	Format
a	128, 64	float16	ND
b	64, 30720	float16	ND

On the AI processor used in the current case, there are 24 cores in total, and each core contains one AIC and two AIVs.

The tiling parameters are as follows:

Original shape: M = 128, N = 30720, K = 64.
Single-core shape: The tiling is performed based on 24 AICs. singleCoreM = 128, singleCoreN = 1280, and singleCoreK = 64.
Matrix B is tiled along the N axis into 24 single-core tiles of width singleCoreN, so each core processes K x singleCoreN elements of B. Matrix A is not tiled along the M axis, that is, singleCoreM = M, so each core processes singleCoreM x K elements of A. A total of 24 cores are involved in the computation.
Basic block shape: baseM = 128, baseN = 256, and baseK = 64.
L1-related tiling parameters: stepM = 1, stepN = 1, stepKa = 4, stepKb = 4, depthA1 = 8, and depthB1 = 8.
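The tiling parameters above can be cross-checked with simple arithmetic. The following is a minimal sketch in plain C++, not Ascend C kernel code. The constant names are illustrative, and the 2-byte element size is taken from the float16 inputs in Table 1.

```cpp
#include <cassert>
#include <cstdint>

// Original shape and core count of this case.
constexpr int32_t M = 128, N = 30720, K = 64;
constexpr int32_t kAicCount = 24;

// Single-core shape: only the N axis is split across the AICs.
constexpr int32_t kSingleCoreM = M;
constexpr int32_t kSingleCoreN = N / kAicCount;
constexpr int32_t kSingleCoreK = K;
static_assert(N % kAicCount == 0, "N divides evenly across the 24 AICs");
static_assert(kSingleCoreN == 1280, "matches singleCoreN in the text");

// Base block shape.
constexpr int32_t kBaseM = 128, kBaseN = 256, kBaseK = 64;

// Number of base blocks each core walks through along each axis.
constexpr int32_t kBlocksM = kSingleCoreM / kBaseM;  // 1
constexpr int32_t kBlocksN = kSingleCoreN / kBaseN;  // 5
constexpr int32_t kBlocksK = kSingleCoreK / kBaseK;  // 1

// Footprint of one base block of A and B in bytes (float16 = 2 bytes).
constexpr int32_t kHalfBytes = 2;
constexpr int32_t kBaseABytes = kBaseM * kBaseK * kHalfBytes;  // 16384
constexpr int32_t kBaseBBytes = kBaseN * kBaseK * kHalfBytes;  // 32768
```

Under these parameters, each core produces five 128 x 256 output blocks, and the K axis is covered by a single base block.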

Obtaining Profile Data

Use the msProf tool to obtain the Operator Simulation Pipeline and On-board Profiling data. Compared with basic scenarios, static tiling converts some or all tiling parameters from variables to constants during compilation. During operator execution, the static tiling parameters are directly used to reduce the scalar performance overhead. Therefore, the focus is on scalar pipeline analysis.

Analyzing Main Bottlenecks

The following figure shows the pipeline before optimization. Static tiling is disabled by default, so the tiling parameters must be copied from the host to the kernel. As a result, a large amount of scalar computation is performed during Matmul initialization. The first MTE2 instruction starts at about 3.536 μs, and the instruction header overhead before it accounts for a large proportion of the entire operator pipeline. Therefore, the scalar computation needs to be optimized.

The following table shows the profiling data before optimization. According to the aic_time data in column C, the maximum operator execution time among multiple cores is 10.62 μs. According to the aic_scalar_time data in column G, the average scalar execution time is 6.32 μs.

Optimization Solution

As shown in the following figure, static tiling is disabled by default. In this mode, you create a tiling object on the host and call the API to obtain the tiling parameters automatically. You then pass the tiling parameters from the host to the kernel and pass them to the Matmul object during initialization on the kernel side. During operator execution, the tiling variable parameters are used to complete matrix multiplication.

Figure 1 Matmul computation process with static tiling disabled by default

As shown in the following figure, when static tiling is enabled, you only need to call the GetMatmulApiTiling API to obtain the static tiling information during compilation when creating a Matmul object on the kernel side. During operator execution, the static tiling parameters are used to complete matrix multiplication, reducing the scalar computation overhead.

Figure 2 Matmul computation process with static tiling enabled

For details about the complete example of enabling full static tiling by using Matmul APIs, see operator sample for Matmul static tiling. The procedure for enabling full static tiling is as follows:

When the GetMMConfig API is called to obtain the MatmulConfig template, set MatmulShapeParams to a constant value to obtain a customized MatmulConfig template CUSTOM_CFG with constant parameters.

        
constexpr int32_t MAX_M = 10000; // maximum M-dimension size supported by the custom Matmul kernel
constexpr int32_t MAX_N = 10000; // maximum N-dimension size supported by the custom Matmul kernel
constexpr int32_t MAX_K = 10000; // maximum K-dimension size supported by the custom Matmul kernel
constexpr int32_t BASE_M = 128;  // BASE_M * BASE_K * sizeof(typeA) <= L0A size
constexpr int32_t BASE_N = 256;  // BASE_N * BASE_K * sizeof(typeB) <= L0B size
constexpr int32_t BASE_K = 64;   // BASE_M * BASE_N * sizeof(typeC) <= L0C size
constexpr MatmulShapeParams shapeParams = { MAX_M,
                                            MAX_N,
                                            MAX_K,
                                            BASE_M,
                                            BASE_N,
                                            BASE_K };
constexpr MatmulConfig CUSTOM_CFG = GetMMConfig<MatmulConfigMode::CONFIG_MDL>(shapeParams);

Create a Matmul object. Call the GetMatmulApiTiling API to replace the tiling information with constants to obtain the constant template parameter CONSTANT_CFG, including the Matmul tiling information and MatmulConfig template. When creating a Matmul object, use the constant template parameter CONSTANT_CFG.

        
using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType>;
using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType>;
using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, cType>;
using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, biasType>;
constexpr static auto CONSTANT_CFG = AscendC::GetMatmulApiTiling<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE>(CUSTOM_CFG);
AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CONSTANT_CFG> matmulObj;

Initialize the Matmul object. With full constantization, you can pass a null pointer in place of the tiling parameter of the REGIST_MATMUL_OBJ API. With partial constantization, the tiling parameter must still be passed when REGIST_MATMUL_OBJ is used to initialize the Matmul object on the kernel side.

        
// Initialization example in the full constantization scenario
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj, (TCubeTiling*)nullptr);

// Initialization example in the partial constantization scenario
REGIST_MATMUL_OBJ(&pipe, GetSysWorkSpacePtr(), matmulObj, &tiling);
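Why a null tiling pointer is acceptable with full constants, but not with partial constants, can be pictured with a small generic C++17 model. This is not how REGIST_MATMUL_OBJ or the Matmul class is implemented. MatmulModel, TilingVars, and the convention that a template value of 0 means "not known at compile time" are assumptions made only for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical runtime tiling that would be passed from the host.
struct TilingVars {
    int32_t singleCoreN;
    int32_t baseN;
};

// Hypothetical model: SingleCoreN == 0 means the single-core shape is unknown
// at compile time (partial constants); BaseN is always a constant.
template <int32_t SingleCoreN, int32_t BaseN>
class MatmulModel {
public:
    static constexpr bool kFullConst = SingleCoreN > 0;

    void Init(const TilingVars* tiling) {
        if constexpr (kFullConst) {
            (void)tiling;                 // never dereferenced, may be nullptr
            singleCoreN_ = SingleCoreN;   // value baked in at compile time
        } else {
            singleCoreN_ = tiling->singleCoreN;  // must come from the host
        }
    }

    // Number of base blocks along N handled by this core.
    int32_t BlocksN() const { return (singleCoreN_ + BaseN - 1) / BaseN; }

private:
    int32_t singleCoreN_ = 0;
};
```

In this model, MatmulModel<1280, 256> can be initialized with Init(nullptr), whereas MatmulModel<0, 256> needs a valid TilingVars pointer, mirroring the two REGIST_MATMUL_OBJ calls above.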

Verifying Optimization Benefits

The following figure shows the optimized pipeline. With full static tiling enabled, the tiling parameters no longer need to be copied from the host to the kernel. Static tiling is completed during compilation, which reduces the scalar computation during Matmul initialization. The time from 0 μs to the start of the first MTE2 instruction is the Matmul initialization time. It is reduced from 3.536 μs to 2.185 μs, which is about 38.2% shorter.
The following table shows the profiling data after optimization. According to the aic_time data in column C, the maximum operator execution time among multiple cores is 7.87 μs, which is 25.9% shorter than the 10.62 μs before the optimization. According to the aic_scalar_time data in column G, the average scalar execution time is 3.38 μs, which is 46.5% shorter than the 6.32 μs before the optimization.
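The percentages quoted above follow from the relative reduction (before - after) / before. A minimal C++ sketch of that arithmetic, using only the measurements reported in this case (ReductionPercent and the constant names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Relative reduction in percent between two timings in the same unit.
constexpr double ReductionPercent(double before, double after) {
    return (before - after) / before * 100.0;
}

// Timings from this case, in microseconds.
constexpr double kAicTimeGain    = ReductionPercent(10.62, 7.87);   // about 25.9
constexpr double kScalarTimeGain = ReductionPercent(6.32, 3.38);    // about 46.5
constexpr double kInitTimeGain   = ReductionPercent(3.536, 2.185);  // about 38.2
```

The scalar time drops by a larger fraction than the total execution time because only part of the operator time is spent on scalar work.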

Summary

When an operator calls the Matmul API to complete matrix multiplication, a large amount of scalar computation during Matmul initialization increases the instruction header overhead, and a large amount of scalar computation between Matmul iterations blocks the MTE2 pipeline. In either scenario, if the conditions for enabling static tiling (full or partial) are met, you can enable static tiling to reduce the scalar computation overhead and improve operator performance.
