Matmul Constant Operator Performance Optimization
Case Study
This case shows how to improve the operator performance by enabling Matmul static tiling when the Matmul high-level API is used for matrix multiplication computation. For details about Matmul static tiling, see Matmul Static Tiling.
The Matmul APIs perform a large number of Scalar computations during initialization and iteration. The Scalar computations during Matmul initialization add to the instruction head overhead, and the Scalar computations between Matmul iterations may block the MTE2 pipeline. Most of the variables involved in these Scalar computations are the tiling parameters singleCoreM/singleCoreN/singleCoreK and baseM/baseN/baseK. If the kernel implementation replaces the variables obtained from the tiling parameters with constants, the related computations move to compile time, which reduces the Scalar computation overhead at run time.
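The effect of moving tiling-dependent arithmetic to compile time can be illustrated with a minimal plain C++ sketch. It is not how the Matmul API is implemented internally: `RuntimeTiling`, `StaticTiling`, `CeilDiv`, and `TotalBlocksRuntime` are hypothetical names introduced only for this illustration, and the values are the tiling parameters this case study uses later.

```cpp
#include <cstdint>

// Hypothetical stand-in for tiling data that a kernel reads at run time.
struct RuntimeTiling {
    uint32_t singleCoreM, singleCoreN, singleCoreK;
    uint32_t baseM, baseN, baseK;
};

// Rounded-up division, usable at both compile time and run time.
constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Run-time variant: the divisions and multiplications are evaluated
// on every call, because the tiling values are ordinary variables.
inline uint32_t TotalBlocksRuntime(const RuntimeTiling& t) {
    return CeilDiv(t.singleCoreM, t.baseM) *
           CeilDiv(t.singleCoreN, t.baseN) *
           CeilDiv(t.singleCoreK, t.baseK);
}

// Compile-time variant: the tiling values are template constants, so the
// compiler folds every expression below into a literal.
template <uint32_t M, uint32_t N, uint32_t K,
          uint32_t BM, uint32_t BN, uint32_t BK>
struct StaticTiling {
    static constexpr uint32_t mIter = CeilDiv(M, BM);
    static constexpr uint32_t nIter = CeilDiv(N, BN);
    static constexpr uint32_t kIter = CeilDiv(K, BK);
    static constexpr uint32_t total = mIter * nIter * kIter;
};

// singleCoreM/N/K = 32/32/2048 and baseM/N/K = 32/32/256.
using CaseTiling = StaticTiling<32, 32, 2048, 32, 32, 256>;
static_assert(CaseTiling::kIter == 8, "2048 / 256 K-blocks expected");
static_assert(CaseTiling::total == 8, "1 * 1 * 8 base blocks expected");
```

Both variants produce the same block count, but only the run-time variant leaves arithmetic to be executed when the kernel runs.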
The operator specifications are as follows.
| Input | Shape | Data type | Format |
|---|---|---|---|
| a | 32, 2048 | float16 | ND |
| b | 2048, 32 | float16 | ND |
For more examples of enabling Matmul static tiling, see matmul_api_constant.
Obtaining Profile Data
Use the msProf tool to obtain the profile data of the operator, focusing on the status of the Scalar pipeline.
Analyzing Main Bottlenecks
According to the profile data, the Scalar pipeline accounts for a high proportion of the execution time, which indicates that the performance bottleneck lies in the Scalar pipeline. Before optimization, the average operator execution time over multiple runs is 21.88 μs.
Developing Optimization Solutions
Enable Matmul static tiling. Specifically, use constant template parameters when creating the Matmul object. The procedure is as follows:
1. When calling the GetMMConfig API to obtain the MatmulConfig template, set MatmulShapeParams to constant values to obtain a customized MatmulConfig template with constant parameters.
2. Call the GetMatmulApiTiling API to replace the tiling information with constants. This yields the constant template parameters, which include the Matmul tiling information and the MatmulConfig template.
3. When creating the Matmul object, use the constant template parameters obtained in step 2.
For these operator specifications, the optimal tiling returned by the GetTiling API is as follows: singleCoreM, singleCoreN, and singleCoreK are 32, 32, and 2048 respectively, and baseM, baseN, and baseK are 32, 32, and 256 respectively. Only the part of the original code that creates the Matmul object needs to be modified, as shown below:
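    // singleCoreM, singleCoreN, singleCoreK, baseM, baseN, baseK
    constexpr static MatmulShapeParams shapeParams = {32, 32, 2048, 32, 32, 256};
    // Step 1: MatmulConfig template with constant shape parameters
    constexpr static MatmulConfig MM_CFG = GetMMConfig<MatmulConfigMode::CONFIG_NORM>(shapeParams);
    // Step 2: constant tiling information plus the MatmulConfig template
    constexpr static MatmulApiStaticTiling MM_CFG_CONSTANT = GetMatmulApiTiling<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE>(MM_CFG);
    // Step 3: create the Matmul object with the constant template parameters
    MatmulImpl<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, MM_CFG_CONSTANT> mm;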
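The three-step pattern above (constant shape parameters, a constant configuration, then a constant tiling object passed as a template argument) can be mimicked in plain C++ to show why everything resolves at compile time. This is only a sketch under stated assumptions: `ShapeParams`, `Config`, `StaticTiling`, `MakeConfig`, `MakeStaticTiling`, and `MatmulSketch` are hypothetical stand-ins, not the Ascend C types or functions, and their fields are invented for illustration.

```cpp
#include <cstdint>

// Hypothetical stand-ins; these are NOT the Ascend C definitions.
struct ShapeParams {
    uint32_t singleCoreM, singleCoreN, singleCoreK;
    uint32_t baseM, baseN, baseK;
};
struct Config {
    ShapeParams shape;
    bool isStatic;
};
struct StaticTiling {
    Config cfg;
    uint32_t kIter;  // number of baseK blocks along K, precomputed
};

// Step 1 analogue: build a configuration from constant shape parameters.
constexpr Config MakeConfig(ShapeParams s) { return Config{s, true}; }

// Step 2 analogue: derive constant tiling information from the configuration.
constexpr StaticTiling MakeStaticTiling(Config c) {
    return StaticTiling{
        c, (c.shape.singleCoreK + c.shape.baseK - 1) / c.shape.baseK};
}

// Step 3 analogue: the tiling object is a template argument, so the class
// sees its fields as compile-time constants.
template <const StaticTiling& TILING>
struct MatmulSketch {
    static constexpr uint32_t KLoopCount() { return TILING.kIter; }
};

constexpr static ShapeParams kShape = {32, 32, 2048, 32, 32, 256};
constexpr static Config kConfig = MakeConfig(kShape);
constexpr static StaticTiling kTiling = MakeStaticTiling(kConfig);

using Mm = MatmulSketch<kTiling>;
static_assert(Mm::KLoopCount() == 8, "2048 / 256 K-blocks expected");
```

Because `kTiling` is a `constexpr` object with static storage duration, it can be bound to the reference template parameter, and the `static_assert` confirms that the loop count is available without any run-time computation.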
Verifying Optimization Benefits
After optimization, the average operator execution time over multiple runs drops from 21.88 μs to 17.728 μs, a reduction of about 19%, and the Scalar computation time is reduced by 44.4%.
Summary
If the shape information for a single Matmul computation on a single core (singleCoreM/singleCoreN/singleCoreK and baseM/baseN/baseK) is fixed and known in advance, you can enable Matmul static tiling to reduce the Scalar computation overhead and improve operator performance.