Matmul High-level API Enabling MDL Template

Case Study

This case shows the performance gain from enabling the MDL template of the high-level Matmul API in a matrix multiplication operator. In the MDL template, the MTE2 pipeline moves data from the global memory to A1/B1 in large one-time transfers. That is, multiple basic blocks of the Matmul computation are transferred into A1/B1 at a time, which improves bandwidth utilization. The subsequent MTE1 pipeline can then reuse the basic-block data cached in A1/B1 as much as possible, reducing the number of MTE2 transfers. For details about the MDL template, see MatmulConfig.

Application scenarios of the MDL template
The MDL template is generally applicable to large-shape scenarios with a large number of MTE2 cyclic transfers. The MDL template caches the data required for multiple computations in A1/B1 to avoid frequent MTE2 transfers.
Constraints on the MDL template
The TCubeTiling structure of the MDL template must meet the TCubeTiling restrictions and the supplementary restrictions on the MDL template. For details, see TCubeTiling Structure.

The operator specifications are as follows.

**Table 1** Operator specifications
Input	Shape	Data Type	Format
a	128, 1024	float16	ND
b	1024, 30720	float16	ND

On the AI processor used in the current case, there are 24 cores in total, and each core contains one AIC and two AIVs.

The tiling parameters are as follows:

Original shape: M = 128, N = 30720, K = 1024.
Single-core shape: The tiling is performed based on 24 AICs. singleCoreM = 128, singleCoreN = 1280, and singleCoreK = 1024.
For matrix B, the tiling is performed along the N axis, resulting in 24 single-core tiles of width singleCoreN. A single core processes K x singleCoreN data. For matrix A, the M axis is not tiled, that is, singleCoreM = M. A single core processes singleCoreM x K data. A total of 24 cores are involved in the computation.
Basic block shape: baseM = 128, baseN = 256, and baseK = 64.
L1-related tiling parameters: stepM = 1, stepN = 1, stepKa = 4, stepKb = 4, depthA1 = 8, and depthB1 = 8.
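The arithmetic behind these tiling parameters can be checked with a short sketch. The constants are copied from the values listed above; the helper names are hypothetical and are not part of the Matmul API.

```cpp
#include <cstdint>

// Values copied from the operator specifications and tiling parameters above.
constexpr uint32_t kN = 30720;      // original N
constexpr uint32_t kK = 1024;       // original K (equal to singleCoreK here)
constexpr uint32_t kCoreNum = 24;   // number of AICs the N axis is split across
constexpr uint32_t kBaseN = 256;
constexpr uint32_t kBaseK = 64;
constexpr uint32_t kStepKa = 4;

// Hypothetical helpers for illustration only (not Matmul API functions).
constexpr uint32_t SingleCoreN() { return kN / kCoreNum; }              // 1280
constexpr uint32_t NBlocksPerCore() { return SingleCoreN() / kBaseN; }  // 5
constexpr uint32_t KSlices() { return kK / kBaseK; }                    // 16
constexpr uint32_t KStepGroups() { return KSlices() / kStepKa; }        // 4

// The splits in this case leave no tail blocks.
static_assert(kN % kCoreNum == 0, "N divides evenly across the cores");
static_assert(SingleCoreN() % kBaseN == 0, "singleCoreN is a multiple of baseN");
static_assert(kK % (kBaseK * kStepKa) == 0, "K is a multiple of baseK * stepKa");
```

Under these values, each core computes 5 basic blocks along N, and each basic block accumulates over 16 baseK slices, which stepKa = 4 groups into 4 sets of slices.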

Obtaining Profile Data

Use the msProf tool to obtain the Operator Simulation Pipeline and On-board Profiling data. The MDL template mainly optimizes the MTE2 transfer efficiency, and therefore the focus is on analyzing the MTE2 pipeline.

Analyzing Main Bottlenecks

The following table shows the profiling data before optimization. By default, the Matmul operator uses the Norm template. According to the aic_time data in column C, the maximum operator execution time across cores is 83.68 μs. According to the aic_mte2_time data in column L and the aic_mte2_ratio data in column M, the average time consumed by MTE2 is 75.64 μs, accounting for more than 92% of the total time. Therefore, the time consumed by MTE2 needs to be optimized.
The following figure shows the pipeline before optimization. Because the input matrices are large, MTE2 transfers basic blocks from the global memory to A1 or B1 many times in a loop, but only one basic block per transfer. As a result, the bandwidth utilization is low and the overall MTE2 transfer time is long. This stalls the subsequent MTE1 and MMAD pipelines, which spend a long time waiting on synchronization between pipelines. As shown in the red box, the computation of the first basic block (baseM x baseN) requires calling the MMAD instruction 16 times (singleCoreK/baseK = 16). The time from the first MMAD instruction call on the left to the 16th MMAD instruction call on the right is 10.899 μs, with most of the time spent on pipeline synchronization.

Optimization Solution

The following figure shows the Matmul computation pipeline of the default Norm template. MTE2 transfers basic blocks from the global memory to A1 or B1 multiple times, with only one basic block transferred at a time. The advantage of the Norm template is its small startup overhead, which allows the MTE1 pipeline to start early. The disadvantage is that in large-shape scenarios MTE2 performs many small transfers, resulting in low transfer bandwidth utilization and high overall performance overhead.

Figure 1 Pipeline of the default Norm template

The procedure for implementing the Norm template is as follows:

Create a Matmul object and use the default Norm template parameter CFG_NORM.

        
#define ASCENDC_CUBE_ONLY
#include "lib/matmul_intf.h"

using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_NORM> matmulObj; // Use CFG_NORM to define the Matmul object.

The following figure shows the Matmul computation pipeline of the MDL template. MTE2 transfers multiple basic blocks from the global memory to A1 or B1 at a time. Each time, stepM x stepKa basic blocks are transferred to A1 or stepN x stepKb basic blocks are transferred to B1. The advantage of the MDL template is that MTE2 transfers multiple basic blocks at a time, achieving high bandwidth utilization. The subsequent MTE1 pipeline can reuse the cached data in A1 or B1 as much as possible, reducing the number of repeated transfers by MTE2. The disadvantage of the MDL template is that the head overhead of MTE2 is long: the MTE1 pipeline can start only after the first large MTE2 transfer is complete, so MTE1 starts late.

Figure 2 Pipeline of the MDL template
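To make the difference in transfer granularity concrete, the following sketch estimates the size of one MTE2 transfer under each template for this case. It assumes float16 inputs (2 bytes per element), one basic block per transfer for Norm, and stepM x stepKa (or stepN x stepKb) basic blocks per transfer for MDL, as described above. The helper names are hypothetical and are not part of the Matmul API.

```cpp
#include <cstdint>

constexpr uint32_t kHalfBytes = 2;  // size of one float16 element in bytes
constexpr uint32_t kBaseM = 128, kBaseN = 256, kBaseK = 64;
constexpr uint32_t kStepM = 1, kStepN = 1, kStepKa = 4, kStepKb = 4;
constexpr uint32_t kSingleCoreK = 1024;

// Bytes in one basic block of A (baseM x baseK) and of B (baseK x baseN).
constexpr uint32_t ABlockBytes() { return kBaseM * kBaseK * kHalfBytes; }  // 16384
constexpr uint32_t BBlockBytes() { return kBaseK * kBaseN * kHalfBytes; }  // 32768

// Bytes moved by one MDL transfer into A1 or B1.
constexpr uint32_t MdlATransferBytes() { return kStepM * kStepKa * ABlockBytes(); }  // 65536
constexpr uint32_t MdlBTransferBytes() { return kStepN * kStepKb * BBlockBytes(); }  // 131072

// Number of A transfers needed to cover singleCoreK for one baseM x baseN output block.
constexpr uint32_t NormATransfers() { return kSingleCoreK / kBaseK; }             // 16
constexpr uint32_t MdlATransfers() { return kSingleCoreK / (kBaseK * kStepKa); }  // 4
```

With these parameters, one output basic block needs 16 transfers of 16 KiB for matrix A under Norm, compared with 4 transfers of 64 KiB under MDL. The total data volume is the same; only the number and size of the transfers change.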

For details about the complete example of enabling the MDL template by using the Matmul API, see Matmul API performance optimization sample. The procedure for enabling the MDL template is as follows:

Create a Matmul object and use the default MDL template parameter CFG_MDL.

        
#define ASCENDC_CUBE_ONLY
#include "lib/matmul_intf.h"

using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_MDL> matmulObj; // Use CFG_MDL to define the Matmul object.

Verifying Optimization Benefits

The following figure shows the profiling data after optimization. According to the aic_time data in column C, the maximum operator execution time across cores is 53.4 μs, a significant improvement over the 83.68 μs before optimization. According to the aic_mte2_time data in column L, the average MTE2 execution time decreases significantly from 75.64 μs before optimization to 46.24 μs.
The following figure shows the optimized pipeline. Compared with the default Norm template, the MDL template enables MTE2 to transfer multiple basic blocks at a time, reducing the overall number of MTE2 transfers. In addition, because MTE2 transfers multiple basic blocks to A1/B1 at a time, the subsequent MTE1 pipeline can reuse the cached data in A1/B1 as much as possible, reducing the pipeline synchronization time and improving the overall operator performance. As shown in the red box, the computation of the first basic block (baseM x baseN) requires calling the MMAD instruction 16 times (singleCoreK/baseK = 16). The time from the first MMAD instruction call on the left to the end of the 16th MMAD instruction call on the right is about 5.198 μs, which is significantly shorter than the 10.899 μs before optimization. The pipeline synchronization time is greatly reduced.
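The relative gains implied by these measurements can be derived as follows. This is a small sketch that only restates the figures quoted above; the helper and constant names are hypothetical.

```cpp
// Timings quoted in this case study, in microseconds.
constexpr double kAicTimeNormUs = 83.68;      // max aic_time, Norm template
constexpr double kAicTimeMdlUs = 53.4;        // max aic_time, MDL template
constexpr double kMte2TimeNormUs = 75.64;     // average aic_mte2_time, Norm template
constexpr double kMte2TimeMdlUs = 46.24;      // average aic_mte2_time, MDL template
constexpr double kMmadWindowNormUs = 10.899;  // first to 16th MMAD call, Norm template
constexpr double kMmadWindowMdlUs = 5.198;    // first to 16th MMAD call, MDL template

// Hypothetical helpers: speedup factor and fractional reduction.
constexpr double Speedup(double before, double after) { return before / after; }
constexpr double Reduction(double before, double after) { return (before - after) / before; }

constexpr double AicSpeedup() { return Speedup(kAicTimeNormUs, kAicTimeMdlUs); }
constexpr double Mte2Reduction() { return Reduction(kMte2TimeNormUs, kMte2TimeMdlUs); }
constexpr double MmadWindowSpeedup() { return Speedup(kMmadWindowNormUs, kMmadWindowMdlUs); }
```

From these figures, the operator runs roughly 1.57 times faster, the average MTE2 time drops by about 39%, and the time window for the 16 MMAD calls of the first basic block shrinks by a factor of about 2.1.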

Summary

The MDL template can be enabled in scenarios where the input shape is large, the number of MTE2 transfers is large, and the synchronization between MTE1 and MTE2 pipelines takes a long time. MTE2 can transfer multiple basic blocks from the global memory to A1 or B1 at a time. In this way, the subsequent MTE1 pipeline can reuse the cached data in A1 or B1 as much as possible, reducing the number of MTE2 transfers and improving operator performance.
