Enabling the MDL Template Using the Matmul High-Level API
Case Study
This case demonstrates how to use the Matmul high-level API to implement a matrix multiplication operator and how to enable the MDL template to improve operator performance. In the MDL template, the MTE2 pipeline transfers data from the global memory to A1/B1 in large packets: multiple basic blocks of the Matmul computation are transferred into A1/B1 at a time, which improves bandwidth utilization. The subsequent MTE1 pipeline can then reuse the basic-block data cached in A1/B1 as much as possible, reducing the number of MTE2 transfers. For details about the MDL template, see MatmulConfig.
- Application scenarios of the MDL template
The MDL template is generally suited to large-shape scenarios in which the MTE2 pipeline performs many data transfers. It caches the data required for multiple computations in A1/B1 to avoid frequent MTE2 transfers.
- Constraints of the MDL template
The TCubeTiling structure of the MDL template must meet the TCubeTiling constraints and the supplementary constraints of the MDL template. For details, see TCubeTiling Structure.
The operator specifications are as follows.
| Input | Shape | Data type | Format |
|---|---|---|---|
| a | 128, 1024 | float16 | ND |
| b | 1024, 30720 | float16 | ND |
The AI processor used in this case has 24 cores, each of which contains one AIC core and two AIV cores.
The tiling parameters are as follows:
- Original shape: M = 128, N = 30720, K = 1024.
- Single-core shape: The data is tiled based on 24 AIC cores. singleCoreM = 128, singleCoreN = 1280, and singleCoreK = 1024.
Matrix B is tiled into multiple tiles of singleCoreN along the N axis. A single core processes K x singleCoreN data. For matrix A, the M axis is not tiled, that is, singleCoreM = M. Each core processes data of the size singleCoreM x K. A total of 24 cores are involved in computation.
- Basic block shape: baseM = 128, baseN = 256, and baseK = 64.
- L1-related tiling parameters: stepM = 1, stepN = 1, stepKa = 4, stepKb = 4, depthA1 = 8, and depthB1 = 8.
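The tiling parameters above can be cross-checked with a little compile-time arithmetic. The following is a minimal sketch, not part of the original sample: the constant names simply mirror the parameters listed above, and the interpretation of depthA1/depthB1 as the number of basic blocks resident in A1/B1 is an assumption made for illustration.

```cpp
#include <cstdint>

// Values copied from the tiling parameters listed above.
constexpr uint32_t M = 128, N = 30720, K = 1024;
constexpr uint32_t coreNum = 24;  // AIC cores used for tiling
constexpr uint32_t singleCoreM = 128, singleCoreN = 1280, singleCoreK = 1024;
constexpr uint32_t baseM = 128, baseN = 256, baseK = 64;
constexpr uint32_t stepM = 1, stepN = 1, stepKa = 4, stepKb = 4;
constexpr uint32_t depthA1 = 8, depthB1 = 8;
constexpr uint32_t bytesPerElem = 2;  // float16

// N is split evenly across the 24 AIC cores; M and K are not split.
static_assert(N % coreNum == 0 && N / coreNum == singleCoreN, "N split mismatch");
static_assert(singleCoreM == M && singleCoreK == K, "M and K are not tiled");

// Basic-block iterations per core along each axis.
constexpr uint32_t mIter = singleCoreM / baseM;  // 1
constexpr uint32_t nIter = singleCoreN / baseN;  // 5
constexpr uint32_t kIter = singleCoreK / baseK;  // 16 MMAD calls per output block

// Size in bytes of one basic block of A (baseM x baseK) and of B (baseK x baseN).
constexpr uint32_t aBlockBytes = baseM * baseK * bytesPerElem;  // 16 KiB
constexpr uint32_t bBlockBytes = baseK * baseN * bytesPerElem;  // 32 KiB

// Bytes carried by one MDL-style transfer (stepM x stepKa or stepN x stepKb blocks).
constexpr uint32_t aTransferBytes = stepM * stepKa * aBlockBytes;  // 64 KiB
constexpr uint32_t bTransferBytes = stepN * stepKb * bBlockBytes;  // 128 KiB

// Assumption: depthA1/depthB1 count the basic blocks held in A1/B1.
constexpr uint32_t a1Bytes = depthA1 * aBlockBytes;  // 128 KiB
constexpr uint32_t b1Bytes = depthB1 * bBlockBytes;  // 256 KiB
```

Under that assumption, depthA1 = 2 x stepM x stepKa and depthB1 = 2 x stepN x stepKb, so each of A1 and B1 holds two transfers' worth of data, and the combined footprint is 384 KiB. Compiling the snippet is the check: the static_assert lines fail if the listed parameters are inconsistent.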
Obtaining Profile Data
Use the msProf tool to obtain the operator simulation pipeline diagram and on-board profiling data. The MDL template is used to optimize the MTE2 transfer efficiency, and therefore the MTE2 pipeline is analyzed.
Analyzing Main Bottlenecks
- The following figure shows the profiling data before optimization, with the Matmul operator using the default Norm template. According to the aic_time data in column C, the maximum operator execution time across cores is 83.68 μs. Combined with the aic_mte2_time data in column L and the aic_mte2_ratio data in column M, the average MTE2 time is 75.64 μs, accounting for more than 92% of the total time. Therefore, the MTE2 pipeline time needs to be optimized.
- The following figure shows the pipeline before optimization. Because the input matrix has a large shape, MTE2 transfers basic blocks from the global memory to A1 or B1 many times in a loop, but only one basic block per transfer. As a result, the bandwidth utilization is low and the overall MTE2 transfer time is long. This stalls the subsequent MTE1 and MMAD pipelines, which spend a long time waiting on synchronization between pipelines. As shown in the red box, 16 MMAD instructions (singleCoreK/baseK = 16) are called to compute the first basic block (baseM x baseN). The time from the first MMAD instruction call on the left to the 16th MMAD instruction call on the right is 10.899 μs, most of which is spent waiting on pipeline synchronization.
Optimization Solution
The following figure shows the Matmul computation pipeline of the default Norm template. MTE2 transfers basic blocks from the global memory to A1 or B1 many times, one basic block per transfer. The advantage of the Norm template is its small startup overhead: the MTE1 pipeline can start early. The disadvantage is that, in large-shape scenarios, the many single-block MTE2 transfers result in low bandwidth utilization and high overall performance overhead.
The procedure for implementing the Norm template is as follows:
- Create a Matmul object and use the default Norm template parameter CFG_NORM.
    #define ASCENDC_CUBE_ONLY
    #include "lib/matmul_intf.h"

    using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
    using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
    using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
    using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
    AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_NORM> matmulObj; // Use CFG_NORM to define the Matmul object
The following figure shows the Matmul computation pipeline of the MDL template. MTE2 transfers multiple basic blocks from the global memory to A1 or B1 at a time: each transfer moves stepM x stepKa basic blocks to A1 or stepN x stepKb basic blocks to B1. The advantage of the MDL template is that these multi-block transfers improve bandwidth utilization. The subsequent MTE1 pipeline can reuse the data cached in A1 or B1 as much as possible, so MTE2 needs fewer transfers. The disadvantage is that each MTE2 transfer takes longer, and the MTE1 pipeline can start only after the MTE2 transfer is complete, so the MTE1 startup is delayed.
For a complete example of enabling the MDL template with the Matmul API, see the Matmul API performance optimization sample. The procedure for enabling the MDL template is as follows:
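To make the difference in transfer counts concrete, the sketch below counts how many MTE2 transfers are needed to stream matrix A along K for one output basic block in this case. It is a deliberately simplified model, not the library's scheduling logic: `CeilDiv` and `TransfersPerKSweep` are hypothetical helpers introduced here, and reuse of cached data across output blocks is ignored.

```cpp
#include <cstdint>

// Ceiling division for positive integers.
constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Hypothetical helper: number of transfers needed to move kIter basic blocks
// along K when each transfer carries blocksPerTransfer of them.
constexpr uint32_t TransfersPerKSweep(uint32_t kIter, uint32_t blocksPerTransfer) {
    return CeilDiv(kIter, blocksPerTransfer);
}

// Values from this case: singleCoreK = 1024, baseK = 64, stepM = 1, stepKa = 4.
constexpr uint32_t kIter = 1024 / 64;                              // 16
constexpr uint32_t normTransfers = TransfersPerKSweep(kIter, 1);   // one block per transfer
constexpr uint32_t mdlTransfers = TransfersPerKSweep(kIter, 1 * 4);  // stepM x stepKa blocks per transfer

static_assert(normTransfers == 16, "Norm: one transfer per basic block");
static_assert(mdlTransfers == 4, "MDL: four basic blocks per transfer");
```

In this model the MDL template issues a quarter of the transfers for matrix A, each four times larger. The same count applies to matrix B, because stepN x stepKb is also 4.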
- Create a Matmul object and use the default MDL template parameter CFG_MDL.
    #define ASCENDC_CUBE_ONLY
    #include "lib/matmul_intf.h"

    using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
    using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>;
    using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
    using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
    AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_MDL> matmulObj; // Use CFG_MDL to define the Matmul object
Verifying Optimization Benefits
- The following figure shows the profiling data after optimization. According to the aic_time data in column C, the maximum operator execution time across cores is 53.4 μs, down from 83.68 μs before tuning. According to the aic_mte2_time data in column L, the average MTE2 time drops from 75.64 μs before tuning to 46.24 μs.
- The following figure shows the pipeline after optimization. Compared with the default Norm template, the MDL template lets MTE2 transfer multiple basic blocks at a time, reducing the overall number of MTE2 transfers. In addition, because multiple basic blocks arrive in A1/B1 per transfer, the subsequent MTE1 pipeline can reuse the data cached in A1/B1 as much as possible, which reduces the pipeline synchronization wait time and improves overall operator performance. As shown in the red box, computing the first basic block (baseM x baseN) still requires 16 MMAD instruction calls (singleCoreK/baseK = 16). The time from the first MMAD instruction call on the left to the 16th MMAD instruction call on the right is about 5.198 μs, down from 10.899 μs before tuning, so the pipeline synchronization wait time is much shorter.
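To put the improvement in relative terms, the sketch below computes the percentage reduction for each pair of before and after values quoted above. `ReductionPercent` is a hypothetical helper written for this illustration; the inputs are the measurements reported in this case.

```cpp
// Hypothetical helper: relative reduction from `before` to `after`, in percent.
constexpr double ReductionPercent(double before, double after) {
    return (before - after) / before * 100.0;
}

// Measurements quoted in this case, in microseconds.
constexpr double aicTimeReduction = ReductionPercent(83.68, 53.4);     // about 36%
constexpr double mte2TimeReduction = ReductionPercent(75.64, 46.24);   // about 39%
constexpr double mmadSpanReduction = ReductionPercent(10.899, 5.198);  // about 52%
```

That is, the maximum operator execution time falls by roughly a third, and the time span of the 16 MMAD calls for the first basic block is roughly halved.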
Summary
The MDL template can be enabled in scenarios where the input shape is large, the number of MTE2 transfers is large, and the MTE1 pipeline spends a long time waiting for the MTE2 pipeline. Because MTE2 transfers multiple basic blocks from the global memory to A1 or B1 at a time, the subsequent MTE1 pipeline can reuse the data cached in A1/B1 as much as possible, which reduces the number of MTE2 transfers and improves operator performance.



