Matmul High-level API Enabling Multi-core K-axis Staggered Access to Device Memory

Case Study

This case study shows the performance gain from enabling multi-core K-axis staggered access to device memory in a matrix multiplication operator built on the high-level Matmul API. When multiple cores run Matmul in parallel, input matrix A or B resides in global memory (GM), and the cores read the same matrix data, the cores access the same GM addresses at the same time. The resulting address access conflicts degrade operator performance. With multi-core K-axis staggered access to device memory enabled, the cores that share the split matrix start moving data from different GM addresses along the K axis, which alleviates the address access conflicts and improves operator performance.

Figure 1 Access address conflict

Figure 2 Address conflict mitigation
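The idea behind staggered access can be sketched in plain C++. The actual reorder policy is internal to the Matmul library and is not described in this case, so the rotation scheme and the function name StaggeredKOrder below are assumptions used only for illustration: each core visits all K-axis blocks, but starts from a block offset derived from its core index.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical illustration only: the real reorder policy is internal to the
// Matmul library. Returns the order in which one core visits the K-axis
// blocks, rotated by the core index so that cores sharing the same matrix
// data begin at different GM addresses.
std::vector<uint32_t> StaggeredKOrder(uint32_t coreIdx, uint32_t kBlocks)
{
    assert(kBlocks > 0);
    std::vector<uint32_t> order(kBlocks);
    const uint32_t start = coreIdx % kBlocks;  // per-core starting block
    for (uint32_t i = 0; i < kBlocks; ++i) {
        order[i] = (start + i) % kBlocks;      // wrap around; each block is visited once
    }
    return order;
}
```

With four K blocks, core 0 would visit blocks 0, 1, 2, 3 while core 1 would visit 1, 2, 3, 0. Every core still accumulates over all K blocks, so the product is mathematically unchanged; only the order of GM reads differs.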

Applicable scenario for enabling multi-core K-axis staggered access to device memory:
- Matmul is executed on multiple cores and the K-axis value of the input matrix is large.

Restrictions on enabling multi-core K-axis staggered access to device memory:
- The K axis of the input matrix is not fully loaded. That is, the data along the K direction of the matrix cannot all be moved to and kept in the L1 Buffer at the same time.
- Only the MDL template is supported.
- Matmul computation is performed on multiple cores.
- Matrix A or B is located in the GM.
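The restrictions above can be collected into a single host-side check. This is a minimal sketch: the struct KReorderInputs and the function CanEnableKdimReorderLoad are hypothetical names, not part of the Matmul API, and the caller is assumed to know already whether the K axis is fully loaded into the L1 Buffer.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the Matmul API. Each field mirrors one
// restriction from the list above.
struct KReorderInputs {
    bool usesMdlTemplate;    // the Matmul object is built from the MDL template
    uint32_t usedCoreNum;    // number of cores that run Matmul
    bool aOrBInGm;           // matrix A or B is located in the GM
    bool kFullyLoadedInL1;   // whole K direction fits and stays in the L1 Buffer
};

// Returns true only when all four restrictions are satisfied.
bool CanEnableKdimReorderLoad(const KReorderInputs& in)
{
    return in.usesMdlTemplate && in.usedCoreNum > 1 && in.aOrBInGm && !in.kFullyLoadedInL1;
}
```

Such a check only decides whether the option is allowed. Whether it pays off still depends on how large the K axis is.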

The operator specifications are as follows.

**Table 1** Operator case specifications
| Input | Shape | Data Type | Format |
|-------|-------|-----------|--------|
| a | 768, 6144 | float16 | ND |
| b | 6144, 2048 | float16 | ND |

Obtaining Profile Data

Use the msProf tool to obtain the Operator Simulation Pipeline and On-board Profiling data, and focus on analyzing the MTE2 pipeline.

Analyzing Main Bottlenecks

The profiling data (PipeUtilization.csv) before optimization is as follows. The average value of aic_mte2_ratio reaches 0.93, indicating that MTE2 accounts for most of the operator execution time. The operator is therefore MTE2 bound. In this case, the matrix is tiled in the M and N directions, and the K axis is not split (singleCoreK equals the full K value, 6144). The single-core shape [singleCoreM, singleCoreN, singleCoreK] is [128, 512, 6144], and the basic block shape [baseM, baseN, baseK] is [128, 256, 64]. Each time the data of matrix A is loaded, multiple cores may access the same GM address at the same time, causing an address conflict. As a result, the MTE2 movement efficiency decreases and the MTE2 execution duration increases.
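The tiling figures above can be checked with a short calculation. This is a sketch under the assumption that each single-core block is handled by one core. The names TilingSummary, CeilDiv and Summarize are hypothetical and do not come from the tiling API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of how the operator shape is split across cores.
struct TilingSummary {
    uint32_t blocksM;      // single-core blocks along M
    uint32_t blocksN;      // single-core blocks along N
    uint32_t totalBlocks;  // blocksM * blocksN
    uint32_t kSteps;       // baseK-sized steps along K inside one single-core block
};

constexpr uint32_t CeilDiv(uint32_t a, uint32_t b)
{
    return (a + b - 1) / b;
}

TilingSummary Summarize(uint32_t m, uint32_t n, uint32_t singleCoreM,
                        uint32_t singleCoreN, uint32_t singleCoreK, uint32_t baseK)
{
    assert(singleCoreM > 0 && singleCoreN > 0 && baseK > 0);
    TilingSummary s{};
    s.blocksM = CeilDiv(m, singleCoreM);
    s.blocksN = CeilDiv(n, singleCoreN);
    s.totalBlocks = s.blocksM * s.blocksN;
    s.kSteps = CeilDiv(singleCoreK, baseK);
    return s;
}
```

For this case, Summarize(768, 2048, 128, 512, 6144, 64) gives 6 blocks along M, 4 blocks along N, 24 blocks in total, and 96 K steps per block. The 4 blocks that share an M index read the same rows of matrix A, which is where simultaneous access to the same GM address can occur.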

The movement efficiency of MTE2 can also be verified by checking its bandwidth utilization. As shown in the following figure, the analysis of Memory.csv shows that the average bandwidth utilization of MTE2 is only 34.4%.

According to the OpBasicInfo.csv file, the overall execution time of the operator is 98.72 μs before optimization.

Optimization Solution

Enable multi-core K-axis staggered access to device memory: When creating a Matmul object, set the enableKdimReorderLoad parameter in MatmulConfig to true. For details about the enableKdimReorderLoad parameter, see MatmulConfig.

For a complete example of enabling multi-core K-axis staggered access to device memory, see the operator sample for staggered data loading along the K axis. The procedure for enabling the function is as follows:

1. Set the enableKdimReorderLoad parameter in the MDL template to true to enable multi-core K-axis staggered access to device memory.

    // Copy the default MDL template configuration and enable K-axis staggered loading.
    constexpr MatmulConfig GetMDLKDimReorderConfig()
    {
        auto CFG = CFG_MDL;
        CFG.enableKdimReorderLoad = true;
        return CFG;
    }
    constexpr static MatmulConfig MM_CFG = GetMDLKDimReorderConfig();

2. Create a Matmul object based on the customized MatmulConfig template parameter.

    // A, B, C, and bias are all located in the GM in ND format; MM_CFG carries the customized configuration.
    AscendC::Matmul<AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, cType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, biasType>, MM_CFG> matmulObj;
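The copy-and-modify pattern used for the configuration can be tried out without the Ascend C toolkit. The sketch below uses hypothetical stand-ins (MockMatmulConfig, MOCK_CFG_MDL, MockMatmul) that only mimic the shape of the real types. The default value false for the mock field is an assumption, and the real MatmulConfig has many more fields.

```cpp
// Hypothetical stand-ins for illustration only; these are not Ascend C types.
struct MockMatmulConfig {
    bool enableKdimReorderLoad;
};

// Stand-in for the default MDL template configuration (default value assumed).
constexpr MockMatmulConfig MOCK_CFG_MDL{false};

// Copy the default configuration and change one field at compile time (C++14).
constexpr MockMatmulConfig GetMockKDimReorderConfig()
{
    auto cfg = MOCK_CFG_MDL;
    cfg.enableKdimReorderLoad = true;
    return cfg;
}
constexpr MockMatmulConfig MOCK_MM_CFG = GetMockKDimReorderConfig();

// A class template that receives the configuration as a template parameter,
// so the setting is fixed at compile time.
template <const MockMatmulConfig& CFG>
struct MockMatmul {
    static constexpr bool kReorder = CFG.enableKdimReorderLoad;
};

static_assert(!MOCK_CFG_MDL.enableKdimReorderLoad, "default stays unchanged");
static_assert(MOCK_MM_CFG.enableKdimReorderLoad, "customized copy has the flag set");
```

Because the configuration is a compile-time constant, the default template object is never mutated, and each Matmul instantiation can carry its own customized copy.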

Verifying Optimization Benefits

The following shows the profiling data (PipeUtilization.csv) after optimization. As shown in the figure, the time consumed by MTE2 is significantly reduced. The average time consumed by MTE2 is reduced from 90 μs to 69.87 μs, and the maximum time consumed by MTE2 is reduced from 91.94 μs to 75.82 μs.

The following shows the bandwidth utilization of MTE2 (Memory.csv). The average bandwidth utilization increases from 34.4% to 41.7%.

According to the OpBasicInfo.csv file, the overall time consumed by the operator is reduced from 98.72 μs to 85.68 μs, and the performance is improved by 13.2%.
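The reported improvement can be reproduced from the measured durations. The helper below is a minimal sketch, and the name ReductionPercent is hypothetical.

```cpp
#include <cassert>

// Relative reduction of a duration, in percent of the value before optimization.
double ReductionPercent(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}
```

Applied to the figures in this case, ReductionPercent(98.72, 85.68) is about 13.2 for the overall operator time, ReductionPercent(90.0, 69.87) is about 22.4 for the average MTE2 time, and ReductionPercent(91.94, 75.82) is about 17.5 for the maximum MTE2 time.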

Summary

In the scenario where Matmul is executed on multiple cores, if the value of the K-axis of the input matrix is large (generally greater than 4096), you can use the MDL template and enable multi-core K-axis staggered access to device memory to relieve address access conflicts, improve the MTE2 movement efficiency, and optimize the operator performance.
