Enabling Multi-Core K-Axis Staggered Memory Access with the Matmul High-Level API

Case Study

This case uses the Matmul high-level API for matrix multiplication and improves operator performance by enabling staggered access to device memory along the K axis across cores. When multiple cores run Matmul in parallel, input matrix A or B is located in the GM, and the cores operate on the same matrix data, several cores may access the same GM address at the same time. The resulting address access conflicts degrade operator performance. When multi-core K-axis staggered access is enabled, the cores that traverse the same matrix data along the K axis start their transfers from different GM addresses, which alleviates the conflicts and improves operator performance.

Figure 1 Access address conflict
Figure 2 Address conflict mitigation
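
The idea behind the staggered start addresses can be illustrated with a small host-side sketch. This is not the Matmul library's implementation: the helper name KBlockOrder and the even rotation policy are assumptions made for illustration, because the text above does not specify how each core's start address is chosen. The sketch relies only on the fact that the partial products along K can be accumulated in any order, so each core may begin at a different K block and wrap around.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Ascend C): returns the order in which one
// core visits the K-axis blocks of a matrix that it shares with other cores.
// stagger == false: every core starts at block 0, so all cores hit the same
//                   GM address in the same step.
// stagger == true : each core starts at a different block and wraps around,
//                   so concurrent loads target different GM addresses.
std::vector<uint32_t> KBlockOrder(uint32_t coreIdx, uint32_t coreNum,
                                  uint32_t kBlocks, bool stagger)
{
    assert(coreNum > 0 && kBlocks > 0);
    // Assumed policy: spread the start blocks evenly over the K axis.
    uint64_t start = stagger
        ? static_cast<uint64_t>(coreIdx % coreNum) * kBlocks / coreNum
        : 0;
    std::vector<uint32_t> order;
    order.reserve(kBlocks);
    for (uint32_t i = 0; i < kBlocks; ++i) {
        order.push_back(static_cast<uint32_t>((start + i) % kBlocks));
    }
    return order;
}
```

For example, with 4 cores sharing a matrix of 96 K blocks, the assumed policy makes the cores start at blocks 0, 24, 48, and 72. Every core still visits all 96 blocks exactly once.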
  • Application scenario for enabling multi-core K-axis staggered memory access:

    Matmul is executed by multiple cores, and the K axis of the input matrix is large.

  • Restrictions on enabling multi-core K-axis staggered memory access:
    • The K axis of the input matrix is not fully loaded. That is, the data along the K direction of the matrix cannot be moved into or held in the L1 buffer all at once.
    • Only the MDL template is supported.
    • Matmul computation is performed on multiple cores.
    • Matrix A or matrix B is located in the GM.

The operator specifications are as follows.

Table 1 Operator Case Specifications

Input    Shape         Data type    Format
a        768, 6144     float16      ND
b        6144, 2048    float16      ND

Obtaining Profile Data

Use the msProf tool to collect the operator's simulation pipeline data and on-board profiling data, focusing on the MTE2 pipeline.

Analyzing Main Bottlenecks

The following figure shows the profile data (PipeUtilization.csv) before tuning. The average value of aic_mte2_ratio is 0.93, so MTE2 accounts for most of the total operator execution time and the operator is MTE2-bound. In this example, the matrix is split in the M and N directions. The single-core shape [singleCoreM, singleCoreN, singleCoreK] is [128, 512, 6144], and the basic block shape [baseM, baseN, baseK] is [128, 256, 64]. Each time the data of matrix A is loaded, multiple cores may access the same GM address at the same time. The resulting address conflicts reduce MTE2 transfer efficiency and increase MTE2 execution time.
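
The shapes quoted above can be turned into a few numbers that make the conflict easier to see. The following is a minimal sketch, assuming the split is even and each core computes exactly one singleCoreM x singleCoreN tile. The struct and function names are hypothetical and are not Ascend C APIs. The L1 capacity is deliberately not hard-coded, so compare the byte counts against the L1 size of your target chip.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the case-study shapes (not an Ascend C type).
struct CaseShape {
    uint64_t M, N, K;
    uint64_t singleCoreM, singleCoreN;
    uint64_t baseM, baseN, baseK;
    uint64_t bytesPerElem;  // float16 occupies 2 bytes
};

constexpr CaseShape kCase{768, 2048, 6144, 128, 512, 128, 256, 64, 2};

constexpr uint64_t CeilDiv(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

// Number of cores along M. Equals the number of cores that read the same
// column band of matrix B.
constexpr uint64_t CoresAlongM(const CaseShape& s) { return CeilDiv(s.M, s.singleCoreM); }

// Number of cores along N. Equals the number of cores that read the same
// row band of matrix A.
constexpr uint64_t CoresAlongN(const CaseShape& s) { return CeilDiv(s.N, s.singleCoreN); }

// Number of baseK-sized steps needed to walk the whole K axis.
constexpr uint64_t KSteps(const CaseShape& s) { return CeilDiv(s.K, s.baseK); }

// Bytes needed to hold one base block of A (or B) over the full K axis.
constexpr uint64_t FullKBytesA(const CaseShape& s) { return s.baseM * s.K * s.bytesPerElem; }
constexpr uint64_t FullKBytesB(const CaseShape& s) { return s.baseN * s.K * s.bytesPerElem; }

static_assert(CoresAlongM(kCase) * CoresAlongN(kCase) == 24, "6 x 4 tiles");
static_assert(KSteps(kCase) == 96, "6144 / 64 steps along K");
```

Under these assumptions, 4 cores walk the same row band of matrix A in 96 K steps. A full-K slab would need 1,572,864 bytes for A and 3,145,728 bytes for B, which is consistent with the case being one where the K axis is not fully loaded.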

The transfer efficiency of MTE2 can also be verified by checking its bandwidth usage. As shown in the following figure, analysis of Memory.csv shows that the average MTE2 bandwidth usage is only 34.4%.

According to the OpBasicInfo.csv file, the total operator execution time before tuning is 98.72 μs.

Optimization Solution

Enable staggered access to the memory along the K axis. When creating a Matmul object, set the enableKdimReorderLoad parameter in MatmulConfig to true. For details about the enableKdimReorderLoad parameter, see MatmulConfig.

For the complete example of enabling staggered access to the memory along the K axis, see the operator sample for staggered data loading along the K axis. The main steps to enable this function are as follows:

  1. Set the enableKdimReorderLoad parameter in the MDL template to true to enable staggered access to the device memory along the K axis.
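    // Start from the predefined MDL template and switch on K-axis staggered loading.
    constexpr MatmulConfig GetMDLKDimReorderConfig()
    {
        auto CFG = CFG_MDL;
        CFG.enableKdimReorderLoad = true;
        return CFG;
    }
    // Compile-time configuration passed to the Matmul object as a template argument.
    constexpr static MatmulConfig MM_CFG = GetMDLKDimReorderConfig();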
    
  2. Create a Matmul object based on the customized MatmulConfig template parameters.
    // A, B, C, and bias are all located in the GM in ND format; MM_CFG enables K-axis staggered loading.
    AscendC::Matmul<AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, cType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, biasType>, MM_CFG> matmulObj;
    

Verifying Optimization Benefits

With the operator tiling parameters unchanged, the following figure shows the profiling data (PipeUtilization.csv) after tuning. The time consumed by MTE2 is significantly reduced: the average falls from 90 μs to 69.87 μs, and the maximum falls from 91.94 μs to 75.82 μs.

The following figure shows the bandwidth utilization of MTE2 (Memory.csv). The average bandwidth utilization increases from 34.4% to 41.7%.

According to the OpBasicInfo.csv file, the overall time consumed by the operator is reduced from 98.72 μs to 85.68 μs, and the performance is improved by 13.2%.
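
The reported gains can be cross-checked with a short calculation. The helper name ReductionPercent is hypothetical, and the inputs are the timings quoted in this section. The improvement is computed as the relative reduction in execution time, which is an assumption about how the 13.2% figure was derived.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: relative reduction of a duration, in percent.
constexpr double ReductionPercent(double before, double after)
{
    return (before - after) / before * 100.0;
}

// Timings quoted in this section, in microseconds.
constexpr double kTotalBefore = 98.72, kTotalAfter = 85.68;
constexpr double kMte2AvgBefore = 90.0, kMte2AvgAfter = 69.87;
constexpr double kMte2MaxBefore = 91.94, kMte2MaxAfter = 75.82;

// Returns true when two percentages agree within the given tolerance.
inline bool Near(double value, double expected, double tol = 0.05)
{
    return std::fabs(value - expected) < tol;
}
```

With these inputs, the total operator time drops by about 13.2%, the average MTE2 time by about 22.4%, and the maximum MTE2 time by about 17.5%.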

Summary

In the scenario where Matmul is executed on multiple cores, if the K axis of the input matrix is large (generally greater than 4096), you can use the MDL template and enable the staggered memory access function of the K axis to relieve address access conflicts, improve the MTE2 transfer efficiency, and optimize the operator performance.