Enabling the IBShare Template to Share Matrix B Data Using the Matmul High-Level API

Case Study

This case shows how to use the Matmul high-level API in a matrix multiplication operator and how to enable the IBShare template to share matrix B data and improve operator performance. IBShare keeps one copy of the matrix A or B data in the L1 Buffer for sharing, which reduces the overhead of repeated MTE2 data movement. IBShare can be enabled for matrix A only, for matrix B only, or for both.

  • Application scenarios of IBShare

    In the MIX scenario (including matrix and vector computation), the GM addresses of matrix A or B of multiple AIVs are the same, and the matrix A or B shared by multiple AIVs is fully loaded in the L1 Buffer.

  • Restrictions on enabling IBShare
    • If IBShare is enabled for both matrix A and matrix B, IBShare must also be enabled for matrix A and matrix B of other Matmul objects in the same operator.
    • If IBShare is enabled for both matrix A and matrix B, only the IterateAll API can be called to obtain the matrix computation result, and the result can be output only to the global memory.

The operator specifications are as follows.

Table 1 Operator specifications

    Input   Shape      Data type   Format
    a       64, 384    float16     ND
    b       384, 256   float16     ND

The AI processor used in this case has 20 cores, each containing one AIC and two AIVs. Because the input shape is small, this case runs on a single core, that is, one AIC and its two AIVs. Following the usage of the SetDim API in MIX mode, set the number of cores involved in the computation to 2 in the tiling program. The tiling parameters are as follows:

  • Original shape: M = 64, N = 256, K = 384.
  • Single-core shape: singleCoreM = 32, singleCoreN = 256, singleCoreK = 384. Matrix A is split into two halves, with one half processed on AIV0 and the other half processed on AIV1. AIV0 and AIV1 use the same matrix B data.
  • Basic block shape: baseM = 32, baseN = 256, baseK = 64.
  • Tiling parameters related to the L1 cache: stepM = 1, stepN = 1, stepKa = 6, stepKb = 6.
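
The split described above can be sketched in plain C++. This is a minimal illustration, not code from the sample: it assumes a row-major ND layout and element (not byte) offsets, and the names GmOffsets and OffsetsForAiv are hypothetical. It shows that the two AIVs start at different rows of matrix A and of the output, while both read matrix B from the same GM address, which is the precondition for IBShare.

```cpp
#include <cstdint>

// Shapes from this case: A is M x K, B is K x N, C is M x N.
constexpr uint32_t M = 64, N = 256, K = 384;
constexpr uint32_t singleCoreM = 32;

// Hypothetical helper type: element offsets into the GM buffers of A, B, and C.
struct GmOffsets {
    uint64_t a;
    uint64_t b;
    uint64_t c;
};

// Assumes row-major ND layout: AIV i starts at row i * singleCoreM of A and C.
// B is not split, so every AIV reads it from the same GM address (offset 0).
constexpr GmOffsets OffsetsForAiv(uint32_t aivIdx) {
    return GmOffsets{static_cast<uint64_t>(aivIdx) * singleCoreM * K,
                     0,
                     static_cast<uint64_t>(aivIdx) * singleCoreM * N};
}

static_assert(M / singleCoreM == 2, "matrix A is split across two AIVs");
static_assert(OffsetsForAiv(0).b == OffsetsForAiv(1).b, "both AIVs share matrix B");
```

In a real kernel these offsets would be applied to the global tensors before they are passed to the Matmul object; that part depends on the operator and is omitted here.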

Obtaining Profile Data

Use the msProf tool to obtain the operator simulation pipeline diagram and on-board profiling data. Because IBShare reduces repeated MTE2 data transfer by sharing the same matrix A or B data in the L1 Buffer, the analysis focuses on the MTE2 pipeline.

Analyzing Main Bottlenecks

  • The following figure shows the pipeline before optimization, with the IBShare template disabled and the default Norm template in use. The black box marks the MTE2 transfers issued by AIV0: 12 in total, 6 for matrix A (stepM x stepKa = 6) and 6 for matrix B (stepN x stepKb = 6). The red box marks the MTE2 transfers issued by AIV1, which are essentially the same as those of AIV0. AIV1 uses the same matrix B as AIV0, and because singleCoreN = baseN x stepN and singleCoreK = baseK x stepKb, matrix B can be fully loaded into the L1 Buffer. Once AIV0 has moved matrix B into the L1 Buffer, the data can stay cached there for AIV1 to reuse, which saves the repeated MTE2 transfer of matrix B.

  • The following figure shows the profiling data before optimization. The value of aic_time in column C is 10.29 μs, and the value of aic_mte2_time in column K is 5.56 μs.
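
The full-load condition used in the analysis above can be checked with a few lines of plain C++. This is a sketch under stated assumptions: TilingSketch, IsBFullyLoaded, and BBytesInL1 are hypothetical names, not Ascend C types or APIs, and the byte count assumes float16 (2 bytes per element) as in Table 1. Whether the resulting footprint fits the L1 Buffer of a given processor must be checked against that processor's specifications.

```cpp
#include <cstdint>

// Hypothetical container for the tiling values of this case (not an Ascend C type).
struct TilingSketch {
    uint32_t singleCoreM, singleCoreN, singleCoreK;
    uint32_t baseM, baseN, baseK;
    uint32_t stepM, stepN, stepKa, stepKb;
};

// Matrix B is fully loaded when one L1 load covers the whole single-core N and K extent.
constexpr bool IsBFullyLoaded(const TilingSketch& t) {
    return t.singleCoreN == t.baseN * t.stepN && t.singleCoreK == t.baseK * t.stepKb;
}

// Bytes of matrix B held in L1 per load, assuming float16 (2 bytes per element).
constexpr uint64_t BBytesInL1(const TilingSketch& t) {
    return static_cast<uint64_t>(t.baseN) * t.stepN * t.baseK * t.stepKb * 2;
}

// Values from the tiling parameters listed earlier in this case.
constexpr TilingSketch kCaseTiling{32, 256, 384, 32, 256, 64, 1, 1, 6, 6};
static_assert(IsBFullyLoaded(kCaseTiling), "matrix B must be fully loaded for IBShare");
```

For this case the check passes and the matrix B footprint is 256 x 384 x 2 bytes, that is, 192 KiB.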

Optimization Solution

The following figure shows the Matmul computation pipeline when the IBShare template is disabled (the default Norm template is used). MTE2 transfers basic blocks from the global memory to A1 or B1 multiple times. Even when two consecutive transfers carry the same basic block of matrix B, the data is transferred again.

Figure 1 Matmul pipeline when the IBShare template is disabled

The following figure shows the Matmul computation pipeline when the IBShare template is enabled. MTE2 still transfers basic blocks from the global memory to A1 or B1 multiple times, but when two consecutive transfers would carry the same basic block of matrix B, the second transfer is skipped and the data already in B1 is reused.

Figure 2 Matmul pipeline when the IBShare template is enabled

For a complete example of enabling the IBShare template for the Matmul API to share matrix B, see the sample that enables IBShare for matrix B only. To enable IBShare, do the following:

  1. Create a Matmul object.
    #define ASCENDC_CUBE_ONLY
    #include "lib/matmul_intf.h"
    
    using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
    using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType, false, LayoutMode::NONE, true>; // Set the IBSHARE parameter of matrix B to true.
    using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
    using BIAS_TYPE =  AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
    AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_IBSHARE_NORM> matmulObj; // Use the default IBSHARE template parameter CFG_IBSHARE_NORM to define the Matmul object.
    

Verifying Optimization Benefits

  • The following figure shows the pipeline after optimization. The MTE2 transfers issued by AIV0 (black box) are the same as before optimization. The MTE2 transfers issued by AIV1 (red box) drop from 12 (matrix A and matrix B) to 6 (matrix A only), which saves the 6 MTE2 transfers of matrix B.

  • The following figure shows the profiling data after optimization. The aic_time in column C is 9.93 μs, an improvement of 3.55% over the 10.29 μs before optimization. The aic_mte2_time in column K is 4.71 μs, an improvement of 15.46% over the 5.56 μs before optimization.
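
The transfer counts read from the two pipeline figures can be reproduced with a small plain C++ sketch. It counts transfers the same way as the analysis in this case (stepM x stepKa for matrix A, stepN x stepKb for matrix B). Mte2Transfers is a hypothetical helper, and the sketch assumes that with IBShare enabled for matrix B only AIV0 loads matrix B.

```cpp
#include <cstdint>

// Step values from the tiling parameters of this case.
constexpr uint32_t stepM = 1, stepN = 1, stepKa = 6, stepKb = 6;

// MTE2 transfers one AIV issues for each matrix, counted as in the analysis above.
constexpr uint32_t kTransfersA = stepM * stepKa;
constexpr uint32_t kTransfersB = stepN * stepKb;

// Hypothetical helper: MTE2 transfers issued by AIV `aivIdx`.
// Assumption: with IBShare on matrix B, only AIV0 loads B and other AIVs reuse it from L1.
constexpr uint32_t Mte2Transfers(uint32_t aivIdx, bool ibShareB) {
    return kTransfersA + ((ibShareB && aivIdx != 0) ? 0u : kTransfersB);
}

static_assert(Mte2Transfers(0, false) == 12 && Mte2Transfers(1, false) == 12,
              "Norm template: each AIV loads both A and B");
static_assert(Mte2Transfers(0, true) == 12 && Mte2Transfers(1, true) == 6,
              "IBShare on B: AIV1 loads only A");
```

Summed over both AIVs, the count falls from 24 to 18 transfers, which is consistent with aic_mte2_time shrinking more than aic_time.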

Summary

In the MIX scenario (including matrix and vector computation), if multiple AIVs use the same GM address for matrix A or B, and the matrix A or B shared by these AIVs can be fully loaded into the L1 Buffer, you can enable the IBShare template to share that matrix data in the L1 Buffer. This reduces the overhead of repeated MTE2 data transfer and improves operator performance.