Matmul High-level API Enabling IBShare Template for Sharing Matrix B Data

Case Study

This case demonstrates the performance gain obtained by enabling IBShare for matrix B in a matrix multiplication operator implemented with the high-level Matmul API. IBShare shares the same matrix A or matrix B data in the L1 buffer, which reduces repeated MTE2 data movement and improves operator performance. IBShare can be enabled for matrix A only, for matrix B only, or for both matrices at the same time.

Application scenarios of IBShare
IBShare applies to the MIX scenario (which includes both cube computation and vector computation) when multiple AIVs use the same GM address for matrix A or matrix B, and the matrix A or matrix B reused by those AIVs can be fully loaded into the L1 buffer.

Restrictions on enabling IBShare
- If IBShare is enabled for both matrix A and matrix B, IBShare must also be enabled for matrix A and matrix B of other Matmul objects in the same operator.
- In the scenario where IBShare is enabled for both matrix A and matrix B, only the IterateAll API can be called to obtain the cube computation result, and the result can be output only to the global memory.

The operator specifications are as follows.

**Table 1** Operator specifications
Input	Shape	Data Type	Format
a	64, 384	float16	ND
b	384, 256	float16	ND

The AI processor used in this case has 20 cores in total, and each core contains one AIC and two AIVs. Because the input shape is small, this case uses a single core (one AIC and two AIVs) as an example. Refer to the usage of the SetDim API in MIX mode, and set the number of cores involved in the computation to 2 in the tiling program. The tiling parameters are as follows:

- Original shape: M = 64, N = 256, K = 384.
- Single-core shape: singleCoreM = 32, singleCoreN = 256, singleCoreK = 384. Matrix A is split into two halves, one processed on AIV0 and the other on AIV1. AIV0 and AIV1 use the same matrix B data.
- Base block shape: baseM = 32, baseN = 256, baseK = 64.
- Tiling parameters related to the L1 buffer: stepM = 1, stepN = 1, stepKa = 6, stepKb = 6.
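The relationships among these tiling parameters can be checked at compile time. The following is a minimal sketch in plain C++, not Ascend C code: the `TilingSketch` struct and the helper functions are hypothetical names introduced only for illustration, and the values are copied from the list above. It checks that M splits across two AIVs, that matrix B meets the full-load condition (singleCoreN = baseN x stepN and singleCoreK = baseK x stepKb), and how many base blocks lie along K.

```cpp
#include <cstdint>

// Hypothetical helper type for illustration; it is not part of the Matmul API.
struct TilingSketch {
    uint32_t M, N, K;
    uint32_t singleCoreM, singleCoreN, singleCoreK;
    uint32_t baseM, baseN, baseK;
    uint32_t stepM, stepN, stepKa, stepKb;
};

// Values taken from the tiling parameters of this case.
constexpr TilingSketch kCase{64, 256, 384,   // M, N, K
                             32, 256, 384,   // singleCoreM, singleCoreN, singleCoreK
                             32, 256, 64,    // baseM, baseN, baseK
                             1, 1, 6, 6};    // stepM, stepN, stepKa, stepKb

// Full-load condition for matrix B as stated in this case:
// one load step covers the whole single-core N and K extents.
constexpr bool BFullyLoaded(const TilingSketch& t) {
    return t.singleCoreN == t.baseN * t.stepN &&
           t.singleCoreK == t.baseK * t.stepKb;
}

// Number of M slices, that is, how many AIVs each take one slice of matrix A.
constexpr uint32_t SliceCountM(const TilingSketch& t) {
    return t.M / t.singleCoreM;
}

// Number of base blocks along the K direction.
constexpr uint32_t BaseBlocksK(const TilingSketch& t) {
    return t.singleCoreK / t.baseK;
}

// Size of the single-core matrix B in bytes, assuming float16 (2 bytes per element).
constexpr uint64_t MatrixBBytes(const TilingSketch& t) {
    return static_cast<uint64_t>(t.singleCoreK) * t.singleCoreN * 2;
}

static_assert(BFullyLoaded(kCase), "matrix B must satisfy the full-load condition");
static_assert(SliceCountM(kCase) == 2, "matrix A is split across AIV0 and AIV1");
static_assert(BaseBlocksK(kCase) == 6, "six base blocks along K");
```

With these values, `MatrixBBytes(kCase)` evaluates to 196608 bytes (192 KiB). Whether that fits depends on the L1 buffer capacity of the target processor, which this sketch does not model.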

Obtaining Profile Data

Use the msProf tool to obtain the operator simulation pipeline and the on-board profiling data. Because IBShare works by sharing the same matrix A or matrix B data in the L1 buffer to reduce repeated MTE2 transfers, the analysis focuses on the MTE2 pipeline.

Analyzing Main Bottlenecks

The following figure shows the pipeline before optimization. The IBShare template is not enabled, so the default Norm template is used. The black box marks the MTE2 transfers initiated by AIV0: 12 transfers in total, 6 for matrix A (stepM x stepKa = 6) and 6 for matrix B (stepN x stepKb = 6). The red box marks the MTE2 transfers initiated by AIV1, which are essentially the same as those of AIV0. In this case, AIV1 uses the same matrix B as AIV0, and singleCoreN = baseN x stepN and singleCoreK = baseK x stepKb, so matrix B can be fully loaded into the L1 buffer. If the matrix B data that AIV0 transfers to the L1 buffer is kept there for AIV1 to reuse, the repeated MTE2 transfers of matrix B can be avoided.
The following figure shows the profiling data before optimization. The value of aic_time in column C is 10.29 μs, and the value of aic_mte2_time in column K is 5.56 μs.

Optimization Solution

The following figure shows the Matmul computation pipeline when the IBShare template is disabled (the default Norm template is used). MTE2 transfers base blocks from the global memory to A1 or B1 multiple times. Even if two consecutive transfers carry the same base block data of matrix B, the data is still transferred again.

Figure 1 Matmul pipeline when the IBShare template is disabled

The following figure shows the Matmul computation pipeline when the IBShare template is enabled. MTE2 transfers base blocks from the global memory to A1 or B1 multiple times. If two consecutive transfers would carry the same base block data of matrix B, the data is not transferred again. Instead, the data already transferred to B1 the first time is reused.

Figure 2 Matmul pipeline when the IBShare template is enabled
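The difference between the two pipelines can be expressed as a simple transfer count. The sketch below is a simplified model in plain C++, not Ascend C code and not how the hardware schedules MTE2: the names `TransferCount` and `CountMte2Transfers` are hypothetical, the block counts come from the tiling parameters of this case, and AIV0 is assumed to load matrix B before AIV1 needs it.

```cpp
#include <cstdint>

// Hypothetical result type: MTE2 base-block transfers initiated by each AIV.
struct TransferCount {
    uint32_t aiv0;
    uint32_t aiv1;
};

constexpr uint32_t kBlocksA = 6;  // stepM x stepKa = 1 x 6
constexpr uint32_t kBlocksB = 6;  // stepN x stepKb = 1 x 6

// Counts GM-to-L1 transfers for two AIVs that use different halves of
// matrix A but the same matrix B. When shareB is true, matrix B is loaded
// once and then reused from L1 (the IBShare behaviour described above).
constexpr TransferCount CountMte2Transfers(bool shareB) {
    TransferCount count{0, 0};
    bool bResidentInL1 = false;  // becomes true after the first load of B
    for (uint32_t aiv = 0; aiv < 2; ++aiv) {
        uint32_t transfers = kBlocksA;  // each AIV always loads its own half of A
        if (!(shareB && bResidentInL1)) {
            transfers += kBlocksB;      // B must be fetched from global memory
            bResidentInL1 = true;
        }
        if (aiv == 0) {
            count.aiv0 = transfers;
        } else {
            count.aiv1 = transfers;
        }
    }
    return count;
}
```

Under these assumptions, `CountMte2Transfers(false)` yields 12 transfers for each AIV, and `CountMte2Transfers(true)` yields 12 for AIV0 and 6 for AIV1.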

For the complete example of enabling the IBShare template to share matrix B with the Matmul API, see the sample of enabling IBShare for matrix B only. The procedure for enabling IBShare is as follows:

Create a Matmul object.

#define ASCENDC_CUBE_ONLY
#include "lib/matmul_intf.h"

using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>;
using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType, false, LayoutMode::NONE, true>; // Set the IBSHARE parameter of matrix B to true.
using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>;
using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>;
AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_IBSHARE_NORM> matmulObj; // Use the default IBShare template parameter CFG_IBSHARE_NORM to define the Matmul object.

Verifying Optimization Benefits

The following figure shows the pipeline after optimization. The MTE2 transfers initiated by AIV0 (black box) are the same as before optimization. The MTE2 transfers initiated by AIV1 (red box) drop from 12 (matrices A and B) to 6 (matrix A only), which eliminates the 6 MTE2 transfers of matrix B.
The following figure shows the profiling data after optimization. The aic_time in column C is 9.93 μs, roughly 3.5% shorter than the 10.29 μs before optimization. The aic_mte2_time in column K is 4.71 μs, roughly 15% shorter than the 5.56 μs before optimization.
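The relative reductions can be recomputed from the rounded times quoted above. The helper below is a plain C++ sketch with a hypothetical name (`ReductionPercent`). Because it works from values rounded to two decimal places, its results may differ slightly in the last digits from percentages derived from unrounded profiler data.

```cpp
// Relative reduction of a time measurement, in percent.
// before and after must use the same unit (microseconds here).
constexpr double ReductionPercent(double before, double after) {
    return (before - after) / before * 100.0;
}

// Times quoted in this case, in microseconds.
constexpr double kAicTimeBefore = 10.29;
constexpr double kAicTimeAfter = 9.93;
constexpr double kMte2TimeBefore = 5.56;
constexpr double kMte2TimeAfter = 4.71;

constexpr double kAicReduction = ReductionPercent(kAicTimeBefore, kAicTimeAfter);
constexpr double kMte2Reduction = ReductionPercent(kMte2TimeBefore, kMte2TimeAfter);
```

From the rounded inputs, `kAicReduction` comes out at about 3.5 and `kMte2Reduction` at about 15.3.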

Summary

In the MIX scenario (which includes both cube computation and vector computation), when multiple AIVs use the same GM address for matrix A or matrix B and the reused matrix can be fully loaded into the L1 buffer, you can enable the IBShare template to share that matrix A or matrix B data in the L1 buffer. This reduces repeated MTE2 data transfers and improves operator performance.
