Enabling the IBShare Template in the Matmul High-Level API to Share Matrix A and Matrix B Data

Case Study

This case shows the performance gain in a fused operator scenario when IBShare is enabled for both matrix A and matrix B in a matrix multiplication implemented with the Matmul high-level API.

The key optimization measures in this case are as follows:

  • Core division logic: Cores are divided from the perspective of the Cube core. The Matmul result is written to the GM and then consumed by the Vector core for subsequent computation.
  • IBShare enable: IBShare is enabled for both matrix A and matrix B.

The operator specifications are as follows.

Table 1 Operator specifications

Input    Shape      Data type    Format
x        128,384    float16      ND
y        384,256    float16      ND

For details about the samples of enabling and disabling IBShare, see MatmulABshare sample and .

Obtaining Profile Data

Use the msProf tool to obtain the profile data of the operator, focusing on the MTE2, Cube, and Scalar pipelines.

Analyzing Main Bottlenecks

Figure 1 Profile data before optimization

According to the profile data, the average execution time of the operator is 27.11 μs, and the average aic_scalar_time is 26.27 μs, which is about 97% of the total. The performance bottleneck is therefore the Scalar pipeline on the Cube side.

Developing Optimization Solutions

If IBShare is disabled for matrix A and matrix B, the data must be tiled along the K, M, or N axis. The following uses the K axis as an example. Without IBShare, the operator tiles the data from the perspective of the AIV blocks: AIV0 initiates the A0 x B0 computation, and AIV1 initiates the A1 x B1 computation.

Figure 2 IBShare disabled
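The K-axis tiling described above can be sketched in plain host-side C++. This is only an illustration of the arithmetic, not Ascend C code and not a model of the hardware: the Matrix type and the MatmulKRange and KSplitMatchesFull functions are hypothetical names introduced here. The sketch assumes the K axis is cut into two halves, so that the A0 x B0 and A1 x B1 partial products accumulate into the same output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major float matrix; a host-side stand-in used only for illustration.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float &at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Accumulates A[:, kBegin:kEnd] x B[kBegin:kEnd, :] into C.
void MatmulKRange(const Matrix &a, const Matrix &b, std::size_t kBegin, std::size_t kEnd, Matrix &c)
{
    assert(a.cols == b.rows && c.rows == a.rows && c.cols == b.cols);
    assert(kBegin <= kEnd && kEnd <= a.cols);
    for (std::size_t m = 0; m < a.rows; ++m) {
        for (std::size_t k = kBegin; k < kEnd; ++k) {
            for (std::size_t n = 0; n < b.cols; ++n) {
                c.at(m, n) += a.at(m, k) * b.at(k, n);
            }
        }
    }
}

// Returns true when a two-way split along K reproduces the unsplit product.
bool KSplitMatchesFull(std::size_t M, std::size_t K, std::size_t N)
{
    Matrix a(M, K), b(K, N);
    // Small integer values keep every float sum exact.
    for (std::size_t i = 0; i < a.data.size(); ++i) a.data[i] = static_cast<float>(i % 3);
    for (std::size_t i = 0; i < b.data.size(); ++i) b.data[i] = static_cast<float>(i % 5);

    Matrix full(M, N), split(M, N);
    MatmulKRange(a, b, 0, K, full);       // unsplit reference
    MatmulKRange(a, b, 0, K / 2, split);  // first half of K: A0 x B0
    MatmulKRange(a, b, K / 2, K, split);  // second half of K: A1 x B1
    return full.data == split.data;
}
```

Calling KSplitMatchesFull(128, 384, 256) uses the shapes from Table 1 and returns true, because the two partial products sum to the full result.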

When IBShare is enabled for both matrix A and matrix B, both matrices can be loaded to the L1 buffer without tiling or separate transfers. In addition, the Cube unit is driven by a single core (AIV0), which initiates the computation once. The result is shared by AIV0 and AIV1, reducing the number of Cube responses and the amount of Scalar computation.

Figure 3 IBShare enabled
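As a rough aid for reasoning about the data that is shared, the following sketch computes the footprint of the two input matrices from Table 1. It is plain C++ rather than Ascend C, and the names MatrixBytes and FitsInBuffer are hypothetical. The buffer capacity is passed in by the caller because the L1 size is device-specific and is not stated in this case. The sketch also says nothing about how the Matmul API actually stages data.

```cpp
#include <cassert>
#include <cstddef>

// float16 occupies 2 bytes per element.
constexpr std::size_t kHalfBytes = 2;

// Size in bytes of a dense rows x cols matrix.
constexpr std::size_t MatrixBytes(std::size_t rows, std::size_t cols, std::size_t elemBytes)
{
    return rows * cols * elemBytes;
}

// Shapes taken from Table 1.
constexpr std::size_t kABytes = MatrixBytes(128, 384, kHalfBytes); // matrix A (input x)
constexpr std::size_t kBBytes = MatrixBytes(384, 256, kHalfBytes); // matrix B (input y)

// capacityBytes is supplied by the caller; no device value is assumed here.
constexpr bool FitsInBuffer(std::size_t capacityBytes)
{
    return kABytes + kBBytes <= capacityBytes;
}
```

With these shapes, matrix A occupies 96 KiB and matrix B occupies 192 KiB, for a combined 288 KiB.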

The following figure compares the data interaction with IBShare enabled and disabled.

To enable the optimization, set the IBShare parameter of MatmulType to true for both matrix A and matrix B. The code is as follows:

// Enables IBShare for both matrix A and matrix B.
constexpr bool isABshare = true;
template <typename aType, typename bType, typename cType> class MatmulABshareKernel {
public:
    __aicore__ inline MatmulABshareKernel(){};
    __aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace,
                                const TCubeTiling &tiling, AscendC::TPipe *pipe);
    __aicore__ inline void Process(AscendC::TPipe *pipe);
    __aicore__ inline void CalcOffset(int32_t blockIdx, const TCubeTiling &tiling, int32_t &offsetA, int32_t &offsetB,
                                      int32_t &offsetC);
    // The last template argument of MatmulType is the IBShare switch; it is set for both A and B.
    AscendC::Matmul<AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType, false, LayoutMode::NONE, isABshare>,
           AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType, false, LayoutMode::NONE, isABshare>,
           AscendC::MatmulType<AscendC::TPosition::VECIN, CubeFormat::ND, cType>>
        matmulObj;
    AscendC::GlobalTensor<aType> aGlobal;
    AscendC::GlobalTensor<bType> bGlobal;
    AscendC::GlobalTensor<cType> cGlobal;
    TCubeTiling tiling;
};
template <typename aType, typename bType, typename cType>
__aicore__ inline void MatmulABshareKernel<aType, bType, cType>::Init(GM_ADDR a, GM_ADDR b, GM_ADDR c,
                                                                GM_ADDR workspace, const TCubeTiling &tiling, AscendC::TPipe *pipe)
{
    this->tiling = tiling;
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ aType *>(a), tiling.M * tiling.Ka);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ bType *>(b), tiling.Kb * tiling.N);
    cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ cType *>(c), tiling.M * tiling.N);
    int32_t offsetA, offsetB, offsetC;
    CalcOffset(AscendC::GetBlockIdx(), tiling, offsetA, offsetB, offsetC); // calculate offset
    aGlobal = aGlobal[offsetA];
    bGlobal = bGlobal[offsetB];
    cGlobal = cGlobal[offsetC];
}
template <typename aType, typename bType, typename cType>
__aicore__ inline void
MatmulABshareKernel<aType, bType, cType>::CalcOffset(int32_t blockIdx, const TCubeTiling &tiling,
                                                             int32_t &offsetA, int32_t &offsetB, int32_t &offsetC)
{
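    // No per-block offsets are applied: with IBShare enabled for both A and B,
    // every block addresses the matrices from the start.
    offsetA = 0;
    offsetB = 0;
    offsetC = 0;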
}

Verifying Optimization Benefits

After the optimization, the average execution time is 22.44 μs, down from 27.11 μs before the optimization, a reduction of about 17%.

Figure 4 Profile data after optimization
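The benefit can be quantified from the two measured averages (27.11 μs before, 22.44 μs after). The helper names below are hypothetical, and the sketch is plain C++ intended only to make the arithmetic explicit.

```cpp
#include <cassert>
#include <cmath>

// Percentage by which the execution time dropped.
constexpr double ReductionPercent(double beforeUs, double afterUs)
{
    return (beforeUs - afterUs) / beforeUs * 100.0;
}

// Speedup factor: how many times faster the optimized version runs.
constexpr double Speedup(double beforeUs, double afterUs)
{
    return beforeUs / afterUs;
}

// Measured averages reported in this case, in microseconds.
constexpr double kBeforeUs = 27.11;
constexpr double kAfterUs = 22.44;
```

For these values, ReductionPercent(kBeforeUs, kAfterUs) is about 17.2 and Speedup(kBeforeUs, kAfterUs) is about 1.21.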

Summary

In a fused operator scenario, enabling IBShare for both matrix A and matrix B of the Matmul computation divides the cores from the perspective of the Cube core. This effectively reduces the Scalar overhead on the Cube side and improves performance.