Matmul High-Level API Performance Optimization by Enabling IBShare

Case Study

This case demonstrates the performance gain achieved by enabling IBShare for both matrix A and matrix B in a matrix multiplication implemented with the Matmul high-level API in the fusion operator scenario.

The key optimization measures in this case are as follows:

  • Core division logic: Cores are divided from the perspective of the Cube core. The Matmul computation result is output to GM and then provided to the Vector cores for subsequent computation.
  • IBShare enablement: IBShare is enabled for both matrix A and matrix B.

The operator specifications are as follows.

Table 1 Operator specifications

Input | Shape   | Data type | Format
------|---------|-----------|-------
x     | 128,384 | float16   | ND
y     | 384,256 | float16   | ND

For complete samples with IBShare enabled and disabled, see the matmulABshare sample and the MatmulNoABshare sample.

Obtaining Profile Data

Use the msProf tool to collect profiling data for the operator, focusing on the MTE2, Cube, and Scalar pipelines.
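As a reference only, an operator-level profiling run might be launched as follows. The exact msProf subcommands and options vary with the CANN version, and the executable name here is a placeholder:

msprof op --application="./matmul_abshare_demo" --output="./prof_output"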

Analyzing Main Bottlenecks

Figure 1 Profile data before optimization

The preceding profiling data shows a high proportion of Scalar time, indicating that the performance bottleneck lies in the Scalar pipeline. Before the optimization, the average execution time is 27.06 μs.

Developing Optimization Solutions

If IBShare is disabled for matrix A and matrix B, the data must be tiled along the K, M, or N axis, and each core loads its own tiles of both matrices separately; a sketch of the resulting per-core offset computation follows Figure 2.

Figure 2 IBShare disabled
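For contrast with the IBShare-enabled kernel shown later, whose CalcOffset returns all-zero offsets, the following is a minimal sketch of what the per-core offset computation could look like when IBShare is disabled and the output is tiled along the M and N axes. The kernel class name is hypothetical and the block-to-tile mapping is illustrative; only the TCubeTiling fields match the sample code below.

template <typename aType, typename bType, typename cType>
__aicore__ inline void MatmulNoABshareKernel<aType, bType, cType>::CalcOffset(
    int32_t blockIdx, const TCubeTiling &tiling, int32_t &offsetA, int32_t &offsetB, int32_t &offsetC)
{
    // Illustrative only: number of single-core tiles along the M axis.
    int32_t mSingleBlocks = (tiling.M + tiling.singleCoreM - 1) / tiling.singleCoreM;
    int32_t mCoreIndx = blockIdx % mSingleBlocks;
    int32_t nCoreIndx = blockIdx / mSingleBlocks;
    // Each core addresses its own rows of A and columns of B, so the same
    // K-axis data may be loaded repeatedly by different cores.
    offsetA = mCoreIndx * tiling.Ka * tiling.singleCoreM;
    offsetB = nCoreIndx * tiling.singleCoreN;
    offsetC = mCoreIndx * tiling.N * tiling.singleCoreM + nCoreIndx * tiling.singleCoreN;
}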

When IBShare is enabled for both matrix A and matrix B, the two matrices can be shared in the L1 buffer, eliminating the need for tiling and repeated loading. In addition, the Cube unit is driven by AIV0 (a single core) and initiates one computation; the result is shared by AIV0 and AIV1, reducing the number of Cube responses and Scalar computations.

Figure 3 IBShare enabled

The preceding figures compare the message interaction when IBShare is disabled (Figure 2) and enabled (Figure 3).

To enable IBShare, set the IBShare parameter of MatmulType to true for both matrix A and matrix B. The code is as follows:

constexpr bool isABshare = true;
template <typename aType, typename bType, typename cType> class MatmulABshareKernel {
public:
    __aicore__ inline MatmulABshareKernel() {}
    __aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace,
                                const TCubeTiling &tiling, AscendC::TPipe *pipe);
    __aicore__ inline void Process(AscendC::TPipe *pipe);
    __aicore__ inline void CalcOffset(int32_t blockIdx, const TCubeTiling &tiling, int32_t &offsetA, int32_t &offsetB,
                                      int32_t &offsetC);
    // IBShare is enabled for both matrix A and matrix B through the last
    // template parameter of MatmulType.
    Matmul<MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType, false, LayoutMode::NONE, isABshare>,
           MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType, false, LayoutMode::NONE, isABshare>,
           MatmulType<AscendC::TPosition::VECIN, CubeFormat::ND, cType>>
        matmulObj;
    AscendC::GlobalTensor<aType> aGlobal;
    AscendC::GlobalTensor<bType> bGlobal;
    AscendC::GlobalTensor<cType> cGlobal;
    TCubeTiling tiling;
};
template <typename aType, typename bType, typename cType>
__aicore__ inline void MatmulABshareKernel<aType, bType, cType>::Init(GM_ADDR a, GM_ADDR b, GM_ADDR c,
                                                                      GM_ADDR workspace, const TCubeTiling &tiling,
                                                                      AscendC::TPipe *pipe)
{
    this->tiling = tiling;
    // Bind the GM addresses of the input and output matrices.
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ aType *>(a), tiling.M * tiling.Ka);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ bType *>(b), tiling.Kb * tiling.N);
    cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ cType *>(c), tiling.M * tiling.N);
    int32_t offsetA, offsetB, offsetC;
    CalcOffset(AscendC::GetBlockIdx(), tiling, offsetA, offsetB, offsetC); // calculate the per-core offsets
    aGlobal = aGlobal[offsetA];
    bGlobal = bGlobal[offsetB];
    cGlobal = cGlobal[offsetC];
}
template <typename aType, typename bType, typename cType>
__aicore__ inline void
MatmulABshareKernel<aType, bType, cType>::CalcOffset(int32_t blockIdx, const TCubeTiling &tiling,
                                                     int32_t &offsetA, int32_t &offsetB, int32_t &offsetC)
{
    // With IBShare enabled, matrix A and matrix B are shared in the L1 buffer,
    // so every core uses the same GM data and no per-core offset is required.
    offsetA = 0;
    offsetB = 0;
    offsetC = 0;
}
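The Process function is omitted from the excerpt above. As a rough sketch under the assumption that the kernel uses the standard iterate flow of the Matmul high-level API (SetTensorA, SetTensorB, IterateAll, and End are existing Matmul APIs; the body shown here is illustrative, not the sample's actual implementation), it might look like this:

template <typename aType, typename bType, typename cType>
__aicore__ inline void MatmulABshareKernel<aType, bType, cType>::Process(AscendC::TPipe *pipe)
{
    // Feed the shared matrices to the Matmul object and run the whole
    // computation; the result lands in GM, where the Vector cores consume it.
    matmulObj.SetTensorA(aGlobal);
    matmulObj.SetTensorB(bGlobal);
    matmulObj.IterateAll(cGlobal);
    matmulObj.End();
}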

Verifying Optimization Benefits

After the optimization, the average execution time is 20.61 μs, about 24% shorter than the 27.06 μs measured before the optimization.

Figure 4 Profile data after optimization

Summary

In the fusion operator scenario, enabling IBShare for both matrix A and matrix B of the Matmul computation allows cores to be divided from the perspective of the Cube core, which effectively reduces the Scalar overhead on the Cube side and improves performance.