Matmul High-Level API Performance Optimization by Enabling IBShare
Case Study
This case study shows the performance gain obtained by enabling IBShare for both matrix A and matrix B in a Matmul high-level API computation in the fusion operator scenario.
The key optimization measures in this case are as follows:
- Core division logic: Cores are divided from the perspective of the Cube core. The Matmul computation result is output to the GM and provided to the Vector cores for subsequent computation.
- IBShare enabling: IBShare is enabled for both matrix A and matrix B.
The operator specifications are as follows.
| Input | Shape | Data type | Format |
|---|---|---|---|
| x | 128, 384 | float16 | ND |
| y | 384, 256 | float16 | ND |
For samples that enable and disable IBShare, see the matmulABshare sample and the MatmulNoABshare sample.
Obtaining Profile Data
Use the msProf tool to obtain the profile data of the operator, focusing on the MTE2, Cube, and Scalar pipelines.
Analyzing Main Bottlenecks

According to the preceding profile data, the proportion of time spent in the Scalar pipeline is high, indicating that the performance bottleneck lies in the Scalar pipeline. Before the optimization, the average execution time is 27.06 μs.
Developing Optimization Solutions
If IBShare is disabled for matrix A and matrix B, the data must be tiled along the K, M, or N axis and loaded repeatedly by each core.

When IBShare is enabled for both matrix A and matrix B, the two matrices share the L1 buffer, so no tiling or repeated loading is needed. In addition, the Cube unit is driven by AIV0 (a single core) to initiate the computation, and the computation result is shared by AIV0 and AIV1, reducing the number of Cube responses and Scalar computations.

The following figure compares the message interaction with IBShare enabled and with IBShare disabled.

Set the IBShare parameter of MatmulType to true for both matrix A and matrix B to enable IBShare. The code is as follows:
constexpr bool isABshare = true;

template <typename aType, typename bType, typename cType> class MatmulABshareKernel {
public:
    __aicore__ inline MatmulABshareKernel() {}
    __aicore__ inline void Init(GM_ADDR a, GM_ADDR b, GM_ADDR c, GM_ADDR workspace,
                                const TCubeTiling &tiling, AscendC::TPipe *pipe);
    __aicore__ inline void Process(AscendC::TPipe *pipe);
    __aicore__ inline void CalcOffset(int32_t blockIdx, const TCubeTiling &tiling, int32_t &offsetA,
                                      int32_t &offsetB, int32_t &offsetC);

    // IBShare is enabled for both A and B via the last template parameter of MatmulType.
    Matmul<MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType, false, LayoutMode::NONE, isABshare>,
           MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType, false, LayoutMode::NONE, isABshare>,
           MatmulType<AscendC::TPosition::VECIN, CubeFormat::ND, cType>>
        matmulObj;
    AscendC::GlobalTensor<aType> aGlobal;
    AscendC::GlobalTensor<bType> bGlobal;
    AscendC::GlobalTensor<cType> cGlobal;
    TCubeTiling tiling;
};

template <typename aType, typename bType, typename cType>
__aicore__ inline void MatmulABshareKernel<aType, bType, cType>::Init(GM_ADDR a, GM_ADDR b, GM_ADDR c,
                                                                      GM_ADDR workspace,
                                                                      const TCubeTiling &tiling,
                                                                      AscendC::TPipe *pipe)
{
    this->tiling = tiling;
    aGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ aType *>(a), tiling.M * tiling.Ka);
    bGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ bType *>(b), tiling.Kb * tiling.N);
    cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ cType *>(c), tiling.M * tiling.N);
    int32_t offsetA, offsetB, offsetC;
    CalcOffset(AscendC::GetBlockIdx(), tiling, offsetA, offsetB, offsetC); // calculate the per-core offsets
    aGlobal = aGlobal[offsetA];
    bGlobal = bGlobal[offsetB];
    cGlobal = cGlobal[offsetC];
}

template <typename aType, typename bType, typename cType>
__aicore__ inline void MatmulABshareKernel<aType, bType, cType>::CalcOffset(int32_t blockIdx,
                                                                            const TCubeTiling &tiling,
                                                                            int32_t &offsetA, int32_t &offsetB,
                                                                            int32_t &offsetC)
{
    // With IBShare enabled, every core reads the same full A and B and shares the result,
    // so all offsets are zero.
    offsetA = 0;
    offsetB = 0;
    offsetC = 0;
}
Verifying Optimization Benefits
After the optimization, the average execution time is 20.61 μs, a reduction of about 24% compared with the 27.06 μs measured before the optimization.

Summary
In the fusion operator scenario, enabling IBShare for both matrix A and matrix B of the Matmul computation allows cores to be divided from the perspective of the Cube core, effectively reducing the Scalar overhead on the Cube side and improving performance.