Matmul High-level API Enabling NBuffer33 Template

Case Study

This case shows the performance gain from enabling the NBuffer33 template for matrix multiplication with the high-level Matmul API. The NBuffer33 template works as follows: the A matrix computed on a single core is divided into 3x3 basic blocks, all of which are fully loaded into the L1 buffer and kept there. In each step, these blocks are multiplied with 3x1 basic blocks of the B matrix, while DoubleBuffer concurrently moves in the 3x1 basic blocks of B required for the next step. This repeats until the matrix multiplication in the singleCoreN direction is complete.

In the MTE2 Bound scenario, the NBuffer33 algorithm tiles the data so that the movement pipelines are staggered and less data is moved at a time. This balances the data traffic of MTE2 and FixPipe and distributes bandwidth evenly between them. For details about the NBuffer33 template, see MatmulPolicy.
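The step order described above can be sketched on the host as a plain list of events. This is only a schematic under stated assumptions, not Ascend C kernel code and not the library's implementation: the function name `NBuffer33Schedule`, the two-slot buffer indexing, and the event strings are all hypothetical and exist only to show how loading the next B slice overlaps with the current computation.

```cpp
#include <string>
#include <vector>

// Host-side schematic (hypothetical helper, not part of the Matmul API).
// Returns the order of steps for nSteps slices of B along singleCoreN.
// "||" marks two actions that are assumed to overlap in time.
std::vector<std::string> NBuffer33Schedule(int nSteps) {
    std::vector<std::string> steps;
    if (nSteps <= 0) {
        return steps;
    }
    // A is loaded once and stays resident in L1 for every step.
    steps.push_back("load A (all blocks, kept in L1)");
    // The first B slice has nothing to overlap with.
    steps.push_back("load B[0] -> buf 0");
    for (int n = 0; n < nSteps; ++n) {
        std::string s = "compute A x B[" + std::to_string(n) +
                        "] from buf " + std::to_string(n % 2);
        if (n + 1 < nSteps) {
            // Prefetch the next slice into the other buffer slot.
            s += " || load B[" + std::to_string(n + 1) +
                 "] -> buf " + std::to_string((n + 1) % 2);
        }
        steps.push_back(s);
    }
    return steps;
}
```

Calling `NBuffer33Schedule(3)` yields five entries: the single A load, the first B load, and three compute steps, of which the first two each overlap with the load of the following B slice.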

Application scenarios of enabling the NBuffer33 template
In the MTE2 Bound scenario, the NBuffer33 template can be enabled when the tiling parameters meet the restrictions.

Restrictions on enabling the NBuffer33 template
- MatmulConfig is set to the MDL template.
- The logical memory positions of matrices A and B support only TPosition::GM.
- Only CUBE_ONLY (cube computation) is supported. The MIX mode (including cube computation and vector computation) is not supported.
- Only the IterateAll API can be used to obtain the computation result matrix C of Matmul.
- The values of stepM, stepKa, and stepKb are less than or equal to 3, and the following condition is met: stepKa = stepKb = Ceil(singleCoreK/baseK).
- The sum of the base block size of matrix A (fully loaded) and the base block size of matrix B (loaded) does not exceed the size of the L1 buffer.
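The step-related restriction in the list above can be checked with a few lines of host code before tiling is generated. The sketch below is illustrative only: `TilingSketch` and `MeetsStepRestriction` are hypothetical names rather than part of the Matmul tiling API, and the check covers only the stepM/stepKa/stepKb condition, not the template, position, mode, or L1 size restrictions.

```cpp
#include <cstdint>

// Hypothetical container for the tiling fields named in the restriction.
struct TilingSketch {
    int32_t singleCoreK;
    int32_t baseK;
    int32_t stepM;
    int32_t stepKa;
    int32_t stepKb;
};

// Integer ceiling division for positive operands.
constexpr int32_t CeilDiv(int32_t a, int32_t b) { return (a + b - 1) / b; }

// True when stepM, stepKa, stepKb <= 3 and
// stepKa == stepKb == Ceil(singleCoreK / baseK).
constexpr bool MeetsStepRestriction(const TilingSketch& t) {
    return t.stepM <= 3 && t.stepKa <= 3 && t.stepKb <= 3 &&
           t.stepKa == CeilDiv(t.singleCoreK, t.baseK) &&
           t.stepKb == t.stepKa;
}
```

With the tiling used later in this case (singleCoreK = 192, baseK = 64, stepM = 2, stepKa = stepKb = 3), `MeetsStepRestriction(TilingSketch{192, 64, 2, 3, 3})` evaluates to true, because Ceil(192/64) = 3.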

The operator specifications are as follows.

**Table 1** Operator specifications
Input	Shape	Data Type	Format
a	256, 192	float16	ND
b	192, 512	float16	ND

The AI processor used in this case has 24 cores in total. The operator uses the high-level Matmul API in CUBE_ONLY mode with the MDL template. The tiling parameters are as follows:

Original shape: M = 256, N = 512, K = 192.
Single-core shape: singleCoreM = 256, singleCoreN = 256, singleCoreK = 192.
Base block shape: baseM = 128, baseN = 256, baseK = 64.
Tiling parameters related to the L1 cache: stepM = 2, stepN = 1, stepKa = 3, stepKb = 3.
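To make these numbers concrete, the following sketch estimates how much L1 space the staged blocks would occupy for this tiling. It rests on assumptions that the source does not confirm: the footprint is taken as step count times base block size times the float16 element size, the B slice is counted twice for DoubleBuffer, and the L1 capacity is a caller-supplied value because it is not stated here. All function names are hypothetical.

```cpp
#include <cstdint>

constexpr int64_t kFloat16Bytes = 2;  // float16 elements, as in Table 1

// Integer ceiling division for positive operands.
constexpr int64_t CeilDiv64(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Bytes of A staged in L1: stepM x stepKa base blocks of baseM x baseK elements.
constexpr int64_t AStagedBytes(int64_t stepM, int64_t stepKa,
                               int64_t baseM, int64_t baseK) {
    return stepM * stepKa * baseM * baseK * kFloat16Bytes;
}

// Bytes of one B slice in L1: stepKb x stepN base blocks of baseK x baseN elements.
constexpr int64_t BSliceBytes(int64_t stepKb, int64_t stepN,
                              int64_t baseK, int64_t baseN) {
    return stepKb * stepN * baseK * baseN * kFloat16Bytes;
}

// Assumed estimate: A staged once, plus two B slices for DoubleBuffer.
constexpr int64_t L1EstimateBytes(int64_t aBytes, int64_t bSliceBytes) {
    return aBytes + 2 * bSliceBytes;
}

// Hypothetical check against an L1 capacity supplied by the caller.
constexpr bool FitsInL1(int64_t estimateBytes, int64_t l1CapacityBytes) {
    return estimateBytes <= l1CapacityBytes;
}

// Number of single-core tiles that cover the original M x N output.
constexpr int64_t SingleCoreTiles(int64_t M, int64_t N,
                                  int64_t singleCoreM, int64_t singleCoreN) {
    return CeilDiv64(M, singleCoreM) * CeilDiv64(N, singleCoreN);
}

// Values from this case: stepM = 2, stepKa = stepKb = 3, stepN = 1,
// baseM = 128, baseN = 256, baseK = 64, M = 256, N = 512.
static_assert(AStagedBytes(2, 3, 128, 64) == 98304, "A occupies 96 KiB");
static_assert(BSliceBytes(3, 1, 64, 256) == 98304, "one B slice occupies 96 KiB");
static_assert(L1EstimateBytes(98304, 98304) == 294912, "estimate is 288 KiB");
static_assert(SingleCoreTiles(256, 512, 256, 256) == 2, "two single-core tiles");
```

Under these assumptions, A and one B slice each take 96 KiB, the double-buffered estimate is 288 KiB, and the 256 x 512 output is covered by two single-core tiles. Whether the estimate fits depends on the actual L1 capacity of the target processor, which `FitsInL1` leaves as an input.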

Obtaining Profile Data

Use the msProf tool to obtain the Operator Simulation Pipeline and On-board Profiling data, and focus the analysis on the Cube and FixPipe pipelines.

Analyzing Main Bottlenecks

The following figure shows the pipeline before optimization. With the default template of MatmulPolicy, matrices A and B are fully loaded, and their data is moved only once. As a result, MTE2 takes a long time to execute and the pipeline tasks run serially.
The profiling data before optimization is as follows, with the average aic_time being 34.01 μs.

Optimization Solution

Enabling the NBuffer33 template takes two steps. On the tiling side, call the SetMatmulConfigParams API to enable the NBuffer33 mode before calling the GetTiling API, so that the obtained tiling information meets the requirements. On the kernel side, enable the NBuffer33 template when creating the Matmul object. For a complete example, see the sample for enabling the NBuffer33 template policy. The specific procedure is as follows:

Tiling implementation

Before calling the GetTiling API to obtain the TCubeTiling structure, enable the NBuffer33 mode.

         
matmul_tiling::MatmulConfigParams matmulConfigParams(1, false,
    matmul_tiling::ScheduleType::N_BUFFER_33, /* NBuffer33 mode */
    matmul_tiling::MatrixTraverse::NOSET, false);
cubeTiling.SetMatmulConfigParams(matmulConfigParams);
if (cubeTiling.GetTiling(tilingData) == -1) {
    std::cout << "Generate tiling failed." << std::endl;
    return {};
}

Kernel implementation

Set the template parameter MatmulPolicy to the NBuffer33 template policy and create a Matmul object.

         
AscendC::MatmulImpl<
    AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType>,
    AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType>,
    AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, cType>,
    AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, biasType>,
    CFG_MDL,
    AscendC::MatmulCallBackFunc<nullptr, nullptr, nullptr>,
    AscendC::Impl::Detail::NBuffer33MatmulPolicy> matmulObj;

Verifying Optimization Benefits

The following figure shows the pipeline after optimization. The tiling parameters are unchanged, but because stepM is 2, the left matrix (A) is moved in two batches once the NBuffer33 mode is enabled. As shown in the figure, the computation that follows the first MTE2 transfer (MTE1, MMAD, and FIXPIPE) can run in parallel with the second MTE2 transfer. Moving the data in blocks can reduce the header overhead incurred when all the data is moved at once, which improves data loading performance.
The following figure shows the profiling data after optimization. The average aic_time is 32.66 μs, which is shorter than the 34.01 μs before optimization.
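As a quick worked calculation from the two averages reported in this case, the relative reduction in aic_time is:

```latex
\[
\frac{34.01 - 32.66}{34.01} = \frac{1.35}{34.01} \approx 0.0397 \approx 4.0\%
\]
```

That is, the average aic_time drops by 1.35 μs, or roughly 4%, for this shape and tiling.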

Summary

In the MTE2 Bound scenario, if the tiling parameters meet the condition that stepM, stepKa, and stepKb are less than or equal to 3, you can enable the NBuffer33 template to stagger the movement pipelines by splitting the matrix, reducing the size of data transferred at a time and balancing the data traffic between MTE2 and FixPipe.
