Enabling the NBuffer33 Template Using the Matmul High-Level API

Case Study

This case study shows how to use the Matmul high-level API in a matrix multiplication operator and enable the NBuffer33 template to improve operator performance.

The NBuffer33 template works as follows: the A matrix processed by a single core is divided into 3x3 basic blocks, which are fully loaded into and kept in the L1 buffer. In each step they are multiplied by 3x1 basic blocks of the B matrix, while the 3x1 basic blocks of the B matrix needed for the next step are transferred in concurrently using DoubleBuffer, until the matrix multiplication along the singleCoreN direction is complete.

In MTE2-bound scenarios, the data splitting of the NBuffer33 algorithm staggers the transfer pipeline, reduces the amount of data moved per transfer, and balances the MTE2 and FixPipe traffic so that bandwidth is shared evenly between the two. For details about the NBuffer33 template, see MatmulPolicy.

  • Application scenarios of enabling the NBuffer33 template

    In the MTE2 bound scenario, the NBuffer33 template can be enabled when the tiling parameters meet the constraints.

  • Constraints for enabling the NBuffer33 template
    • Only the MDL template of MatmulConfig is supported.
    • The logical memory locations of the A and B matrices support only TPosition::GM.
    • Only the pure Cube mode (only matrix computation) is supported. The MIX mode (including matrix computation and vector computation) is not supported.
    • The Matmul computation result matrix C can be obtained only through the IterateAll API.
    • The values of stepM, stepKa, and stepKb must be less than or equal to 3, and stepKa = stepKb = Ceil(singleCoreK/baseK) must hold.
    • The sum of the size of the fully loaded basic block of the A matrix and the size of the loaded basic block of the B matrix cannot exceed the size of the L1 buffer.
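
The two numeric constraints above can be checked before enabling the template. The following is a minimal sketch, not part of the Matmul API: the struct NBuffer33TilingSketch and the functions StepsAllowNBuffer33, L1BytesNeeded, and FitsInL1 are hypothetical names introduced for illustration. It assumes that the A footprint is stepM x stepKa basic blocks and the B footprint is stepKb x stepN basic blocks, and it takes the L1 capacity and the number of B buffers as parameters because neither is stated here.

```cpp
#include <cstdint>

// Hypothetical container for the tiling values this check needs. The field
// names mirror the parameters named in the constraints; this is not an API type.
struct NBuffer33TilingSketch {
    uint32_t singleCoreK;
    uint32_t baseM, baseN, baseK;
    uint32_t stepM, stepN, stepKa, stepKb;
};

constexpr uint32_t CeilDiv(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Step constraint: stepM, stepKa, stepKb <= 3 and
// stepKa == stepKb == Ceil(singleCoreK / baseK).
constexpr bool StepsAllowNBuffer33(const NBuffer33TilingSketch& t) {
    return t.stepM <= 3 && t.stepKa <= 3 && t.stepKb <= 3 &&
           t.stepKa == t.stepKb &&
           t.stepKa == CeilDiv(t.singleCoreK, t.baseK);
}

// Assumed L1 footprint in bytes: the fully loaded A blocks plus bBuffers
// copies of the loaded B blocks (bBuffers = 2 models double buffering).
constexpr uint64_t L1BytesNeeded(const NBuffer33TilingSketch& t,
                                 uint32_t elemBytes, uint32_t bBuffers) {
    return static_cast<uint64_t>(t.stepM) * t.baseM * t.stepKa * t.baseK * elemBytes +
           static_cast<uint64_t>(t.stepKb) * t.baseK * t.stepN * t.baseN * elemBytes * bBuffers;
}

// L1 constraint: the footprint must not exceed the L1 capacity supplied by the caller.
constexpr bool FitsInL1(const NBuffer33TilingSketch& t, uint32_t elemBytes,
                        uint32_t bBuffers, uint64_t l1Bytes) {
    return L1BytesNeeded(t, elemBytes, bBuffers) <= l1Bytes;
}
```

With the tiling used in this case (singleCoreK = 192, baseM = 128, baseN = 256, baseK = 64, stepM = 2, stepN = 1, stepKa = stepKb = 3) and float16 inputs (2 bytes per element), the step check passes and the footprint is 196608 bytes with one B buffer, or 294912 bytes if B is assumed to be double-buffered. Compare the result against the L1 capacity of your target chip.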

The operator specifications are as follows.

Table 1 Operator specifications

Input    Shape       Data type    Format
a        256, 192    float16      ND
b        192, 512    float16      ND

In this case, the AI processor has 24 cores. The pure Cube mode of the Matmul high-level API is enabled in the operator, and the MDL template is used. The tiling parameters are as follows:

  • Original shape: M = 256, N = 512, K = 192.
  • Single-core shape: singleCoreM = 256, singleCoreN = 256, singleCoreK = 192.
  • Basic block shape: baseM = 128, baseN = 256, baseK = 64.
  • Tiling parameters related to the L1 cache: stepM = 2, stepN = 1, stepKa = 3, stepKb = 3.
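
The arithmetic behind these tiling parameters can be sketched as follows. The helper names DivUp, SingleCoreTiles, BaseBlocks, and TileBytes are hypothetical and introduced only for illustration. The sketch assumes that the original shape is split only along M and N, with one single-core tile per core.

```cpp
#include <cstdint>

constexpr uint32_t DivUp(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

// Number of single-core tiles covering the original M x N output
// (assumption: no split along K, one tile per core).
constexpr uint32_t SingleCoreTiles(uint32_t m, uint32_t n,
                                   uint32_t singleCoreM, uint32_t singleCoreN) {
    return DivUp(m, singleCoreM) * DivUp(n, singleCoreN);
}

// Number of basic blocks along one axis of a single-core tile.
constexpr uint32_t BaseBlocks(uint32_t singleCoreLen, uint32_t baseLen) {
    return DivUp(singleCoreLen, baseLen);
}

// Size in bytes of a rows x cols input tile.
constexpr uint64_t TileBytes(uint32_t rows, uint32_t cols, uint32_t elemBytes) {
    return static_cast<uint64_t>(rows) * cols * elemBytes;
}
```

For this case the sketch gives 2 single-core tiles, so under the stated assumption only 2 of the 24 cores have work. Each core holds a 2 x 3 grid of A basic blocks (M x K) and a 3 x 1 grid of B basic blocks (K x N), and reads 98304 bytes of A and 98304 bytes of B in float16.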

Obtaining Profile Data

Use the msProf tool to obtain the operator simulation pipeline diagram and on-board profiling data, and analyze the pipeline status of the Cube and FixPipe.

Analyzing Main Bottlenecks

  • The following figure shows the pipeline before optimization. With the default MatmulPolicy template, matrices A and B are fully loaded and each matrix is transferred only once. As a result, the MTE2 execution time is long and the pipeline runs serially.

  • The following figure shows the profiling data before optimization. The average aic_time is 34.01 μs.

Optimization Solution

Enable the NBuffer33 template in two places. On the tiling side, call the SetMatmulConfigParams API to enable the NBuffer33 mode before calling the GetTiling API, so that the obtained tiling meets the requirements. On the kernel side, enable the NBuffer33 template when creating the Matmul object. For a complete example, see sample for enabling the NBuffer33 template policy. Perform the following steps:

  • Tiling implementation
    Before calling the GetTiling API to obtain the TCubeTiling structure, enable the NBuffer33 mode.
    matmul_tiling::MatmulConfigParams matmulConfigParams(1, false,
        matmul_tiling::ScheduleType::N_BUFFER_33, /* NBuffer33 mode */
        matmul_tiling::MatrixTraverse::NOSET, false);
    cubeTiling.SetMatmulConfigParams(matmulConfigParams);
    if (cubeTiling.GetTiling(tilingData) == -1) {
        std::cout << "Generate tiling failed." << std::endl;
        return {};
    }
    
  • Kernel implementation
    Set the template parameter MatmulPolicy to the NBuffer33 template policy and create a Matmul object.
    AscendC::MatmulImpl<
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, aType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, bType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, cType>,
        AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, biasType>, CFG_MDL,
        AscendC::MatmulCallBackFunc<nullptr, nullptr, nullptr>,
        AscendC::Impl::Detail::NBuffer33MatmulPolicy> matmulObj;
    

Verifying Optimization Benefits

  • The following figure shows the pipeline after optimization. The tiling parameters are unchanged, but because stepM is 2, the NBuffer33 mode splits the transfer of the left matrix (A) into two transfers. As shown in the figure, once the first MTE2 transfer ends, the computation that depends on it (MTE1, MMAD, and FIXPIPE) can run in parallel with the second MTE2 transfer. Transferring data in blocks removes part of the head overhead incurred when all data is moved in a single transfer, which improves data loading performance.
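
The overlap described above can be illustrated with a toy two-stage pipeline model. This is a sketch under simplifying assumptions, not a model of the actual hardware: PipelinedFinishTime is a hypothetical helper, the total transfer and compute times are split evenly across chunks, transfers run back to back, and the computation for a chunk starts only when both its transfer and the previous computation have finished.

```cpp
#include <algorithm>

// Finish time of a two-stage pipeline (transfer, then compute) when the work
// is split into `chunks` equal parts. Time units are arbitrary.
inline double PipelinedFinishTime(double totalLoad, double totalCompute, int chunks) {
    const double load = totalLoad / chunks;
    const double compute = totalCompute / chunks;
    double loadDone = 0.0;     // time at which the current chunk's transfer ends
    double computeDone = 0.0;  // time at which the current chunk's compute ends
    for (int i = 0; i < chunks; ++i) {
        loadDone += load;  // transfers are issued back to back
        // Compute waits for its own data and for the previous compute step.
        computeDone = std::max(loadDone, computeDone) + compute;
    }
    return computeDone;
}
```

With illustrative values of 20 time units of transfer and 10 of compute, one chunk finishes at 30 (fully serial), while two chunks finish at 25 because the first half of the compute overlaps the second transfer. The numbers are made up to show the effect and are not measurements from this case.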

  • The following figure shows the profiling data after optimization. The average aic_time is 32.66 μs, down from 34.01 μs before optimization, a reduction of about 4%.

Summary

In MTE2-bound scenarios, if the tiling parameters meet the condition that stepM, stepKa, and stepKb are less than or equal to 3, you can enable the NBuffer33 template. It staggers the transfer pipelines of the split matrices, reduces the amount of data moved per transfer, and balances the data traffic between MTE2 and FixPipe.