Matmul High-level API Enabling K-axis Tiling of Matrix Data in Multi-core Parallel Computation

Case Study

This case shows how to use the Matmul high-level API in a matrix multiplication operator and how to enable K-axis tiling of matrix data in multi-core parallel computation to improve operator performance. To run an operator in parallel on multiple cores, the matrix data must be tiled and the resulting data blocks allocated to different cores. Generally, only the M and N axes are tiled, and the K axis is left untiled. If M and N are small, however, tiling them yields too few blocks to occupy multiple cores, so tiling the K axis needs to be considered. With K-axis tiling enabled, the K axis of the matrix can be tiled so that the operator runs in parallel on multiple cores. When K is large, leaving the K axis untiled usually results in an excessively large amount of input data for a single core; with K-axis tiling enabled, the tiling policy can balance the input and output bandwidths more effectively.

Application scenarios of enabling K-axis tiling of matrix data in multi-core parallel computation
- If the value of the K axis is large and the values of the M axis and N axis are smaller than that of the K axis, the K axis can be tiled to enable more cores to execute the operator in parallel.
- If the values of the M, N, and K axes of the matrix are all large, the K axis can be tiled to better balance the input and output bandwidths.

Restrictions on enabling K-axis tiling of matrix data in multi-core parallel computation
- The result of matrix C can be output only to the global memory.
- Before the kernel code writes a matrix C slice result to the global memory for the first time, clear that global memory, and enable the AtomicAdd operation when obtaining the matrix C slice results. If the global memory is not cleared in advance, stale data already in the global memory is included in the accumulation and may cause precision problems.
- Bias cannot be used in the matrix multiplication.
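The clearing requirement can be illustrated with a small host-side model. This is only a sketch, not Ascend C kernel code: `accumulatePartials` is a hypothetical helper, and the plain `+=` stands in for the effect of AtomicAdd once all cores have finished.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only (hypothetical helper, not an Ascend C API).
// `c` plays the role of the global memory of matrix C, possibly holding stale
// values; each entry of `partials` is the slice result produced by one core.
std::vector<float> accumulatePartials(std::vector<float> c,
                                      const std::vector<std::vector<float>>& partials,
                                      bool clearFirst) {
    if (clearFirst) {
        for (float& v : c) {
            v = 0.0f;  // corresponds to clearing the global memory in advance
        }
    }
    for (const auto& p : partials) {
        assert(p.size() == c.size());
        for (std::size_t i = 0; i < c.size(); ++i) {
            c[i] += p[i];  // accumulate instead of overwrite
        }
    }
    return c;
}
```

With stale contents `{7, -3}` and partial results `{1, 2}` and `{3, 4}`, clearing first gives `{4, 6}`, whereas skipping the clear gives `{11, 3}`: the stale values leak into the sum.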

The operator specifications are as follows.

**Table 1** Operator specifications

| Input | Shape | Data Type | Format |
| --- | --- | --- | --- |
| a | 16, 1024 | float16 | ND |
| b | 1024, 16 | float16 | ND |

The AI processor used in this case has 24 cores in total, and the operator uses the Matmul high-level API in CUBE_ONLY mode. The tiling parameters are as follows:

Original shape: M = 16, N = 16, K = 1024.
Single-core shape:
- K-axis tiling of matrix data in multi-core parallel computation disabled: singleCoreM = 16, singleCoreN = 16, singleCoreK = 1024.
- K-axis tiling of matrix data in multi-core parallel computation enabled: singleCoreM = 16, singleCoreN = 16, singleCoreK = 512.
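As a rough illustration of what these single-core shapes imply, the following sketch estimates how many cores are needed to cover the original shape and how many bytes of A and B one core reads. The names `SingleCoreShape`, `coresUsed`, and `inputBytesPerCore` are hypothetical and not part of the Matmul tiling API. The estimate assumes float16 inputs (2 bytes per element) and ignores the output matrix.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the per-core shape chosen by the tiling policy.
struct SingleCoreShape {
    std::uint32_t m;
    std::uint32_t n;
    std::uint32_t k;
};

// Number of single-core blocks needed to cover an M x N x K problem.
std::uint32_t coresUsed(std::uint32_t M, std::uint32_t N, std::uint32_t K,
                        SingleCoreShape s) {
    assert(s.m > 0 && s.n > 0 && s.k > 0);
    auto ceilDiv = [](std::uint32_t a, std::uint32_t b) { return (a + b - 1) / b; };
    return ceilDiv(M, s.m) * ceilDiv(N, s.n) * ceilDiv(K, s.k);
}

// Bytes of A (m x k) and B (k x n) read by one core, assuming float16 inputs.
std::uint64_t inputBytesPerCore(SingleCoreShape s) {
    const std::uint64_t elemSize = 2;  // float16
    const std::uint64_t aElems = static_cast<std::uint64_t>(s.m) * s.k;
    const std::uint64_t bElems = static_cast<std::uint64_t>(s.k) * s.n;
    return (aElems + bElems) * elemSize;
}
```

For M = 16, N = 16, K = 1024, the shape with singleCoreK = 1024 gives 1 core and 65536 input bytes per core, while singleCoreK = 512 gives 2 cores and 32768 input bytes per core.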

Obtaining Profile Data

Use the msProf tool to obtain the Operator Simulation Pipeline and On-board Profiling data.

Analyzing Main Bottlenecks

The following figure shows the pipeline before optimization. Because K-axis tiling of matrix data in multi-core parallel computation is not enabled and the values of the M axis and N axis are very small, the original matrix data is not tiled, and all data is computed on a single core.
The following figure shows the profiling data before optimization. The operator is executed on only a single core. The aic_time is about 19.60 μs and the average aic_mte2_time is about 13.72 μs, so MTE2 accounts for about 70% of aic_time (13.72 / 19.60) and the aic_mte2_ratio is high.

Optimization Solution

After K-axis tiling of matrix data in multi-core parallel computation is enabled, the data in the K direction of the matrix can be tiled. As shown in the following figure, block R of matrix C is obtained by accumulating A1 × B1 + A2 × B2 + A3 × B3, where the partial products A1 × B1, A2 × B2, and A3 × B3 can be computed in parallel on multiple cores.

Figure 1 K-axis tiling of matrix data in multi-core parallel computation enabled
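The accumulation shown in Figure 1 can be checked with a plain host-side reference. The sketch below is not the Matmul high-level API: `accumulateSlice` and `splitKMatmul` are hypothetical helpers, matrices are stored row-major, and the K slices are processed one after another rather than on separate cores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Adds A[:, k0:k1] x B[k0:k1, :] into C. A is M x K, B is K x N, C is M x N.
// Hypothetical host-side helper; each call stands for the work of one core.
void accumulateSlice(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& c, std::size_t M, std::size_t N,
                     std::size_t K, std::size_t k0, std::size_t k1) {
    assert(a.size() == M * K && b.size() == K * N && c.size() == M * N);
    assert(k0 <= k1 && k1 <= K);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = k0; k < k1; ++k) {
                acc += a[i * K + k] * b[k * N + j];
            }
            c[i * N + j] += acc;  // accumulate the partial product into C
        }
    }
}

// Splits the K axis into `parts` slices and sums the partial products.
std::vector<float> splitKMatmul(const std::vector<float>& a, const std::vector<float>& b,
                                std::size_t M, std::size_t N, std::size_t K,
                                std::size_t parts) {
    assert(parts > 0 && K > 0);
    std::vector<float> c(M * N, 0.0f);  // C starts cleared
    const std::size_t step = (K + parts - 1) / parts;
    for (std::size_t k0 = 0; k0 < K; k0 += step) {
        const std::size_t k1 = std::min(k0 + step, K);
        accumulateSlice(a, b, c, M, N, K, k0, k1);
    }
    return c;
}
```

Calling `splitKMatmul` with `parts = 1` and with `parts = 3` on the same inputs gives the same matrix C when the values are exactly representable; with general floating-point data the two results can differ slightly because the summation order changes.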

Call the EnableMultiCoreSplitK API before the GetTiling API to enable K-axis tiling of matrix data in multi-core parallel computation. In the kernel implementation, clear the global memory of matrix C and then enable AtomicAdd. For the complete example, see the operator sample for multi-core K splitting. The procedure is as follows:

Tiling implementation

Before obtaining the TCubeTiling structure by calling the GetTiling API, call the EnableMultiCoreSplitK API with the input parameter set to true to enable K-axis tiling of matrix data in multi-core parallel computation.

         
cubeTiling.SetOrgShape(M, N, K);
cubeTiling.SetShape(M, N, K);
// isBias must be false: bias cannot be used when K-axis tiling is enabled.
cubeTiling.EnableBias(isBias);
cubeTiling.SetBufferSpace(-1, -1, -1);
// Enable K-axis tiling across cores; call this before GetTiling.
cubeTiling.EnableMultiCoreSplitK(true);
if (cubeTiling.GetTiling(tilingData) == -1) {
    std::cout << "Generate tiling failed." << std::endl;
    return {};
}

Kernel implementation

Call the Fill API to clear the global memory address of matrix C.

         
cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ cType*>(c), tiling.M * tiling.N);
// Clear the global memory of matrix C before any result is accumulated into it.
Fill(cGlobal, tiling.M * tiling.N, (cType)0);

Call the IterateAll API with the AtomicAdd operation enabled to complete the matrix multiplication.

         
// Enable AtomicAdd so that each slice result is accumulated into matrix C.
uint8_t enAtomic = 1;
matmulObj.IterateAll(cGlobal, enAtomic);

Verifying Optimization Benefits

The following figure shows the pipeline after optimization. After the K-axis tiling of matrix data in multi-core parallel computation is enabled, the K direction of the original matrix is split. The size of data processed by a single core in the K direction is reduced from 1024 to 512, and the MTE2 pipeline becomes shorter.
The following figure shows the profiling data after optimization. The operator is executed on two cores. The average aic_time is about 13.70 μs, roughly 30% lower than the 19.60 μs measured before optimization.

Summary

If an operator uses the Matmul API to perform Cube computation, the M and N directions of the original matrix cannot be effectively split, and the result is output to the global memory, you can enable K-axis tiling of matrix data in multi-core parallel computation to run the operator in parallel on multiple cores and improve computing efficiency.
