Enabling Multi-Core and K-Tile with Matmul High-Level APIs

Case Study

This case demonstrates how enabling multi-core and K-tile through the Matmul high-level API improves the performance of a matrix multiplication operator. To run the operator in parallel on multiple cores, the matrix data is tiled and the resulting blocks are allocated to different cores. Generally, only the M and N axes are tiled, and the K axis is not. If M and N are small, however, tiling them yields too few blocks to occupy multiple cores, and the K axis needs to be tiled instead. After multi-core and K-tile is enabled, the K axis can be tiled so that the operator runs in parallel on multiple cores. In addition, when the K axis is large, leaving it untiled usually forces a single core to load a large amount of input data. Tiling the K axis lets the tiling policy balance the input and output bandwidths more effectively.

  • Application scenarios of enabling multi-core and K-tile
    • If the K axis of the matrix is large, and the M and N axes are smaller than the K axis, the K axis can be tiled to increase the number of cores for parallel execution of the operator.
    • If the M, N, and K axes of the matrix are large, the K axis can be tiled to better balance the input and output bandwidths.
  • Restrictions on enabling multi-core and K-tile
    • In the scenario where multi-core and K-tile is enabled, the result of matrix C can be output only to the global memory.
    • In the scenario where multi-core and K-tile is enabled, the global memory must be cleared before the tiled result of matrix C is written to the global memory for the first time in the kernel code. When the tiled result of matrix C is obtained, AtomicAdd is enabled. If the global memory is not cleared in advance, the precision may be affected due to the accumulation of original invalid data in the global memory.
    • In the scenario where multi-core and K-tile is enabled, bias cannot be used in matrix multiplication.
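The clearing requirement above can be illustrated with a minimal host-side sketch in plain C++. It is not Ascend C code: AccumulatePartials and RunAccumulation are hypothetical helpers that only mimic accumulate-on-write behavior, on the assumption that each core adds its partial result to whatever the output buffer already holds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an Ascend C API): adds each core's partial result
// into the shared output buffer, mimicking accumulate-on-write behavior.
void AccumulatePartials(std::vector<float>& out,
                        const std::vector<std::vector<float>>& partials) {
    for (const auto& p : partials) {
        assert(p.size() == out.size());
        for (std::size_t i = 0; i < out.size(); ++i) {
            out[i] += p[i];
        }
    }
}

// Returns the accumulated output. `clearFirst` models zero-filling the
// buffer before the first partial result is written.
std::vector<float> RunAccumulation(std::vector<float> out,
                                   const std::vector<std::vector<float>>& partials,
                                   bool clearFirst) {
    if (clearFirst) {
        out.assign(out.size(), 0.0f);
    }
    AccumulatePartials(out, partials);
    return out;
}
```

If the buffer starts with stale values and is not cleared, those values end up in the final sum. With clearing, the result is just the sum of the partial results.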

The operator specifications are as follows.

Table 1 Operator specifications

Input    Shape       Data type    Format
a        16, 1024    float16      ND
b        1024, 16    float16      ND

The AI processor used in this case has 24 cores, and the Matmul high-level API is used in pure Cube mode. The tiling parameters are as follows:

  • Original shape: M = 16, N = 16, K = 1024.
  • Single-core shape: When multi-core and K-tile is disabled, singleCoreM = 16, singleCoreN = 16, and singleCoreK = 1024. When multi-core and K-tile is enabled, singleCoreM = 16, singleCoreN = 16, and singleCoreK = 512.
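As a rough sanity check on these parameters, the sketch below counts how many single-core blocks the original shape breaks into. It assumes one block per core and a simple ceiling-division model. BlocksNeeded is a hypothetical helper, not part of the tiling library, and the real tiling algorithm may take other factors into account.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for positive integers.
constexpr std::int64_t CeilDiv(std::int64_t a, std::int64_t b) {
    return (a + b - 1) / b;
}

// Hypothetical model (an assumption, not the tiling library's algorithm):
// the number of single-core blocks is the product of the per-axis block counts.
std::int64_t BlocksNeeded(std::int64_t m, std::int64_t n, std::int64_t k,
                          std::int64_t singleCoreM, std::int64_t singleCoreN,
                          std::int64_t singleCoreK) {
    assert(singleCoreM > 0 && singleCoreN > 0 && singleCoreK > 0);
    return CeilDiv(m, singleCoreM) * CeilDiv(n, singleCoreN) *
           CeilDiv(k, singleCoreK);
}
```

Under this model, the shapes in this case give one block when singleCoreK = 1024 and two blocks when singleCoreK = 512.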

Obtaining Profile Data

Use the msProf tool to obtain the operator simulation pipeline diagram and on-board profiling data.

Analyzing Main Bottlenecks

  • The following figure shows the pipeline diagram before optimization. Multi-core and K-tile is not enabled, and because M and N are small, the original matrix data is not tiled. All data is computed on a single core.
  • The following figure shows the profile data before optimization. The operator is executed on only a single core. The aic_time is about 19.60 μs, and the average aic_mte2_time is about 13.72 μs, so the MTE2 pipeline accounts for about 70% of the execution time and aic_mte2_ratio is high.

Optimization Solution

After multi-core and K-tile is enabled, the matrix data can be tiled along the K axis. As shown in the following figure, the R block of matrix C is obtained by accumulating A1*B1 + A2*B2 + A3*B3, and the partial products A1*B1, A2*B2, and A3*B3 can be computed in parallel on multiple cores.

Figure 1 Enabling multi-core and K-tile
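The accumulation of partial products can be sketched on the host in plain C++. This is a reference model under stated assumptions, not the Matmul API: AccumulateKSlice and SplitKMatmul are hypothetical names, matrices are stored row-major in float, and the K-slices are processed serially here, whereas on hardware each slice would be handled by its own core.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major storage

// Multiplies the K-slice [kBegin, kEnd) of A (M x K) by the matching rows of
// B (K x N) and adds the partial product into C (M x N).
void AccumulateKSlice(const Matrix& a, const Matrix& b, Matrix& c,
                      std::size_t M, std::size_t N, std::size_t K,
                      std::size_t kBegin, std::size_t kEnd) {
    assert(a.size() == M * K && b.size() == K * N && c.size() == M * N);
    assert(kBegin <= kEnd && kEnd <= K);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = kBegin; k < kEnd; ++k) {
                acc += a[i * K + k] * b[k * N + j];
            }
            c[i * N + j] += acc;  // accumulate into the shared output
        }
    }
}

// Split-K reference: start from a zeroed C, then accumulate one partial
// product per K-slice.
Matrix SplitKMatmul(const Matrix& a, const Matrix& b,
                    std::size_t M, std::size_t N, std::size_t K,
                    std::size_t numSlices) {
    assert(numSlices > 0);
    Matrix c(M * N, 0.0f);
    const std::size_t sliceK = (K + numSlices - 1) / numSlices;
    for (std::size_t kBegin = 0; kBegin < K; kBegin += sliceK) {
        const std::size_t kEnd = std::min(kBegin + sliceK, K);
        AccumulateKSlice(a, b, c, M, N, K, kBegin, kEnd);
    }
    return c;
}
```

Calling SplitKMatmul with numSlices = 1 gives the untiled product. For inputs whose sums are exactly representable in float, using more slices gives the same result, because the partial products simply add up over K.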

To enable multi-core and K-tile, call the EnableMultiCoreSplitK API before the GetTiling API. In the kernel implementation, clear the global memory of matrix C first, and then enable AtomicAdd. For a complete example, see the operator sample for multi-core K splitting. Perform the following steps:

  • Tiling Implementation
    Before obtaining the TCubeTiling structure by calling the GetTiling API, call the EnableMultiCoreSplitK API with the input parameter set to true to enable multi-core and K-tile.
    cubeTiling.SetOrgShape(M, N, K);
    cubeTiling.SetShape(M, N, K);
    // Bias cannot be used when multi-core and K-tile is enabled, so isBias must be false here.
    cubeTiling.EnableBias(isBias);
    cubeTiling.SetBufferSpace(-1, -1, -1);
    // Enable multi-core and K-tile; this call must precede GetTiling.
    cubeTiling.EnableMultiCoreSplitK(true);
    if (cubeTiling.GetTiling(tilingData) == -1) {
        std::cout << "Generate tiling failed." << std::endl;
        return {};
    }
    
  • Kernel Implementation
    Call the Fill API to clear the global memory that holds matrix C.
    cGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ cType*>(c), tiling.M * tiling.N);
    // clear gm
    Fill(cGlobal, tiling.M * tiling.N, (cType)0);
    
    Call the IterateAll API with AtomicAdd enabled to accumulate the partial results and complete the matrix multiplication.
    // set AtomicAdd
    uint8_t enAtomic = 1;
    matmulObj.IterateAll(cGlobal, enAtomic);
    

Verifying Optimization Benefits

  • The following figure shows the pipeline diagram after optimization. After multi-core and K-tile is enabled, the original matrix is tiled along the K axis. The amount of data processed by a single core along the K axis is halved, from 1024 to 512, and the MTE2 pipeline is shortened.

  • The following figure shows the profile data after optimization. The operator is now executed on two cores. The average aic_time is about 13.70 μs, down from 19.60 μs before optimization, a reduction of about 30%.

Summary

When the Matmul API is used for matrix computation, if the M and N axes of the original matrix cannot be effectively tiled and the result is output to the global memory, you can enable multi-core and K-tile to achieve multi-core parallelism and improve computing efficiency.