Tiling Policy Optimization for the Matmul Operator

Case Study

This case analyzes and optimizes the performance of the Matmul operator. The Matmul operator implements matrix multiplication, including the data movement pipeline, Cube computation pipeline,

The following uses the matrix dimensions M = 4096, N = 5120, K = 4096, half type input, float type out, and output format ND as an example to describe how to optimize the Matmul operator on the Atlas A2 training products / Atlas A2 inference products , including optimizing the core division logic, optimizing basic blocks, and enabling large packet transfer.

  • Core division logic: Enable as many Cube cores as possible to support parallel computing.
  • Optimize the basic blocks and select the optimal baseM, baseN, and baseK parameters, which are parameters in Matmul tiling.
  • Enable large packet transfer: When data is transferred from GM to L1, for matrix A, depthA1 basic blocks are transferred at a time, and the basic block size is baseM x baseK. For matrix B, depthB1 basic blocks are transferred at a time, and the basic block size is baseN x baseK. After large-block data transfer is enabled, the amount of data transferred at a time increases, thereby improving the MTE2 transfer efficiency.

Obtaining Profile Data

Use the msProf tool to obtain the profile data of the operator, focusing on the statuses of the MTE2, Cube, and Scalar pipelines.

Analyzing Main Bottlenecks

Figure 1 Profile data before optimization

According to the preceding profile data, the ratio of MTE2 time is high, implying that the performance bottleneck lies in MTE2 pipeline.

  • The cores are not fully divided for Block Dim of the profile data, so the core division logic needs to be optimized. Assume that CurrentCore is the number of Cube cores divided before optimization, and MaxCore is the maximum number of Cube cores. When all cores are enabled to compute the current shape data concurrently, the estimated performance gain is a multiple of MaxCore/CurrentCore.
  • Optimizing tiling basic blocks affects the data transfer efficiency. The total size of data transferred by the operator is the total data size of the left and right matrices. In the scenario where the K direction cannot be fully loaded during Matmul computation, according to the matrix multiplication algorithm, the number of times for transferring the left matrix is N/baseN, and the number of times for transferring the right matrix is M/baseM. That is, the total amount of transferred data is as follows: totalCnt = (N/baseN) x M x K + (M/baseM) x K x N. The estimated performance gain is the ratio of the amount of data transferred before optimization (totalCnt0) to that after optimization (totalCnt1). The result is as follows: (1/baseM0 + 1/baseN0)/(1/baseM1 + 1/baseN1), where baseM0 and baseN0 are the basic block parameters before optimization, and baseM1 and baseN1 are basic block parameters after optimization.
  • After large-block data transfer is enabled, factors such as the number of instructions and address alignment affect the performance. According to experience, in scenarios where MTE2 is the performance bottleneck, the MTE2 performance gain is greater than 20%.

Developing Optimization Solutions

  • Optimization 1: Optimizing the core division logic

    According to the profile data, the number of divided cores is 4. Starting more cores for computing at the same time can improve the computing parallelism. On the AI processor used in the current case, there are 20 cores in total, and each core contains one Cube core and two Vector cores. In the NPU calling program, set blockDim to the actual number of used cores (20).

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    // Code snippet:
    uint32_t blockDim = 20; // blockDim is 4 before optimization.
    CHECK_ACL(aclInit(nullptr));
    int32_t deviceId = 0;
    CHECK_ACL(aclrtSetDevice(deviceId));
    aclrtStream stream = nullptr;
    CHECK_ACL(aclrtCreateStream(&stream));
    
    uint8_t *aHost;
    uint8_t *aDevice;
    CHECK_ACL(aclrtMallocHost((void **)(&aHost), aFileSize));
    CHECK_ACL(
      aclrtMalloc((void **)&aDevice, aFileSize, ACL_MEM_MALLOC_HUGE_FIRST));
    ReadFile("./input/x1_gm.bin", aFileSize, aHost, aFileSize);
    // PrintData(aHost, 16, printDataType::HALF);
    CHECK_ACL(aclrtMemcpy(aDevice, aFileSize, aHost, aFileSize,
    					ACL_MEMCPY_HOST_TO_DEVICE));
    
    uint8_t *bHost;
    uint8_t *bDevice;
    CHECK_ACL(aclrtMallocHost((void **)(&bHost), bFileSize));
    CHECK_ACL(
      aclrtMalloc((void **)&bDevice, bFileSize, ACL_MEM_MALLOC_HUGE_FIRST));
    ReadFile("./input/x2_gm.bin", bFileSize, bHost, bFileSize);
    // PrintData(bHost, 16, printDataType::HALF);
    CHECK_ACL(aclrtMemcpy(bDevice, bFileSize, bHost, bFileSize,
    					ACL_MEMCPY_HOST_TO_DEVICE));
    
    uint8_t *workspaceHost;
    uint8_t *workspaceDevice;
    CHECK_ACL(aclrtMallocHost((void **)(&workspaceHost), workspaceSize));
    CHECK_ACL(aclrtMalloc((void **)&workspaceDevice, workspaceSize,
    					ACL_MEM_MALLOC_HUGE_FIRST));
    
    uint8_t *tilingHost;
    uint8_t *tilingDevice;
    CHECK_ACL(aclrtMallocHost((void **)(&tilingHost), tilingFileSize));
    CHECK_ACL(aclrtMalloc((void **)&tilingDevice, tilingFileSize,
    					ACL_MEM_MALLOC_HUGE_FIRST));
    CHECK_ACL(aclrtMemcpy(tilingHost, tilingFileSize, GenerateTiling(),
    					tilingFileSize, ACL_MEMCPY_HOST_TO_HOST));
    // PrintData(tilingHost, 16, printDataType::UINT32_T);
    CHECK_ACL(aclrtMemcpy(tilingDevice, tilingFileSize, tilingHost,
    					tilingFileSize, ACL_MEMCPY_HOST_TO_DEVICE));
    
    uint8_t *cHost;
    uint8_t *cDevice;
    CHECK_ACL(aclrtMallocHost((void **)(&cHost), cFileSize));
    CHECK_ACL(
      aclrtMalloc((void **)&cDevice, cFileSize, ACL_MEM_MALLOC_HUGE_FIRST));
    
    // ACLRT_LAUNCH_KERNEL(matmul_custom)
    // (blockDim, stream, aDevice, bDevice, cDevice, workspaceDevice, tilingDevice);
    matmul_custom_do(blockDim, stream, aDevice, bDevice, cDevice, workspaceDevice, tilingDevice);
    

    Matmul APIs are initiated from the Vector side. On the AI processor used in this case, the ratio of Cube Cores to Vector Cores is 1:2. Therefore, Matmul tiling must be performed based on 2 x blockDim, that is, the number of Vector Cores. The actual number of running cores set in the NPU calling program is 20. Therefore, in the tiling code, set data tiling based on 40 cores as follows:

    1
    2
    3
    4
    5
    6
    7
    int usedCoreNum = 40; // usedCoreNum is 8 before optimization.
    int runMode = 1;
    int32_t baseM = 64; // 64
    int32_t baseN = 64; // 64
    optiling::TCubeTiling tilingData;
    MultiCoreMatmulTiling tilingApi;
    tilingApi.SetDim(usedCoreNum);
    
    Figure 2 Profile data after the core division logic is optimized

    After the code is modified, the operator execution time decreases from 12045 μs to 2532 μs, which is approximately equal to (20 cores/4 cores) = 5, meaning five-fold performance improvement.

  • Optimization 2: Optimizing basic blocks

    The basic blocks configured in the current tiling are [baseM, baseN, baseK] = [64, 64, 256]. They have fewer Cube computing cycles and a lower compute-access ratio (that is, the ratio of the computing amount to the required data amount). The basic blocks for which the Matmul result is transferred to the GM are 64 x 64. Because the output format is ND and the data type is float, the start address for transferring the next Matmul result needs to be offset by one baseN, that is, 64 x 4 = 256 bytes. As a result, the GM address is not 512-byte aligned when the data is transferred out of fixpipe, and better basic blocks need to be set.

    If the shape is large, the compute-access ratio should be the largest according to the principle for selecting basic blocks, that is, when the Cube computation amount is the largest, the memory access data is the smallest. When the input type is fp16, the Cube unit can compute 16 x 16 x 16 numbers within one cycle. According to experience, [baseM, baseN, baseK] = [128, 256, 64] or [128, 128, 128]. These two tiling solutions meet the requirement of GM address 512-byte alignment during transfer-out (the address offsets are 256 x 4 bytes and 128 x 4 bytes respectively when the Matmul result is transferred out each time). They have the same number of cycles calculated by the Cube: (128 x 64 x 256)/(16 x 16 x 16) = (128 x 128 x 128)/(16 x 16 x 16) = 512 cycles. For [baseM, baseN, baseK] = [128, 256, 64], the compute-access ratio is 512 cycles/(128 x 64 x 2 + 256 x 64 x 2) = 512 cycles/48 KB. For [baseM, baseN, baseK] = [128, 128, 128], the compute-access ratio is 512 cycles/(128 x 128 x 2 + 128 x 128 x 2) = 512 cycles/64 KB. It can be learned that the [128, 256, 64] solution has a higher compute-access ratio and higher computing density. A smallest data size is required for the same computation amount, which maximizes the computation efficiency of Cube unit.

    Modify the tiling code and use the SetFixSplit() API to set baseM and baseN. The tiling function automatically computes the optimal baseK. Here, 64 is obtained.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    int32_t baseM = 128; // baseM is 64 before optimization.
    int32_t baseN = 256; // baseN 64 before optimization.
    
    optiling::TCubeTiling tilingData;
    MultiCoreMatmulTiling tilingApi;
    tilingApi.SetDim(usedCoreNum);
    tilingApi.SetAType(leftPos, leftFormat, leftDtype, bool(transposeA));
    tilingApi.SetBType(rightPos, rightFormat, rightDtype, bool(transposeB));
    tilingApi.SetCType(resPos, resFormat, resDtype);
    tilingApi.SetBiasType(biasPos, biasFormat, biasDtype);
    
    tilingApi.SetOrgShape(M, N, K);
    tilingApi.SetShape(M, N, K);
    tilingApi.SetFixSplit(baseM, baseN, -1);
    

    After this group of basic blocks is enabled, the time consumption of MTE2 (corresponding to aic_mte2_time) is reduced from 2452 μs to 808 μs, and the MTE2 performance is improved by three times.

    Figure 3 Profile data after basic blocks are optimized
  • Optimization 3: Enabling large packet transfer

    The current bandwidth utilization is totalSize/mte2Time = totalCnt x dtype/mte2Time, which is 2491 GB/s. When large packet transfer is disabled, only one basic block is transferred from the GM to L1 at a time. This function can be enabled using template parameters to transfer multiple basic blocks at a time, improving the bandwidth utilization of MTE2.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
     // Definition of the original matmul object:
      Matmul<AscendC::MatmulType<TPosition::GM, CubeFormat::ND, A_T>,
             AscendC::MatmulType<TPosition::GM, CubeFormat::ND, B_T>,
             AscendC::MatmulType<TPosition::GM, CubeFormat::ND, C_T>,
             AscendC::MatmulType<TPosition::GM, CubeFormat::ND, BiasT>>>
          mm;
     // Add the CFG_MDL parameter to the template parameter of the matmul object to enable large packet transfer.
      Matmul<AscendC::MatmulType<TPosition::GM, CubeFormat::ND, A_T>,
             AscendC::MatmulType<TPosition::GM, CubeFormat::ND, B_T>,
             AscendC::MatmulType<TPosition::GM, CubeFormat::ND, C_T>,
             AscendC::MatmulType<TPosition::GM, CubeFormat::ND, BiasT>, CFG_MDL>>
          mm;
    

    According to the figure below, after large packet transfer is enabled, the MTE2 time is reduced from 808 μs to 591 μs, and the bandwidth is 3406 GB/s, with utilization improved by more than 36%. The Cube utilization reaches more than 80%.

    Figure 4 Profile data after large packet transfer is enabled

Verifying Optimization Benefits

  • The core division logic is optimized, with 4.75x actual benefit, which is approximately equal to (20 cores/4 cores) = 5 times. The benefits are deemed the same in view of the core startup overhead.
  • The basic blocks are optimized, with 3x actual benefit, which is approximately equal to the theoretical benefit of (1/64 + 1/64)/(1/128 + 1/256) ≈ 2.7 times. The benefits are thought the same considering the impact of cache.
  • The actual gain of large packet transfer is more than 25%, which is consistent with the empirical value.

Summary

Optimizations 1 and 2 require a large enough shape and data size to fully divide cores and enable optimal basic blocks. When the shape is large, you can refer to the optimization methods in this case for the MTE2 Bound operator.