Matmul High-level API Enabling L2 Cache Tiling

Case Study

This case shows how L2 cache data tiling improves operator performance when the total size of the input and output data of a Matmul computation exceeds the L2 cache size. For the complete example of enabling L2 cache data tiling, see operator sample for L2 cache splitting.

In this case, the AI Processor has a 192 MB L2 cache whose pure read bandwidth is about three to four times that of global memory (GM). For the same amount of data, moving data in or out is therefore much faster when the access hits the L2 cache than when it goes to GM. If an access misses the L2 cache, that is, the requested data is not in the L2 cache, the read or write must go to GM. Bandwidth utilization then drops, and data movement into and out of the operator becomes the performance bottleneck of the whole run.
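As a rough illustration of this gap, consider a simplified model (an assumption for illustration, not something measured in this case): let h be the fraction of the moved data that hits the L2 cache, B_GM the GM bandwidth, r the L2-to-GM bandwidth ratio (about 3 to 4 here), and D the amount of data moved. Ignoring latency and pipeline overlap, the movement time and the gain over an all-miss run are approximately:

```latex
T(h) \approx \frac{h\,D}{r\,B_{\mathrm{GM}}} + \frac{(1-h)\,D}{B_{\mathrm{GM}}},
\qquad
\frac{T(0)}{T(h)} = \frac{1}{1 - h\left(1 - \tfrac{1}{r}\right)}
```

Under this model, r = 3 and h = 0.5 give a ratio of 1.5, and the ratio approaches r only as h approaches 1, which is why keeping the working set of each computation inside the L2 cache matters.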

Application scenarios of enabling L2 cache data tiling
The total size of the input and output data exceeds the L2 cache size.

The operator specifications are as follows.

**Table 1** Operator specifications
Input	Shape	Data type	Format
a	30720, 1024	float16	ND
b	4096, 1024	float16	ND
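A quick footprint estimate shows why this shape qualifies. The figures below rest on two assumptions that Table 1 does not state: that a is M × K and b is stored as N × K (so M = 30720, K = 1024, N = 4096), and that the output c is also float16 at 2 bytes per element.

```latex
\begin{aligned}
|a| &= 30720 \times 1024 \times 2\,\mathrm{B} = 60\,\mathrm{MiB} \\
|b| &= 4096 \times 1024 \times 2\,\mathrm{B} = 8\,\mathrm{MiB} \\
|c| &= 30720 \times 4096 \times 2\,\mathrm{B} = 240\,\mathrm{MiB} \\
|a| + |b| + |c| &= 308\,\mathrm{MiB} > 192\,\mathrm{MB}
\end{aligned}
```

Under these assumptions the two inputs alone would fit in the L2 cache; it is the output that pushes the total past the cache size.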

Obtaining Profile Data

Use the msProf tool to obtain the Operator Simulation Pipeline and On-board Profiling data. The L2 cache tiling function mainly uses the L2 cache with higher bandwidth to reduce the MTE2 data movement overhead. Therefore, the focus is on MTE2 pipeline analysis.

Analyzing Main Bottlenecks

This case builds on the full static tiling optimization. For details about full static tiling, see the case in Matmul High-level API Enabling Full Static Tiling. In the profiling data before optimization, aic_time (column C) is 867 μs and aic_mte2_time (column K) is 861.9 μs. MTE2 time accounts for about 99% of aic_time, indicating that MTE2 data movement is the bottleneck of the operator performance.

Optimization Solution

Optimization 1: Adjusting the block size and number of computation times
- Before the optimization, the input data is not tiled, and all cores compute all data at a time. As shown in the following figure, the numbers indicate the core IDs. All data in matrices A and B is computed by 24 cores at a time.
- After the optimization, the input data is tiled multiple times, and all cores compute data multiple times. Each core performs computation based only on the data size after tiling. The L2 cache tiling solution ensures that the data for a single computation is stored in the L2 cache, improving the efficiency of input data movement.
Figure 1 Optimization 1
Optimization 2: Selecting an L2 cache tiling solution with low smearing
According to the principle in Inter-core Load Balancing, the number of physical cores of the AI processor is fixed. After L2 cache tiling is applied, some cores may experience computation smearing: the number of data blocks in one computation (the total workload divided by the amount of data each core processes at a time) is not evenly divisible by the number of cores. As a result, a few tail cores must process the remaining data at the end of each computation while the other cores sit idle, which degrades operator performance. In the following figure, the data blocks highlighted in yellow are the tail data blocks. In the solution on the left, due to the smearing effect, cores 0, 1, 2, and 3 perform an additional operation to process the remaining data in each computation. To balance the load globally, the position of the tail cores is adjusted, as shown in the solution on the right: after all computations are completed, cores 0 to 7 each perform one additional data block computation.

In actual scenarios, the smaller the computation tail is, the better, provided that the size of data after tiling is less than the size of the L2 cache. Based on this principle, the number of L2 cache blocks can be determined.

Figure 2 Optimization 2
Optimization 3: Misaligned core allocation to reduce same-address conflicts between the left and right matrices
Same-address conflict: When multiple cores concurrently execute Matmul computation, if multiple cores access the same address of the input matrix at the same time, an address conflict occurs, affecting the performance.

In the M and N directions, the matrix data is tiled into large data blocks based on the size of the L2 cache, and then cores are allocated in a staggered manner between the data blocks. That is, each data block is allocated to different cores for processing in sequence along the diagonal, effectively reducing same-address conflicts. For example, when processing the tail blocks 0, 1, 2, and 3 in the same row, if the cores are allocated in sequence, multiple cores will read the left matrix data in the same row at the same time, resulting in a read-read conflict. If the cores are allocated in diagonal order, the tail block data on the diagonal is allocated to cores 0, 1, 2, and 3 for computation. In this way, multiple cores access the left matrix data in different rows, reducing the number of same-address conflicts.

Figure 3 Optimization 3

For details about the complete example of enabling L2 cache tiling using the Matmul API, see operator sample for L2 cache splitting. The key steps to implement L2 cache tiling are as follows:

Determine whether L2 cache tiling is required. If the total data size exceeds the preset L2 cache size, calculate the number of L2 cache tiles.

bool smallDim = mTileNum_ < L1_MIN_UST_DIM && nTileNum_ < L1_MIN_UST_DIM; 
if (smallDim || (!EnableL2Tile())) { // Check whether the total data size is less than the L2 cache threshold.
    mL2TileNum_ = mTileNum_;
    nL2TileNum_ = nTileNum_;
    mL2BlockNum_ = 1;
    nL2BlockNum_ = 1;
    return; // No tiling is required; return early.
} 
InitL2TileTail(); // Compute the number of L2 cache tiles.

Based on the load balancing principle, compute the number of L2 cache tiles. The number of L2 cache tiles in the m direction is mL2TileNum_, and that of L2 cache tiles in the n direction is nL2TileNum_.

int64_t mConflict = INT64_MAX;
int64_t nConflict = INT64_MAX;
constexpr bool isNMajor = l1N > l1M; // Determine the main dimension based on the shape size.
for (int64_t i = maxMajor; i >= L1_MIN_UST_DIM; i--) {
    for (int64_t j = maxMinor; j >= minMinor; j--) {
        if (GetTotalSize(j * l1M, i * l1N, k_) <= L2_TILE_THRESHOLD) { // Ensure that the block size does not exceed the L2 cache threshold.
            uint64_t mConflictTmp = AscendC::Ceil(blockNum_, mL2TileNumTailTmp); // Compute the load conflict value.
            uint64_t nConflictTmp = AscendC::Ceil(blockNum_, nL2TileNumTailTmp);
            if (mConflict >= mConflictTmp && nConflict >= nConflictTmp) { // If neither conflict value increases, update the number of blocks.
                mConflict = mConflictTmp;
                nConflict = nConflictTmp;
                mL2TileNum_ = curMajorDim;
                nL2TileNum_ = curMinorDim;
            }
        }
    }
}

Misaligned core allocation. Pass in the index of the current data block to obtain the coordinates of the block that is assigned along the diagonal.

__aicore__ inline BlockCoord GetBlockCoord(int64_t tileIdx) {
    GetCommonTileIndex(tileIdx);
    int64_t mTileIdx = newBlockIdx_ % mL2TileNumTmp_;
    mTileIdx = mTileIdx + mL2Idx_ * mL2TileNum_;
    int64_t nTileIdx = 0;
    if (mL2TileNumTmp_ != 0 && nL2TileNumTmp_ != 0) {
        // Shift the N index by one after every LCM(m, n) blocks so that consecutive blocks lie on a diagonal.
        int64_t tmp = newBlockIdx_ / CalcLcm(mL2TileNumTmp_, nL2TileNumTmp_);
        nTileIdx = (newBlockIdx_ + tmp) % nL2TileNumTmp_;
    }
    nTileIdx = nTileIdx + nL2Idx_ * nL2TileNum_;
    return {mTileIdx * l1M, nTileIdx * l1N, 0};
}

Set the left and right matrices, and perform Matmul computations iteratively based on the L2 cache tile count and execution core subscript computed in the previous steps.

L2CacheOpt l2Opt(shapes, blockNum);
matmulObj.SetOrgShape(shapes.m, shapes.n, shapes.k);
for (int64_t tileIdx = curBlockIdx; tileIdx < l2Opt.GetTileNum(); tileIdx += blockNum) {
    auto blockShape = l2Opt.GetBlockShape(tileIdx); // Obtain the data block size to be loaded into the L2 cache for each computation.
    if (Get<0>(blockShape) <= 0 || Get<1>(blockShape) <= 0) {
        return;
    }
    // Obtain the coordinates (blockCoord) of the data block assigned to the current core.
    auto blockCoord = l2Opt.GetBlockCoord(tileIdx);
    matmulObj.SetTail(Get<0>(blockShape), Get<1>(blockShape), Get<2>(blockShape));
    const auto& offsetCoord = CalcOffset(shapes, blockCoord); // Calculate the matrix offsets based on the coordinates.
    int64_t offsetA = Get<0>(offsetCoord);
    int64_t offsetB = Get<1>(offsetCoord);
    int64_t offsetC = Get<2>(offsetCoord);
    matmulObj.SetTensorA(aGlobal[offsetA], false);
    matmulObj.SetTensorB(bGlobal[offsetB], false);
    if (shapes.isBias) {
        matmulObj.SetBias(biasGlobal);
    }
    matmulObj.IterateAll(cGlobal[offsetC]); // Compute the current data block of the L2 cache tiling.
}
matmulObj.End();

Verifying Optimization Benefits

The following figure shows the profiling data after optimization. The value of aic_time in column C is 805.6 μs. Compared with the data before optimization, the total execution time is reduced by about 7.1%, and the MTE2 movement time is reduced by about 10.7%.
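The total-time figure can be checked from the two aic_time values. The post-optimization MTE2 time is not quoted in the text, so the second line below is only the value implied by the stated 10.7% reduction, not a measured number.

```latex
\begin{aligned}
\frac{867 - 805.6}{867} &= \frac{61.4}{867} \approx 0.0708 \approx 7.1\% \\
861.9 \times (1 - 0.107) &\approx 769.7\ \mu\mathrm{s}
\end{aligned}
```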

Summary

If the data size of a Matmul computation exceeds the L2 cache size, you can enable L2 cache tiling to improve the L2 cache hit ratio and exploit the higher bandwidth of the L2 cache to improve operator performance.
