Enabling L2 Cache Splitting Using the Matmul High-Level API
Case Study
This case demonstrates the effect of L2 cache data splitting on operator performance when the total amount of input and output data of a Matmul computation exceeds the L2 cache size. For the complete example, see the operator sample for L2 cache splitting.
In this case, the L2 cache of the AI Processor is 192 MB, and its pure read bandwidth is about three to four times that of global memory (GM). For the same amount of data moved in or out, accessing the L2 cache is therefore much faster than accessing GM. If the requested data is not in the L2 cache, it must be read from or written to GM at the lower bandwidth. Data movement then becomes the performance bottleneck of the whole operator.
- Application scenarios of enabling L2 cache splitting
The total amount of input and output data exceeds the L2 cache size.
The operator specifications are as follows.
| Input | Shape | Data type | Format |
|---|---|---|---|
| a | 30720, 1024 | float16 | ND |
| b | 4096, 1024 | float16 | ND |
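The scenario condition above can be checked with a little arithmetic. The following is a minimal sketch, not part of the operator sample. It assumes M = 30720, K = 1024, and N = 4096 (read from the shapes in the table), assumes the output is also float16, and takes 192 MB as 192 × 2^20 bytes. The names `MatmulFootprintBytes`, `kFp16Bytes`, and `kL2CacheBytes` are hypothetical.

```cpp
#include <cstdint>

// Bytes per float16 element.
constexpr std::int64_t kFp16Bytes = 2;
// L2 cache size from this case study (192 MB, taken here as 192 * 2^20 bytes).
constexpr std::int64_t kL2CacheBytes = 192LL * 1024 * 1024;

// Total bytes read and written by C = A * B, where A holds m * k elements,
// B holds k * n elements, and C holds m * n elements.
constexpr std::int64_t MatmulFootprintBytes(std::int64_t m, std::int64_t n,
                                            std::int64_t k, std::int64_t elemBytes)
{
    return (m * k + k * n + m * n) * elemBytes;
}

// Assumption: M = 30720, N = 4096, K = 1024, float16 output.
constexpr std::int64_t kTotalBytes = MatmulFootprintBytes(30720, 4096, 1024, kFp16Bytes);

// 60 MiB (a) + 8 MiB (b) + 240 MiB (c) = 308 MiB.
static_assert(kTotalBytes == 322961408, "unexpected footprint");
static_assert(kTotalBytes > kL2CacheBytes, "working set should exceed the L2 cache");
```

Under these assumptions the working set is about 308 MiB, well above the 192 MB L2 cache, so this shape falls into the scenario where splitting applies.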
Obtaining Profile Data
Use the msProf tool to obtain the operator simulation pipeline diagram and on-board profiling data. The L2 cache splitting function mainly uses the L2 cache with higher bandwidth to reduce the data transfer overhead of MTE2. Therefore, the focus is on analyzing the MTE2 pipeline.
Analyzing Main Bottlenecks
This case applies further optimization on top of full tiling constant quantization. For details, see the Enabling Full Constant Quantization for Matmul Tiling Using High-Level APIs case. The following figure shows the profiling data before the optimization. The value of aic_time in column C is 867 μs, and the value of aic_mte2_time in column K is 861.9 μs. MTE2 therefore accounts for about 99% of the total time, which makes MTE2 data transfer the performance bottleneck of the operator.

Optimization Solution
- Optimization 1: Adjusting the block size and the number of computation rounds
- Before the optimization, the input data is not split, and all cores process the whole problem in a single pass. As shown in the following figure, the numbers indicate the core IDs. All data in matrices A and B is computed by the 24 cores in one pass.
- After the optimization, the input data is split into multiple blocks, and all cores process the blocks over multiple rounds. In each round, a core computes only on the data of the current block. The L2 cache splitting solution ensures that the data computed in one round fits in the L2 cache, which improves the efficiency of transferring input data.
Figure 1 Optimization 1
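The sizing idea behind Optimization 1 can be sketched as follows. This is an illustrative model only: it splits along M alone and uses a hypothetical byte budget, whereas the operator sample splits in both the M and N directions and uses its own threshold. The function names are hypothetical.

```cpp
#include <cstdint>

constexpr std::int64_t CeilDiv(std::int64_t a, std::int64_t b)
{
    return (a + b - 1) / b;
}

// Bytes touched when computing one block of mBlock x nBlock outputs over the full K.
constexpr std::int64_t BlockFootprintBytes(std::int64_t mBlock, std::int64_t nBlock,
                                           std::int64_t k, std::int64_t elemBytes)
{
    return (mBlock * k + k * nBlock + mBlock * nBlock) * elemBytes;
}

// Smallest number of equal splits along M such that one block fits in budgetBytes.
// Returns -1 if even a single-row block does not fit.
constexpr std::int64_t MinSplitsAlongM(std::int64_t m, std::int64_t n, std::int64_t k,
                                       std::int64_t elemBytes, std::int64_t budgetBytes)
{
    for (std::int64_t splits = 1; splits <= m; ++splits) {
        if (BlockFootprintBytes(CeilDiv(m, splits), n, k, elemBytes) <= budgetBytes) {
            return splits;
        }
    }
    return -1;
}

// Assumption: M = 30720, N = 4096, K = 1024, float16, budget = 192 * 2^20 bytes.
static_assert(MinSplitsAlongM(30720, 4096, 1024, 2, 192LL * 1024 * 1024) == 2,
              "two M-blocks of 15360 rows fit in the assumed budget");
```

With the shapes of this case and a budget equal to the full L2 cache, two blocks along M are enough. A tighter budget (for example, to leave headroom in the cache) produces more blocks.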
- Optimization 2: Selecting an L2 cache splitting solution with a smaller tail effect
As described in Inter-core Load Balancing, the number of physical cores of the AI Processor is fixed. After the data is tiled for the L2 cache, some cores may be left with a computation tail: the number of blocks to compute (the total computation volume divided by the volume each core processes at a time) is not evenly divisible by the number of cores, so the remaining blocks must be computed by a few tail cores at the end of each computation round. While the tail cores are working, the other cores sit idle, which degrades the overall performance of the operator. The yellow blocks in the following figure are the tail blocks. In the left solution, cores 0, 1, 2, and 3 perform one extra computation for the remaining data in every round. To balance the load globally, the positions of the tail cores are adjusted, as shown in the right solution, so that over the whole run cores 0 to 7 each perform one extra computation.
In practice, the smaller the computation tail the better, provided that the data volume of each tile is less than the L2 cache size. The number of L2 cache tiles is chosen on this principle.
Figure 2 Optimization 2
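One way to read the two solutions in the figure is as a simple round-counting model. The sketch below is an assumption-laden illustration, not the sample's algorithm: it uses 24 cores (as in this case) and a hypothetical workload of 2 L2 blocks with 28 tiles each, chosen so that each block leaves 4 tail tiles. All function names are hypothetical.

```cpp
#include <cstdint>

constexpr std::int64_t CeilDiv(std::int64_t a, std::int64_t b)
{
    return (a + b - 1) / b;
}

// Rounds needed when every L2 block is finished before the next one starts,
// so each block pays for its own tail round.
constexpr std::int64_t RoundsPerBlockTail(std::int64_t tilesPerL2Block,
                                          std::int64_t l2Blocks, std::int64_t cores)
{
    return l2Blocks * CeilDiv(tilesPerL2Block, cores);
}

// Rounds needed when the tail tiles of all L2 blocks are spread over different cores,
// so the tails are paid for only once over the whole run.
constexpr std::int64_t RoundsPooledTail(std::int64_t tilesPerL2Block,
                                        std::int64_t l2Blocks, std::int64_t cores)
{
    return CeilDiv(tilesPerL2Block * l2Blocks, cores);
}

// Core slots that stay idle for a given number of rounds.
constexpr std::int64_t IdleSlots(std::int64_t rounds, std::int64_t totalTiles,
                                 std::int64_t cores)
{
    return rounds * cores - totalTiles;
}

// Hypothetical workload: 2 L2 blocks x 28 tiles on 24 cores.
static_assert(RoundsPerBlockTail(28, 2, 24) == 4, "cores 0-3 run an extra tile per block");
static_assert(RoundsPooledTail(28, 2, 24) == 3, "cores 0-7 run one extra tile in total");
static_assert(IdleSlots(4, 56, 24) == 40, "idle slots with per-block tails");
static_assert(IdleSlots(3, 56, 24) == 16, "idle slots with pooled tails");
```

In this model, spreading the tail tiles across cores 0 to 7 saves one full round and cuts the idle core slots from 40 to 16.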
- Optimization 3: Staggered core allocation to reduce the conflict of accessing the same address by the left and right matrices
Address conflict: When multiple cores concurrently execute Matmul computation, if multiple cores access the same address of the input matrix at the same time, an address conflict occurs, affecting the performance.
In the M and N directions, the matrix data in the L2 cache is tiled into large data blocks, and then cores are staggered between data blocks. That is, each data block is allocated to different cores for processing in sequence along the diagonal, effectively reducing the conflict of accessing the same address. For example, when processing the tail blocks 0, 1, 2, and 3 in the same row, if the cores are allocated in sequence, multiple cores will read the left matrix data in the same row at the same time, resulting in a read conflict. If the cores are allocated in diagonal order, the tail blocks on the diagonal are allocated to cores 0, 1, 2, and 3 for computation. In this way, the cores access the left matrix data in different rows, reducing the number of conflicts of accessing the same address.
Figure 3 Optimization 3
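The difference between sequential and diagonal allocation can be sketched on a small tile grid. The code below is a host-side illustration under stated assumptions: tile indices are handed to cores in increasing order, the grid has `rows` tiles along M and `cols` tiles along N, and the diagonal formula is modeled on the index arithmetic of the sample's GetBlockCoord without its L2 block offsets. `TileCoord`, `SequentialCoord`, `DiagonalCoord`, and `CoversEveryTileOnce` are hypothetical names.

```cpp
#include <cstdint>

struct TileCoord {
    std::int64_t row; // tile index along M (selects rows of the left matrix)
    std::int64_t col; // tile index along N (selects columns of the right matrix)
};

constexpr std::int64_t Gcd(std::int64_t a, std::int64_t b)
{
    while (b != 0) {
        std::int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

constexpr std::int64_t Lcm(std::int64_t a, std::int64_t b)
{
    return a / Gcd(a, b) * b;
}

// Row-major order: consecutive indices stay in the same tile row.
// Requires cols > 0.
constexpr TileCoord SequentialCoord(std::int64_t idx, std::int64_t cols)
{
    return {idx / cols, idx % cols};
}

// Diagonal order: consecutive indices move to a new row and a new column.
// The shift by idx / lcm(rows, cols) starts a fresh diagonal once one wraps around.
// Requires rows > 0 and cols > 0.
constexpr TileCoord DiagonalCoord(std::int64_t idx, std::int64_t rows, std::int64_t cols)
{
    return {idx % rows, (idx + idx / Lcm(rows, cols)) % cols};
}

// Checks that indices 0 .. rows*cols-1 visit every tile exactly once (grids up to 64 tiles).
constexpr bool CoversEveryTileOnce(std::int64_t rows, std::int64_t cols)
{
    if (rows <= 0 || cols <= 0 || rows * cols > 64) {
        return false;
    }
    bool seen[64] = {};
    for (std::int64_t idx = 0; idx < rows * cols; ++idx) {
        TileCoord c = DiagonalCoord(idx, rows, cols);
        std::int64_t slot = c.row * cols + c.col;
        if (seen[slot]) {
            return false;
        }
        seen[slot] = true;
    }
    return true;
}

// On a 4 x 4 grid, tiles 0..3 share row 0 in sequential order but sit on the diagonal otherwise.
static_assert(SequentialCoord(3, 4).row == 0, "sequential: same row");
static_assert(DiagonalCoord(3, 4, 4).row == 3 && DiagonalCoord(3, 4, 4).col == 3, "diagonal");
static_assert(CoversEveryTileOnce(4, 4), "diagonal order still covers the whole grid");
```

With sequential allocation, the first four cores all read the same rows of the left matrix. With diagonal allocation they read four different row blocks and four different column blocks, which is the conflict reduction described above.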
For details about the complete example of enabling L2 cache tiling using the Matmul API, see operator sample for L2 cache splitting. The key steps to implement L2 cache tiling are as follows:
- Determine whether L2 cache tiling is required. If the total data volume exceeds the preset L2 cache size, calculate the number of L2 cache tiles.
```cpp
bool smallDim = mTileNum_ < L1_MIN_UST_DIM && nTileNum_ < L1_MIN_UST_DIM;
if (smallDim || (!EnableL2Tile())) { // Check whether the total data volume is less than the L2 cache threshold.
    mL2TileNum_ = mTileNum_;
    nL2TileNum_ = nTileNum_;
    mL2BlockNum_ = 1;
    nL2BlockNum_ = 1;
    return; // Splitting is not required. Return in advance.
}
InitL2TileTail(); // Calculate the L2 tile.
```
- Based on the load balancing principle, calculate the number of L2 cache tiles in the m direction (mL2TileNum_) and the number of L2 cache tiles in the n direction (nL2TileNum_).
```cpp
int64_t mConflict = INT64_MAX;
int64_t nConflict = INT64_MAX;
constexpr bool isNMajor = l1N > l1M; // Determine the main dimension based on the shape size.
for (int64_t i = maxMajor; i >= L1_MIN_UST_DIM; i--) {
    for (int64_t j = maxMinor; j >= minMinor; j--) {
        if (GetTotalSize(j * l1M, i * l1N, k_) <= L2_TILE_THRESHOLD) { // Ensure that the block size is less than the L2 cache threshold.
            uint64_t mConflictTmp = AscendC::Ceil(blockNum_, mL2TileNumTailTmp); // Calculate the load conflict value.
            uint64_t nConflictTmp = AscendC::Ceil(blockNum_, nL2TileNumTailTmp);
            if (mConflict >= mConflictTmp && nConflict >= nConflictTmp) { // If the conflict value is smaller, update the number of blocks.
                mConflict = mConflictTmp;
                nConflict = nConflictTmp;
                mL2TileNum_ = curMajorDim;
                nL2TileNum_ = curMinorDim;
            }
        }
    }
}
```
- Staggered core allocation. Given the index of the current data block, obtain the coordinates of the block that the core processes, allocated along the diagonal.
```cpp
__aicore__ inline BlockCoord GetBlockCoord(int64_t tileIdx)
{
    GetCommonTileIndex(tileIdx);
    int64_t mTileIdx = newBlockIdx_ % mL2TileNumTmp_;
    mTileIdx = mTileIdx + mL2Idx_ * mL2TileNum_;
    int64_t nTileIdx = 0;
    if (mL2TileNumTmp_ != 0 && nL2TileNumTmp_ != 0) {
        int64_t tmp = newBlockIdx_ / CalcLcm(mL2TileNumTmp_, nL2TileNumTmp_);
        nTileIdx = (newBlockIdx_ + tmp) % nL2TileNumTmp_;
    }
    nTileIdx = nTileIdx + nL2Idx_ * nL2TileNum_;
    return {mTileIdx * l1M, nTileIdx * l1N, 0};
}
```
- Set the left and right matrices, and run Matmul multiple times, once per L2 cache tile, using the tile size and the block coordinates calculated in the previous steps.
```cpp
L2CacheOpt l2Opt(shapes, blockNum);
matmulObj.SetOrgShape(shapes.m, shapes.n, shapes.k);
for (int64_t tileIdx = curBlockIdx; tileIdx < l2Opt.GetTileNum(); tileIdx += blockNum) {
    auto blockShape = l2Opt.GetBlockShape(tileIdx); // Obtain the size of the L2 cache tile for a single computation.
    if (Get<0>(blockShape) <= 0 || Get<1>(blockShape) <= 0) {
        return;
    }
    auto blockCoord = l2Opt.GetBlockCoord(tileIdx); // Obtain the index blockCoord of the core that performs the current computation.
    matmulObj.SetTail(Get<0>(blockShape), Get<1>(blockShape), Get<2>(blockShape));
    const auto& offsetCoord = CalcOffset(shapes, blockCoord); // Calculate the matrix offset based on the index.
    int64_t offsetA = Get<0>(offsetCoord);
    int64_t offsetB = Get<1>(offsetCoord);
    int64_t offsetC = Get<2>(offsetCoord);
    matmulObj.SetTensorA(aGlobal[offsetA], false);
    matmulObj.SetTensorB(bGlobal[offsetB], false);
    if (shapes.isBias) {
        matmulObj.SetBias(biasGlobal);
    }
    matmulObj.IterateAll(cGlobal[offsetC]); // Calculate the L2 tiling block.
}
matmulObj.End();
```
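The offset step in the loop above can be pictured with a small host-side sketch. The sample's CalcOffset is not shown in this section, so the following is only an assumption about what such a helper might compute: it supposes row-major ND storage, A of size M × K, C of size M × N, and B stored either as K × N or, when `bIsNxK` is true, as N × K. `Offsets` and `TileOffsets` are hypothetical names.

```cpp
#include <cstdint>

struct Offsets {
    std::int64_t a; // element offset into A
    std::int64_t b; // element offset into B
    std::int64_t c; // element offset into C
};

// Element offsets of the tile whose top-left output element is (mStart, nStart).
// Assumption: row-major ND layout. A is M x K, C is M x N.
// B is K x N by default, or N x K when bIsNxK is true.
constexpr Offsets TileOffsets(std::int64_t mStart, std::int64_t nStart,
                              std::int64_t n, std::int64_t k, bool bIsNxK)
{
    return {mStart * k,                       // skip mStart full rows of A
            bIsNxK ? nStart * k : nStart,     // skip rows (N x K) or columns (K x N) of B
            mStart * n + nStart};             // skip mStart rows of C, then nStart columns
}

// Example with N = 4096 and K = 1024, tile starting at output element (256, 512).
static_assert(TileOffsets(256, 512, 4096, 1024, false).a == 262144, "A offset");
static_assert(TileOffsets(256, 512, 4096, 1024, false).b == 512, "B offset, K x N");
static_assert(TileOffsets(256, 512, 4096, 1024, true).b == 524288, "B offset, N x K");
static_assert(TileOffsets(256, 512, 4096, 1024, false).c == 1049088, "C offset");
```

Because each tile addresses its own region of A, B, and C through these offsets, the same Matmul object can be reused across iterations with only the tensors and tail shape updated.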
Verifying Optimization Benefits
The following figure shows the optimized profiling data. The aic_time in column C is 805.6 μs. Compared with the original data, the total execution time is reduced by about 7.1%, and the MTE2 transfer time is reduced by about 10.7%.
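The percentages can be re-derived from the reported timings. The short check below uses only the figures quoted in this case (867 μs and 861.9 μs before, 805.6 μs after). `RelativeReduction` is a hypothetical helper name.

```cpp
// Fraction by which a value shrinks, for example 0.07 for a 7% reduction.
constexpr double RelativeReduction(double before, double after)
{
    return (before - after) / before;
}

// aic_time: 867 us before, 805.6 us after, which is about 7.1% less.
static_assert(RelativeReduction(867.0, 805.6) > 0.0708 &&
              RelativeReduction(867.0, 805.6) < 0.0709, "total time reduction");

// Before the optimization, MTE2 (861.9 us) took about 99.4% of aic_time (867 us).
static_assert(861.9 / 867.0 > 0.994 && 861.9 / 867.0 < 0.995, "MTE2 share before");
```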

Summary
If the data volume of Matmul computation exceeds the size of the L2 cache, you can enable L2 cache splitting to improve the L2 cache hit ratio and use the high bandwidth feature of the L2 cache to improve the operator performance.