MC² Operator Performance Optimization

Case Study

The performance benefits of an MC² operator stem from parallel execution of computation and communication. Specifically, the input data is tiled into multiple blocks. The computation and communication tasks of the blocks form two separate pipelines, which are executed in parallel to implement pipeline overlapping, thereby enhancing the operator's performance. As shown in the following figure, in the scenario where the MC² operator performs Matmul computation and then communication, the input matrix is divided into two blocks along the M axis. The Matmul computation of the second block can be executed in parallel with the communication of the first block, thus saving time. In all figures in this section, MM indicates Matmul computing tasks, and hcom indicates communication tasks.

This case describes how to analyze the performance benefits of the MC² operator and how to formulate a proper data tiling strategy. For more complete samples of MC² operators, see MatmulAllReduce sample, MatmulReduceScatter sample, and AllGatherMatmul sample.

Obtaining Profile Data

Use the msProf tool to obtain the operator profile data.

Obtain the profile data on the board, including the proportion of each pipeline.
Obtain the simulation profile data (instruction pipeline chart), including the utilization of each pipeline. You can observe the dependency between pipelines to optimize the parallelism efficiency.

Analyzing Main Bottlenecks

The performance benefit formula of an MC² operator is as follows:

Serial execution time of operators before fusion = Execution time of the computation operator before fusion + Execution time of the communication operator before fusion

MC² operator benefit = (Serial execution time of operators before fusion – Execution time of the MC² operator after fusion)/Serial execution time of operators before fusion

The execution time of the MC² operator is restricted by the following factors, which affect the operator performance benefits.

Factor 1: Execution time difference between computation and communication tasks
If the execution time difference between computation and communication tasks is small, the computation and communication of the MC² operator can be executed in parallel, achieving pipeline overlapping and high performance benefits.

If the execution time difference between computation and communication tasks is large, the computation and communication in the MC² operator can be executed in parallel after fusion, but not that time-saving. The overall execution time of the operators is close to that of the operators when the data is not tiled in serial execution mode. In this case, the performance benefits are not obvious.
Factor 2: Execution time expansion of computation or communication caused by data tiling
After the input data is tiled into smaller blocks, the Matmul computation or communication task runs on these blocks separately, with expanded or longer execution time compared with that before data tiling. The expansion can be caused by the following: lower computation or communication efficiency due to excessively small data blocks after tiling, additional scheduling overhead due to too many data blocks, and conflict of the access to the L2 cache or device memory after parallel execution of computation and communication. The following uses Matmul as an example to describe the possible execution time expansion situations after data tiling.
- No expansion:
  Before data tiling, the Matmul execution time is 200 μs. The Matmul input is evenly split into two parts. Assume that the Matmul execution time of each piece of data is 100 μs after data tiling. Through parallel execution, the actual performance benefit is 100 μs, as shown in the following figure.
- Moderate expansion:
  Before data tiling, the Matmul execution time is 200 μs. The Matmul input is evenly split into two parts. Assume that the Matmul execution time of each piece of data is 150 μs after data tiling. Through parallel execution, the actual performance benefit is 50 μs, as shown in the following figure.
- Severe expansion:
  Before data tiling, the Matmul execution time is 200 μs. The Matmul input is evenly split into two parts. Assume that the Matmul execution time of each piece of data is 200 μs after data tiling. Through parallel execution, the actual performance benefit is 50 μs worse, as shown in the following figure.

As analyzed above, the execution time of the computation and communication is relatively balanced, and pipeline overlapping and performance benefits are better. However, the performance benefits are also affected by the potential execution time expansion caused by data tiling. The following describes how to formulate a data tiling strategy to achieve the optimal pipeline overlapping effect.

Optimization Solution

Take the MatmulAllReduce operator of the Atlas A2 training products / Atlas A2 inference products as an example. The input data format is ND and the data type is half. In this operator, the computation task is performed before the communication task. Assume that the shape of the left matrix in Matmul is [M, K], the shape of the right matrix is [K, N], and the communication object in the operator is the output matrix of Matmul. In this case, the input shape of the communication task is [M, N]. Because the K axis exists only in Matmul computation, a longer K axis leads to increased computation workload, making computation time longer than communication time. In this case, computation is the operator bound (bottleneck). Conversely, when the K axis is short, the computation workload is small, and computation is faster than communication. In this case, communication becomes the operator bound. Before formulating a data tiling strategy, perform computation and communication tasks respectively on the original matrix, and identify the bound based on the execution time of the two tasks.

The data tiling strategy of the operator must meet the following requirements:

Tile the M axis only. The HCCL APIs called by the communication task require that the memory of the tiled data be contiguous. If the M axis is tiled, the memory of each row of data is contiguous. Conversely, if the N axis is tiled, each row of data is cut off and its memory is discontiguous, which does not meet the communication requirements.
If A indicates a long block and B indicates a short block, only the contiguous arrangement of A or B can be achieved, for example, AAAB and BAAA.

As described above, the objective of data tiling is to achieve as much pipeline overlapping as possible. Based on the execution time difference between computation and communication tasks, there are two specific scenarios, which have their own tiling objectives:

Computation bound:
For the same data block after tiling, the computation execution time is longer than the communication execution time. In this case, computation is contiguous, and the tail block of communication should be short, as shown in the following figure.

Figure 1 Computation bound
Communication bound:
For the same data block after tiling, the communication execution time is longer than the computation execution time. In this case, communication is contiguous, and the head block of computation should be short, as shown in the following figure.

Figure 2 Communication bound

Preparations:

Before final data tiling, you need to identify the bound and fit the data size and execution time for Matmul computation and AllReduce communication. The procedure is as follows:

Execute the Matmul computation and AllReduce communication tasks separately on the input data, use the msProf tool to collect the execution time, and identify the task with a longer time as the bound. For example, if communication takes longer than computation, the scenario is defined as communication bound.
Tile the input data along the M axis into several data blocks whose M axis lengths are 256, 512, 768, 1024, and 2048. You can adjust the block sizes as required.
Perform AllReduce communication on the data blocks obtained in step 2 and use the msProf tool to collect the execution time t₁, t₂, ..., t_n of each data block. Then, analyze the relationship between the data size and execution time, and fit the result to obtain the formula t = CostComm(m). Specifically, m indicates the length of the M axis of the data block, t indicates the communication execution time of the data block, and CostComm indicates the mapping between m and t. Generally, the mapping is linear. If it is not linear, segmented fitting can be used. The following is an example:

Data size: x = m * N * sizeof(dataType), in bytes. The fitting formula is as follows:
- If x is less than 8 MB, t = –0.9698202 * x * x + 27.0622573 * x + 14.769, in μs.
- If x is greater than or equal to 8 MB, t = 13.58491263 * x + 61.508333, in μs.
Perform Matmul computation on each data block obtained in step 2, collect the computation execution time t of each data block in the same way as in step 3, and fit the formula t = CostMM(m) for the mapping between the M axis length (m) and the computation execution time (t).

Tiling algorithm steps:

Set the proper short block length in the M-axis direction based on the shapes (M, K, and N values) of the input matrices. In the following expressions, a, b, and c indicate the alternative short block lengths based on empirical values. After K and N are substituted into the following three expressions, the minimum value among a, b, and c is used as the selected short block length (m0).
- a * K * N ≥ 4 * 1024 * 1024 * 1024, where you need to find the smallest a that satisfies the inequality
- b * K * N / 1024 + b * N ≥ 6 * 1024 * 1024, where you need to find the smallest b that satisfies the inequality
- c ≥ 3 * 128, where you need to find the smallest c that satisfies the inequality
- m0 = min(a, b, c)
Based on the short block length m0 and the fitting formula, the computation execution time t0 = CostMM(m0) * 1.15 and the communication execution time t1 = CostComm(m0) * 1.15 are obtained.
Note: When communication and computation are executed in parallel, memory bandwidth preemption may occur, prolonging the execution time. Generally, the factor 1.15 is multiplied by the fitting formula based on empirical values. You can adjust the factor as required.
Based on the length of the short block, balance the communication and computation according to Figure 1 or Figure 2 to obtain the length of the long block. Align the length of the long block with 128 elements to ensure computation affinity. This case is a communication bound scenario. The balancing here is to equalize the communication time of the short block with the computation time of the long block. Use t1 as the computation execution time of the long block and substitute it into the formula t1 = CostMM(m1) to calculate m1, which is the length of the long block.
Calculate the number of long blocks by using count = (M – m0)/m1, where m0 indicates the length of the short block, m1 indicates the length of the long block, and M indicates the original M-axis length. Generally, the formula cannot be exactly divided, so the following operations are required:
- Discard the decimal part of the result and retain the integer part as the number of blocks.
- Because the decimal part is discarded, the M-axis length has a remainder. Therefore, you need to adjust the length of the long block, that is, m1 = (M – m0)/count.
- To ensure computation affinity, align m1 down to the nearest integer multiple of 128, which yields the updated result of m1.
- After m1 is adjusted, the M-axis length still has a remainder. Therefore, adjust the length of the short block, that is, m0 = M – (m1 * count).
- The final results of m0, m1, and count are obtained.

Verifying Optimization Benefits

Formulating a tiling strategy and verifying performance benefits

In this MatmulAllReduce case, the shape of the input matrix is M = 4096, K = 3072, N = 8192, the data type is half, and the number of kernels is 8. The msProf tool is used to collect data before operator fusion. The execution time of Matmul computation is 803 μs, the execution time of AllReduce communication is 1071 μs, and the total time is 1874 μs, which is a communication bound scenario. According to the preceding tiling algorithm, the tiling in this case is as follows:

Based on empirical values, the short block length m0 (the header block, that is, the first data block after tiling, in the communication bound scenario) in the M direction is set to 384. The communication data size x is calculated as follows: 384 * 8192 * 2/1024/1024 = 6 MB. According to the communication fitting formula, the estimated communication execution time is 143 μs. Considering that a memory bandwidth conflict may occur, the factor 1.15 is multiplied to obtain the communication execution time 164 μs.
Calculate the length of the long block in the M direction. According to the communication time of the short block, the computation time of the long block is also 164 μs. According to the computation fitting formula, the length m1 is 768.
Calculate the number of long blocks based on M = 4096, m0 = 384, and m1 = 768, that is, (4096 – 384)/768 = 4.83. Round down the value to 4.
Adjust the long block length m1 based on the short block length (m0 = 384) and the number of long blocks (count = 4), that is, (4096 – 384)/4 = 928. Align this value down to the nearest multiple of 128, resulting in 896 (updated result of m1).
Adjust the short block length m0 based on the long block length (m1 = 896) and the number of long blocks (count = 4), that is, (4096 – 896 * 4 = 512).
Finally, the original input matrix is divided into five data blocks with lengths of {512, 896, 896, 896, 896}.

As shown in the following code, set the tiling strategy in the tiling code of the operator. According to the tiling strategy, the execution time of the fused operator is 1262 μs. The performance benefit of the fused operator is (1874 – 1262)/1874 = 32.7%.

         
              MatmulAllReduceCustomTilingData *tiling = context->GetTilingData<MatmulAllReduceCustomTilingData>();
tiling->param.rankDim = 8;
tiling->param.tileM = 512; // Short block length
tiling->param.tileNum = 1; // Number of short blocks
tiling->param.tailM = 896; // Long block length
tiling->param.tailNum = 4; // Number of long blocks
tiling->param.rankM = 4096;
tiling->param.rankN = 8192;
tiling->param.rankK = 4096;
tiling->param.isTransposeA = 0;
tiling->param.isTransposeB = 0;
tiling->param.cToFloatLen = 0;
tiling->param.nd2NzWorkLen = true;
tiling->param.dataType = static_cast<uint8_t>(HCCL_DATA_TYPE_MAP.at(aType));

Adjustment for execution time expansion after tiling
As mentioned above, data tiling causes the expansion of the computation or communication execution time, which causes the deviation between the measured results and the theoretical values. For example, when there are a large number of tiled data blocks, the expansion of the execution time has a great impact on the performance. As a result, the performance may gain fewer benefits or even deteriorate. Therefore, you need to adjust the tiling strategy based on the theoretical values and the measured results.

Congratulations

The MC² operator obtains performance benefits by parallel execution of computation and communication after data tiling. However, the performance benefits are affected by the execution time expansion after data tiling. The main method of performance optimization for the MC² operator is to formulate a data tiling strategy based on theoretical derivation, and then adjust the strategy based on the measured results to find the optimal tiling strategy.

Parent topic: Best Practices