MC2 Operator Performance Optimization
Case Study
The performance benefits of MC2 operators come mainly from executing communication and computation in parallel. That is, the matrix for the Matmul computation is split into blocks, and the Matmul computation of the next data block runs in parallel with the communication task of the current data block, so that communication time is hidden behind computation. As shown in the following figure, the matrix for the Matmul computation is split into two parts along the M axis. The Matmul computation of the second data block and the communication of the first data block then run in parallel, improving operator performance by hiding the communication time. In all figures in this section, MM denotes Matmul computation and hcom denotes communication tasks.

The following uses the MC2 operator Matmul+hcom_allReduce (with shape M = 4096, N = 8192, K = 4096, and data type half) as an example to describe how to optimize the performance of MC2 operators.
Obtaining Profile Data
Use the msProf tool to obtain the operator profile data.
- Obtain the profile data collected in the actual environment (ArithmeticUtilization.csv, which records the cycle ratio of each instruction type), including the ratio of each pipeline.
- Obtain the simulation profile data (instruction pipeline chart), including the utilization of each pipeline. Observing the dependencies between pipelines helps optimize parallel efficiency.
Analyzing Main Bottlenecks
As shown in the preceding figure, the performance benefits of MC2 operators come from the parallel execution of tasks; these benefits cannot be achieved in the following scenarios.
- Scenario 1: No data tiling
If data tiling is not performed during the Matmul computation, the MC2 operator runs serially: communication cannot start until the entire computation finishes, so no time is hidden.

- Scenario 2: Large difference in computing time
When the execution times of the Matmul computation and the communication task differ greatly, only a small amount of time can be hidden by parallel execution. The overall execution time is then close to that of the unsplit operator, and no performance benefit is obtained.

- Scenario 3: Poor linearity
Linearity measures whether an output responds linearly to input changes within a certain range. In performance analysis, it indicates whether the execution time of each data block after the input is split is in a linear relationship with the execution time of the original data. Poor linearity may occur on either the Matmul side or the hcom side. The following briefly describes the cases of Matmul linearity as an example.
- Good linearity:
Before data tiling, the Matmul execution time is 200 μs. The Matmul input is evenly split into two parts. Assume that the Matmul execution time of each piece of data is 100 μs after data tiling. Through parallel execution, the actual performance benefit is 100 μs, as shown in the following figure.

- Poor linearity:
Before data tiling, the Matmul execution time is 200 μs. The Matmul input is evenly split into two parts. Assume that the Matmul execution time of each piece of data is 150 μs after data tiling. Through parallel execution, the actual performance benefit is 50 μs, as shown in the following figure.

- Deteriorated linearity:
Before data tiling, the Matmul execution time is 200 μs. The Matmul input is evenly split into two parts. Assume that the Matmul execution time of each piece of data is 200 μs after data tiling. Even with parallel execution, the tiled operator is 50 μs slower than the untiled one, that is, the benefit is negative, as shown in the following figure.

The preceding analysis is based on the linearity of Matmul. In practice, the linearity of Matmul or communication may be involved.
The general optimization principle of MC2 operators is to split data into small blocks as much as possible without deteriorating the computing performance.
The preceding figure shows the operator instruction pipeline before optimization. The Matmul execution time is 888 μs, the hcom_allReduce communication time is 1025 μs, and the total time is 1913 μs.
Developing Optimization Solutions
The preceding analysis shows that MC2 operator performance can be optimized mainly by improving the data tiling strategy.
Verifying Optimization Benefits
- Data tiling phase 1
The left matrix (4096, 4096) of the Matmul is split into two parts along the M axis (tileNum + tailNum in the following code); each data block is a (2048, 4096) matrix. In this case, the single-operator execution time is 1419 μs.
MatmulAllReduceCustomTilingData *tiling = context->GetTilingData<MatmulAllReduceCustomTilingData>();
tiling->param.rankDim = 8;
tiling->param.tileM = 2048;
tiling->param.tileNum = 2;
tiling->param.tailM = 0;
tiling->param.tailNum = 0;
tiling->param.rankM = 4096;
tiling->param.rankN = 8192;
tiling->param.rankK = 4096;
tiling->param.isTransposeA = 0;
tiling->param.isTransposeB = 0;
tiling->param.cToFloatLen = 0;
tiling->param.nd2NzWorkLen = true;
tiling->param.dataType = static_cast<uint8_t>(HCCL_DATA_TYPE_MAP.at(aType));

- Data tiling phase 2
To achieve better linearity, the Matmul input is tiled at a finer granularity into five blocks (tileNum + tailNum in the following code) along the M axis. In this case, the single-operator execution time is 1262 μs.
MatmulAllReduceCustomTilingData *tiling = context->GetTilingData<MatmulAllReduceCustomTilingData>();
tiling->param.rankDim = 8;
tiling->param.tileM = 512;   // tileM is 2048 before optimization.
tiling->param.tileNum = 1;   // tileNum is 2 before optimization.
tiling->param.tailM = 896;   // tailM is 0 before optimization.
tiling->param.tailNum = 4;   // tailNum is 0 before optimization.
tiling->param.rankM = 4096;
tiling->param.rankN = 8192;
tiling->param.rankK = 4096;
tiling->param.isTransposeA = 0;
tiling->param.isTransposeB = 0;
tiling->param.cToFloatLen = 0;
tiling->param.nd2NzWorkLen = true;
tiling->param.dataType = static_cast<uint8_t>(HCCL_DATA_TYPE_MAP.at(aType));

The performance benefits of the two data tiling modes are as follows: compared with the 1913 μs baseline, data tiling phase 1 delivers a 26% single-operator performance benefit, and data tiling phase 2 delivers 34%.

Summary
The performance optimization of MC2 operators focuses on computing performance, communication performance, and a proper tiling strategy.