Analysis Example of Roofline Model Tuning

Background

The Da Vinci chip architecture is complex. Manually identifying its performance bottlenecks from profiling data is time-consuming and requires deep expertise. The Roofline model plots arithmetic intensity on the horizontal axis and floating-point operations per second on the vertical axis to describe the chip's attainable peak performance and the actual characteristics of a workload, showing where performance falls short and suggesting a tuning direction.
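The core of the model can be sketched in a few lines: attainable performance is capped by either the compute peak (the flat roof) or memory bandwidth times arithmetic intensity (the slanted roof). The numbers below are illustrative only, not Da Vinci chip specifications.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline: attainable performance is the minimum of the compute
    peak and bandwidth x arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Illustrative values only, not actual Da Vinci chip specifications.
PEAK = 8000.0   # peak compute, GFLOP/s
BW = 1000.0     # memory bandwidth, GB/s

ridge_point = PEAK / BW  # intensity where the two roofs meet (FLOP/byte)

# Below the ridge point a workload is memory bound; above it, compute bound.
for ai in (2.0, ridge_point, 32.0):
    bound = "memory bound" if ai < ridge_point else "compute bound"
    print(f"intensity={ai:5.1f} FLOP/B -> "
          f"{attainable_gflops(ai, PEAK, BW):7.1f} GFLOP/s ({bound})")
```

An operator's working point sits below this roof; the shorter the distance from the point to the roof, the closer the operator is to the hardware limit.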

MindStudio Advisor Operations

  1. Prepare a MobileNetV3 model, and use the profiling data collected by Profiling, the OM file converted by ATC, and the CCE file as the input files. Place the files in the root of the ${data_path} data directory.
  2. Open a built application project.
  3. Choose Ascend > Advisor on the menu bar. The MindStudio Advisor page is displayed. See Figure 1.
    Figure 1 MindStudio Advisor page
  4. Click New Project in the upper left corner of the page shown in Figure 1. The Advisor system configuration page is displayed.
    Figure 2 OM only
  5. Set related parameters according to Figure 2 and click Start.
  6. After the analysis is complete, the system displays the analysis result. The Roofline analysis result of the MobileNetV3 model is as follows:
    Figure 3 Roofline display of the analysis result

Fault Analysis

  1. The analysis result lists the top N operators, sorted in descending order of AI Core running time. Tune the operators with the highest time ratios first. The fields are described as follows:

    Field               | Description
    --------------------|-------------------------------------------------------------
    Op Name             | Operator name.
    AICore Time(us)     | AI Core running duration (μs).
    Bottleneck Pathway  | Bottleneck data pathway, that is, the shortest path from the working point to the Roofline.
    Bottleneck Rate     | Ratio of the working point to the Roofline upper limit, as a percentage.
    Bottleneck Pipeline | Pipeline with the highest time proportion.
    Pipeline Rate       | Time proportion of the bottleneck pipeline.
    Bound Type          | Bottleneck type.

  2. Based on the Roofline result, you can quickly identify performance bottlenecks and tuning methods. AICore Time and Bottleneck Rate together measure the potential tuning benefit for the model. For example, assume that the bottleneck rate of the MobilenetV3/Conv/Conv2D operator is 25.91% and its execution time is 293.7 μs. If tuning can shorten the execution time by 80%, the estimated benefit is about 200 μs.
  3. The following describes the analysis processes of several operators:
    1. Operator MobilenetV3/Conv/Conv2D: L2->L1 latency memory bound, with a bound ratio of 25.91%. Analyzing the operator's pipeline parallelism shows that the MTE2 pipeline takes the most time, accounting for 91.50% of the operator's total time. Therefore, the operator's bottleneck is MTE2 data transfer. Check whether the data transfer granularity is too small or whether data transfer dependencies exist.
    2. Operator MobilenetV3/expanded_conv/depthwise/depthwise: L1->L0A latency memory bound, with a bound ratio of 35.45%. Analyzing the operator's pipeline parallelism shows that the MTE2 pipeline consumes the most time, but only 46.66% of the operator's total time. Because this is below 80%, the pipeline parallelism is abnormal.
    3. For the MobilenetV3/expanded_conv_3/squeeze_excite/AvgPool operator, the data pathway closest to the Roofline is L2->UB, with a bound ratio of 13.86%; the operator is latency bound. Analyzing the pipeline parallelism shows that the vector pipeline accounts for the largest proportion (85.11%), indicating good pipeline parallelism. Because the vector pipeline dominates the operator's time and the Roofline indicates a latency bound, the vector computation itself is likely inefficient and should be analyzed.
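The triage logic applied to the three operators above can be sketched as a small helper. The 80% parallelism threshold and the field names are taken from this example, not from a MindStudio Advisor API.

```python
def diagnose(op_name, bound_pathway, bound_rate, pipeline, pipeline_rate,
             parallelism_threshold=0.80):
    """Reproduce the triage used above: a pipeline that dominates the
    operator's time (>= 80%) is the bottleneck; a low dominant share
    suggests abnormal pipeline parallelism."""
    if pipeline_rate >= parallelism_threshold:
        return (f"{op_name}: bottleneck on {pipeline} pipeline "
                f"(pathway {bound_pathway}, bound rate {bound_rate:.2%})")
    return (f"{op_name}: no dominant pipeline ({pipeline} only "
            f"{pipeline_rate:.2%}) -> pipeline parallelism is abnormal")

# Values from the three operators analyzed above.
print(diagnose("MobilenetV3/Conv/Conv2D", "L2->L1", 0.2591, "MTE2", 0.9150))
print(diagnose("MobilenetV3/expanded_conv/depthwise/depthwise",
               "L1->L0A", 0.3545, "MTE2", 0.4666))
print(diagnose("MobilenetV3/expanded_conv_3/squeeze_excite/AvgPool",
               "L2->UB", 0.1386, "vector", 0.8511))
```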

Troubleshooting

  1. Each operator requires a specific solution to its specific problem. This section describes how to quickly identify performance problems based on the optimization suggestions provided by the Roofline model in MindStudio Advisor.
  2. Example 1: Analyze the performance problems of the MobilenetV3/expanded_conv/depthwise/depthwise operator. According to the MindStudio Advisor analysis result, the operator's hardware usage is low and the bottleneck ratio of each pathway is also low, so the operator's pipeline parallelism is abnormal. Following the optimization suggestion "Reduce strong data dependencies between pipelines", check whether synchronization dependencies exist between the operator's pipelines. The following figure shows the operator's simulated pipeline: the MTE2 pipeline is idle for a long time and blocks the MTE1 pipeline, confirming that pipeline interruption exists.
    Figure 4 Pipeline interruption
  3. In the original CCE code, MTE2 first transfers one large part of the feature map for cube calculation; only after the cube calculation completes is the second part of the feature map transferred.

    Therefore, move the second feature map transfer instruction to directly after the first transfer instruction. The second transfer can then start as soon as the first completes, reducing data dependencies between instructions.

  4. In addition, according to the optimization suggestion "Eliminate improper instruction synchronization between pipelines", the code may contain unnecessary synchronization instructions. Analysis of the CCE code shows that a pipe_barrier(PIPE_ALL) instruction exists between two pipelines; this instruction can be deleted.
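The effect of reordering the second transfer can be illustrated with a toy timeline model. The durations are invented for illustration; this is not CCE code and not real Da Vinci timing.

```python
# Toy timeline model of the fix above (illustrative durations in us;
# not real CCE instructions or Da Vinci chip timings).
MTE2 = 10.0   # time of one feature-map transfer
CUBE = 12.0   # time of one cube calculation

# Before: transfer 1 -> cube 1 -> transfer 2 -> cube 2
# (the second transfer waits for the first cube calculation to finish).
serial = MTE2 + CUBE + MTE2 + CUBE

# After: the second transfer is issued right after the first, so it
# overlaps with the first cube calculation; cube 2 starts as soon as
# both cube 1 and transfer 2 are done.
t_transfer2_done = MTE2 + MTE2
t_cube1_done = MTE2 + CUBE
overlapped = max(t_cube1_done, t_transfer2_done) + CUBE

print(f"serialized: {serial:.1f} us, overlapped: {overlapped:.1f} us")
```

In this toy model, overlapping the second transfer with the first cube calculation shortens the schedule from 44 μs to 34 μs, which is the same effect the instruction reordering achieves on the real pipeline.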

Conclusion

You can use MindStudio Advisor to quickly find the top bottlenecks of a model and use the provided optimization suggestions to guide performance analysis, improving model optimization efficiency.