Analysis Sample of Function Computing Performance Tuning in Network Applications
Background
It is found that a PyTorch network application takes a long time overall to perform inference on the Ascend platform. To find out the cause, we use the Profiling tool to analyze the inference duration of the network application. The analysis result shows that execution of the aclmdlExecute API takes a long time, and further analysis shows that the execution time of the Conv operator is the longest.
Checking the OM model converted from the PyTorch network shows that the Conv operator consists of multiple compute units, which incur high inference overhead. The function that contains the Conv operator is the Mish activation function. Currently, the Ascend platform supports only the ReLU, Leaky ReLU, PReLU, ELU, and SReLU activation functions; because Mish is not among them, it is split into multiple compute units during model conversion.
Replacing the Mish activation function in the OM model with an activation function supported by the Ascend platform reduces the inference duration. We take the replacement of Mish with Leaky ReLU as an example. After the replacement, profiling is performed again, and the result shows that the inference duration is significantly reduced.
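As background for the analysis below, the decomposition of Mish and its proposed replacement can be sketched in plain Python. This is a minimal illustration using only the math module; the exact compute units generated by model conversion depend on the converter:

```python
import math

def softplus(x):
    # Softplus: ln(1 + e^x), one of the compute units Mish splits into
    return math.log1p(math.exp(x))

def mish(x):
    # Mish(x) = x * tanh(softplus(x)): composed of Softplus, Tanh, and Mul,
    # which is why the unsupported Mish is lowered to multiple compute units
    return x * math.tanh(softplus(x))

def leaky_relu(x, negative_slope=0.01):
    # Leaky ReLU: a single, natively supported activation used as the replacement
    return x if x >= 0 else negative_slope * x
```

For non-negative inputs the two functions behave similarly (for example, mish(1) ≈ 0.865 while leaky_relu(1) = 1), which is one reason Leaky ReLU is a common drop-in replacement.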
- This section describes only the Profiling tool-related operations and analysis process. Other operations such as operator analysis and function replacement of the OM model are not described here.
- For details about how to convert a PyTorch model into an OM model, see "Saving and Exporting a Model" in PyTorch Training Model Porting and Tuning Guide.
Profiling Operations
- Start MindStudio IDE and open a built project.
For details about how to create and build an application project, see Application Development.
- On the menu bar, choose Ascend > System Profiler > New Project. The profiling configuration window is displayed.
- In the window shown in Figure 1, set Project Name and Project Location. Click Next.
- Access the Executable Properties page. Set the path for storing the executable file of the profiling project. See Figure 2.
- Access the Profiling Options page and select Task-based. See Figure 3.
- After the preceding configurations are complete, click Start in the lower right corner of the window to start Profiling.
After the execution is complete, the profiling results are automatically displayed in the MindStudio IDE window. See Figure 4.
Fault Analysis
- According to the field description in the Profiling Timeline View, the time consumption data of models and operators is displayed in the AscendCL API field.
Set the color ratio to "20% ≤ Yellow < 50% ≤ Red" on the palette, as shown in Figure 5. For details, see Timeline Color Configuration.
With the new timeline color scale applied, zoom in on the view, as shown in Figure 6. It shows that the two APIs that take the longest time are aclmdlLoadFromFileWithMem and aclmdlExecute.
- Right-click the AscendCL API field, choose Show in Event View from the shortcut menu, and sort the Duration column in descending order, as shown in Figure 7.
- The view shows that the two most time-consuming AscendCL APIs are aclmdlLoadFromFileWithMem and aclmdlExecute. Continue to check the Statistics View of the AscendCL API and sort the APIs by call duration percentage in descending order, as shown in Figure 8.
It can be basically determined that the two APIs that consume the most time during application inference are aclmdlLoadFromFileWithMem and aclmdlExecute.
According to "AscendCL API Reference" in the AscendCL Application Software Development Guide (C&C++), aclmdlLoadFromFileWithMem loads an offline model from a file. The time spent in this API therefore depends on the time required to load the offline model, which currently cannot be tuned.
- aclmdlExecute is a synchronous API that executes a model for inference until the inference result is returned.
It is an execution API; in other words, the long API duration occurs during model inference execution. The model execution time is the sum of the execution times of all operators in the model.
Then check the operator execution time in the AI Core Metrics View of the profiling result and sort the results by Task Duration in descending order, as shown in Figure 9.
It is found that the execution time of the first Conv operator is 3170.625 μs, which is much longer than that of other operators in the same process. It can be determined that this operator slows down the overall execution efficiency.
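The sorting performed in the Event View and AI Core Metrics View can be reproduced offline on exported profiling data. A minimal sketch follows; the record fields and operator names here are illustrative, not the tool's exact export format:

```python
# Hypothetical operator records from a profiling run
# (only the Conv duration of 3170.625 us comes from the analysis above;
# the other names and values are illustrative)
records = [
    {"op": "Conv2D_1", "task_duration_us": 3170.625},
    {"op": "Conv2D_2", "task_duration_us": 412.5},
    {"op": "Mul_1", "task_duration_us": 98.75},
    {"op": "TanH_1", "task_duration_us": 87.0},
]

# Sort by Task Duration in descending order, as done in the view
records.sort(key=lambda r: r["task_duration_us"], reverse=True)

# The top entry is the operator that dominates the execution time
slowest = records[0]
print(slowest["op"], slowest["task_duration_us"])
```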
At this point, the analysis with the Profiling tool is complete.
- By examining the code, we find that the Softplus, Tanh, and Mul compute units implement the formula of the Mish activation function, as shown in Figure 10. The Ascend platform supports only the ReLU, Leaky ReLU, PReLU, ELU, and SReLU activation functions; Mish is not supported currently, so after model conversion it is split into multiple compute units.
The easiest way to solve this problem is to find a more efficient alternative function.
Troubleshooting
Use the Leaky ReLU activation function provided by the Ascend platform as the replacement. Function replacement is performed by users and is not described here.
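The replacement itself is a standard module swap. In a PyTorch model it would operate on torch.nn modules; the framework-agnostic pattern can be sketched without PyTorch installed. All class and attribute names below are illustrative, not part of any real model:

```python
import math

class Mish:
    # Stand-in for the unsupported activation
    def __call__(self, x):
        return x * math.tanh(math.log1p(math.exp(x)))

class LeakyReLU:
    # Stand-in for the supported replacement activation
    def __init__(self, negative_slope=0.01):
        self.negative_slope = negative_slope
    def __call__(self, x):
        return x if x >= 0 else self.negative_slope * x

def replace_activations(module, old_cls, new_factory):
    # Walk the module's attributes and swap every instance of old_cls.
    # In PyTorch the same idea recurses with named_children() and setattr.
    for name, attr in vars(module).items():
        if isinstance(attr, old_cls):
            setattr(module, name, new_factory())
        elif hasattr(attr, "__dict__"):
            replace_activations(attr, old_cls, new_factory)
    return module

# Illustrative model structure containing a Mish activation
class ConvBlock:
    def __init__(self):
        self.act = Mish()

class Model:
    def __init__(self):
        self.block = ConvBlock()

model = replace_activations(Model(), Mish, LeakyReLU)
```

After the swap, every former Mish site calls Leaky ReLU instead, so the converted model no longer contains the multi-unit Mish decomposition.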
Conclusion
After using the Profiling tool to compare the inference durations of the network application before and after the replacement, it is found that with the Leaky ReLU activation function, the execution duration of the Conv operator during inference is reduced and the inference efficiency is improved.