Performance Tuning Cases
Symptom
During traditional model inference, the throughput is low. You need to locate the issue and optimize the performance.
Tuning Procedure
- Collect data.
Use the msprof command of the performance tuning tool to collect profile data:
msprof --output=save_path python3 main.py
After the execution is complete, the collected data is generated in the directory specified by the --output option.
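The collection step can also be scripted. The following is a minimal sketch, not part of the tool itself: the helper names build_msprof_command and collect_profile are hypothetical, and only the --output option shown above is used. The run is skipped when msprof is not found on PATH.

```python
import shutil
import subprocess


def build_msprof_command(output_dir, script, python="python3"):
    """Return the msprof invocation shown above as an argument list."""
    return ["msprof", f"--output={output_dir}", python, script]


def collect_profile(output_dir, script):
    """Run msprof if it is installed; return its exit code, or None if it is absent."""
    if shutil.which("msprof") is None:
        return None
    result = subprocess.run(build_msprof_command(output_dir, script), check=False)
    return result.returncode


if __name__ == "__main__":
    # save_path and main.py are the placeholder names used in the command above.
    print(" ".join(build_msprof_command("save_path", "main.py")))
```

Passing the command as an argument list rather than a shell string avoids quoting problems when the output path contains spaces.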
- Import the parsed profile data into MindStudio Insight for analysis.
- Data overview
Choose Timeline > System View > Overlap Analysis to view the computing and free time percentages, as shown in Figure 1.
The model inference runs for two iterations. Between iterations, data is released and reloaded (depending on the model execution mode), which produces a relatively large amount of free time that does not reflect steady-state behavior. For this reason, manually select the data from a single iteration for aggregated analysis, as shown in Figure 2. Within one iteration, computing accounts for 75% of the time and free time accounts for 25%. The figure shows a ModelExecute task, which indicates that inference is running in graph mode, a mode that reduces scheduling gaps. Overall, task dispatching is relatively efficient, so further optimization should focus mainly on the computation side.
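To make the computing and free percentages concrete, the sketch below shows one way such a split could be derived from kernel time ranges inside a single iteration window. It is an illustration only, not how MindStudio Insight computes its figures: the function names are hypothetical, the intervals are made-up values chosen to reproduce a 75/25 split, and communication time is ignored.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals and return them in sorted order."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def computing_and_free_share(kernels, window):
    """Return (computing %, free %) for kernel intervals clipped to one window."""
    w_start, w_end = window
    clipped = [
        (max(start, w_start), min(end, w_end))
        for start, end in kernels
        if end > w_start and start < w_end
    ]
    busy = sum(end - start for start, end in merge_intervals(clipped))
    computing = 100.0 * busy / (w_end - w_start)
    return computing, 100.0 - computing


# Hypothetical kernel intervals in microseconds; not taken from the profile.
kernels = [(0, 30), (20, 50), (60, 85), (120, 140)]
iteration = (0, 100)  # a single iteration; the last kernel falls outside it
computing, free = computing_and_free_share(kernels, iteration)
print(f"computing {computing:.0f}%, free {free:.0f}%")
```

Restricting the window to one iteration excludes the gap between iterations, which is why the single-iteration selection gives a more representative split than the whole trace.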
- Memory usage analysis
The NPU uses a large-core hardware architecture, so a larger batch size makes fuller use of its computing resources and increases throughput. Check the available memory, and increase the batch size if enough memory is free.
The initial batch size is set to 4, and the memory usage is low, as shown in Figure 3.
The throughput of the inference program running on AISBench is about 174, as shown in Figure 4. For details about how to use the AISBench tool, see AISBench Inference Tool User Guide.
Increase the batch size to 1024, use the ATC tool to convert the ONNX model to the OM format again with the new batch size, and recheck the memory. Memory usage is now noticeably higher, as shown in Figure 5.
The throughput of the inference program running on AISBench reaches 1869, as shown in Figure 6. This is roughly 10.7 times the initial throughput of about 174.
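The effect of the batch size change can be summarized with simple arithmetic on the figures reported above. The sketch below is illustrative; scaling_summary is a hypothetical helper, and the initial throughput is the approximate value of 174 quoted earlier.

```python
def scaling_summary(batch_before, tput_before, batch_after, tput_after):
    """Return (batch ratio, throughput ratio, throughput gain per unit of batch growth)."""
    batch_ratio = batch_after / batch_before
    tput_ratio = tput_after / tput_before
    return batch_ratio, tput_ratio, tput_ratio / batch_ratio


# Batch sizes and throughput values as reported in this case.
batch_ratio, speedup, efficiency = scaling_summary(4, 174, 1024, 1869)
print(f"batch size x{batch_ratio:.0f}, throughput x{speedup:.2f}")
print(f"throughput gain per unit of batch growth: {efficiency:.3f}")
```

The batch size grows 256-fold while throughput grows about 10.7-fold, so the gain is far from linear. This is consistent with the next step, which looks for the remaining bottleneck in the operators rather than in a still larger batch size.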
- Operator analysis
Operator analysis seeks to boost Cube usage by raising the percentage of AI Core operators. The operator analysis page (Figure 7) shows that Vector-type operators dominate, with many tensor operations but few matrix operations.
Check the timeline in Figure 8. Although the overall computing ratio has increased considerably, most of the computing time is spent on ExpandD operators rather than on matrix operations. The ExpandD operator extends a tensor by copying it along a specified dimension, which increases the size of that dimension.
You can improve the operator's performance by changing its input data type, which is done by adding cast conversions before or after the operator.
For example, the bool type in Figure 9 is converted to the int32 type in Figure 10. The task duration of the ExpandD operator drops from 36223.904 μs to 19.781 μs, a reduction of roughly 1,800 times, which significantly improves operator execution efficiency.
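The cast-around-the-operator idea can be sketched on the CPU with NumPy. This is an assumption-laden illustration: np.broadcast_to stands in for the expand operation, the input mask is hypothetical, and the code demonstrates only that the result is unchanged, not the NPU timing difference. In the actual model, the casts would be added in the model definition or the exported graph.

```python
import numpy as np


def expand_via_int32(mask, shape):
    """Cast a bool tensor to int32, expand it, then cast the result back to bool."""
    as_int = mask.astype(np.int32)             # cast before the expand
    expanded = np.broadcast_to(as_int, shape)  # NumPy stand-in for the expand step
    return expanded.astype(np.bool_)           # cast after the expand


# Hypothetical 3x1 bool input expanded along the second dimension.
mask = np.array([[True], [False], [True]])
direct = np.broadcast_to(mask, (3, 4))
via_cast = expand_via_int32(mask, (3, 4))

# The int32 detour must not change the values seen by downstream operators.
print(np.array_equal(direct, via_cast))
```

Checking equivalence like this before and after the change is a cheap way to confirm that the type conversion affects only performance, not model output.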