Performance Tuning Cases

Symptom

During traditional model inference, the throughput is low. You need to locate the issue and optimize the performance.

Tuning Procedure

Collect data.
Use the msprof command of the performance tuning tool to collect profile data.
```
msprof --output=save_path python3 main.py
```
After the execution is complete, the collected data is generated in the directory specified by the --output option.
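
The collection step above can also be scripted. The following is a minimal sketch, assuming that msprof is available on the PATH and that main.py is the inference entry point, as in the command above. The helper names build_msprof_command and collect_profile are hypothetical, and only the --output option shown in this case is used.

```python
import shlex
import subprocess
from pathlib import Path


def build_msprof_command(output_dir, app_cmd):
    """Build the msprof command line from the step above as an argument list."""
    # --output is the only msprof option used here because it is the only one
    # shown in this case; app_cmd is the inference program to profile.
    return ["msprof", f"--output={output_dir}", *app_cmd]


def collect_profile(output_dir="save_path", app_cmd=("python3", "main.py")):
    """Run the inference program under msprof and return the output directory."""
    cmd = build_msprof_command(output_dir, list(app_cmd))
    print("Running:", shlex.join(cmd))
    # check=True raises CalledProcessError if msprof or the program fails.
    subprocess.run(cmd, check=True)
    return Path(output_dir)
```

Calling collect_profile() runs the same command as above and returns the directory specified by the --output option.
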
Import the parsed profile data into MindStudio Insight for analysis.
- Data overview
  Choose Timeline > System View > Overlap Analysis to view the computing and free time percentages, as shown in Figure 1.
  
  Figure 1 Viewing Overlap Analysis
  
  The model inference runs for two iterations. Between iterations, data is released and reloaded (depending on the model execution mode), which produces a relatively large amount of free time. Therefore, manually select the data of a single iteration for aggregated analysis, as shown in Figure 2. Within that iteration, computing time accounts for 75% and free time accounts for 25%. The figure shows a ModelExecute task, which indicates that inference runs in graph mode, a mode that can reduce scheduling gaps. Overall, task dispatching is relatively efficient, so further optimization should focus on the computation side.
  
  Figure 2 Analyzing data of an iteration
- Memory usage analysis
  The NPU uses a large-core hardware architecture, so a larger batch size makes fuller use of its computing resources and increases throughput. Check the available memory and, if enough is free, increase the batch size.
  
  The initial batch size is set to 4, and the memory usage is low, as shown in Figure 3.
  
  Figure 3 Memory usage analysis
  
  The throughput of the inference program running on AISBench is about 174, as shown in Figure 4. For details about how to use the AISBench tool, see AISBench Inference Tool User Guide.
  
  Figure 4 Inference program running result
  
  Increase the batch size to 1024, use the ATC tool to convert the ONNX model to the OM format, and check the memory again. Memory usage is now higher, as shown in Figure 5.
  
  Figure 5 Checking the memory
  
  The throughput of the inference program running on AISBench reaches 1869, as shown in Figure 6. The throughput is significantly improved.
  
  Figure 6 Inference program throughput
- Operator analysis
  The goal of operator analysis is to raise Cube usage by increasing the percentage of AI Core operators. The operator analysis page (Figure 7) shows that Vector-type operators dominate: there are many tensor operations but few matrix operations.
  
  Figure 7 Viewing the operator page
  
  Check the timeline in Figure 8. Although the overall computing ratio has increased greatly, most of the computing time is spent on ExpandD operators rather than on matrix operations. The ExpandD operator extends the dimensions of a tensor: it copies the tensor along a specified dimension to increase the size of that dimension.
  
  Figure 8 Viewing the timeline page
  
  To improve the operator's performance, change its input data type by adding cast conversions before or after the operator.
  
  For example, converting the input from the bool type (Figure 9) to the int32 type (Figure 10) shortens the task duration of the ExpandD operator from 36223.904 μs to 19.781 μs, which is roughly 1,830 times shorter and significantly improves operator execution efficiency.
  
  Figure 9 bool
  
  Figure 10 int32
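
The cast conversion described in the operator analysis can be checked for correctness on the host before the model is modified. The following NumPy sketch is only an illustration of the idea: np.broadcast_to stands in for the expand step, the helper expand_via_int32 is hypothetical, and nothing here runs on the NPU or says anything about NPU timing. It shows that casting bool to int32 before the expand and back to bool after it yields the same values as expanding the bool tensor directly.

```python
import numpy as np


def expand_via_int32(mask, shape):
    """Expand a bool tensor by casting to int32 first and back to bool after."""
    as_int = mask.astype(np.int32)             # cast before the expand
    expanded = np.broadcast_to(as_int, shape)  # host-side stand-in for the expand step
    return expanded.astype(np.bool_)           # cast after, restoring the original dtype


mask = np.array([[True], [False], [True]])     # shape (3, 1)
direct = np.broadcast_to(mask, (3, 4))         # expand applied to bool directly
via_int = expand_via_int32(mask, (3, 4))

# Both paths must agree in dtype and in every element.
assert via_int.dtype == np.bool_
assert np.array_equal(direct, via_int)
print(via_int.shape)
```

Because the two paths produce identical values, the conversion changes only the data type that the expand step operates on, not the result seen by the rest of the model.
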
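
The 75% computing and 25% free split reported in the data overview can be understood as interval arithmetic on the timeline. The sketch below assumes a simplified model in which every moment of the selected iteration is either computing or free. The function overlap_summary and the interval values are made up for illustration and are not read from real profile data.

```python
def overlap_summary(window, computing_intervals):
    """Return (computing %, free %) of a time window.

    window: (start, end) of one iteration, in microseconds.
    computing_intervals: (start, end) pairs during which the device was computing.
    """
    win_start, win_end = window
    total = win_end - win_start
    if total <= 0:
        raise ValueError("window must have positive length")

    # Clip every interval to the window and drop those that fall outside it.
    clipped = sorted(
        (max(s, win_start), min(e, win_end))
        for s, e in computing_intervals
        if e > win_start and s < win_end
    )

    # Merge overlapping intervals so that no time is counted twice.
    busy = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        busy += cur_end - cur_start

    computing_pct = 100.0 * busy / total
    return computing_pct, 100.0 - computing_pct


# Hypothetical intervals for one iteration, chosen to give a 75% / 25% split.
iteration = (0.0, 1000.0)
kernels = [(0.0, 300.0), (250.0, 500.0), (600.0, 850.0)]
computing, free = overlap_summary(iteration, kernels)
print(f"computing {computing:.0f}%, free {free:.0f}%")  # computing 75%, free 25%
```

Restricting the window to a single iteration, as done in Figure 2, keeps the release and reload gap between iterations out of the free-time percentage.
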
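
The figures quoted in this case can be put side by side with a few lines of arithmetic. The sketch below uses only the numbers stated above (batch size 4 to 1024, throughput about 174 to 1869, ExpandD task duration 36223.904 μs to 19.781 μs). The helper ratio is hypothetical.

```python
def ratio(smaller, larger):
    """Return how many times `larger` exceeds `smaller`."""
    if smaller <= 0:
        raise ValueError("smaller must be positive")
    return larger / smaller


batch_gain = ratio(4, 1024)                 # batch size before and after
throughput_gain = ratio(174, 1869)          # AISBench throughput before and after
expand_gain = ratio(19.781, 36223.904)      # ExpandD duration after and before, in μs

print(f"batch size:        {batch_gain:.0f}x")       # 256x
print(f"throughput:        {throughput_gain:.1f}x")  # 10.7x
print(f"ExpandD duration:  {expand_gain:.1f}x shorter")  # 1831.2x shorter
```

Throughput grew by about 10.7 times while the batch size grew by 256 times, so throughput in this case does not scale linearly with batch size, which is consistent with the remaining time being dominated by individual operators such as ExpandD.
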
