Performance Troubleshooting Process
The basic performance tuning process for an LLM is as follows:
Figure 1 Basic performance tuning flow
The key to performance tuning is diagnosing the problem correctly: first demarcate the issue, then apply targeted optimizations.
- First, collect profile data. You can use the Ascend PyTorch Profiler interfaces to collect and analyze the data.
- Next, use the MindStudio Insight visualization tool to demarcate the performance issues. Issues typically fall into three categories: computation, scheduling, and communication.
- In addition, you can use the Advisor tool in mstt to help locate issues directly. Advisor automatically analyzes profile data against a built-in case library and provides performance tuning recommendations.
- Finally, apply the tuning method appropriate to each issue. After each tuning change, re-run the training, collect profile data again, and use MindStudio Insight to check whether the change is effective. Repeat this process until the performance issues are resolved.
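The demarcate-then-verify loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not part of any Ascend tool: the StepBreakdown fields are loosely modeled on the Computing, Communication (Not Overlapped), and Free times that a per-step trace summary typically reports, treating a large Free share as a sign of scheduling (host dispatch) problems is a common heuristic rather than a rule, and the 10% thresholds, 1% gain criterion, and sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepBreakdown:
    # Hypothetical per-step times in microseconds; field names are assumptions.
    computing: float                     # device time spent running compute operators
    communication_not_overlapped: float  # communication time not hidden behind computation
    free: float                          # device idle time, often a symptom of slow host-side scheduling

    @property
    def total(self) -> float:
        return self.computing + self.communication_not_overlapped + self.free


def demarcate(step: StepBreakdown, free_threshold: float = 0.10,
              comm_threshold: float = 0.10) -> list[str]:
    """Map a step breakdown to issue categories using hypothetical share thresholds."""
    issues = []
    if step.free / step.total > free_threshold:
        issues.append("scheduling")
    if step.communication_not_overlapped / step.total > comm_threshold:
        issues.append("communication")
    if not issues:
        # Neither idle time nor exposed communication stands out,
        # so the remaining step time is dominated by computation.
        issues.append("computation")
    return issues


def tuning_effective(before: StepBreakdown, after: StepBreakdown,
                     min_gain: float = 0.01) -> bool:
    """Treat a change as effective if it cuts step time by at least min_gain (1% by default)."""
    return after.total <= before.total * (1.0 - min_gain)


if __name__ == "__main__":
    baseline = StepBreakdown(computing=620_000, communication_not_overlapped=90_000, free=290_000)
    print(demarcate(baseline))                  # ['scheduling']
    tuned = StepBreakdown(computing=620_000, communication_not_overlapped=90_000, free=40_000)
    print(tuning_effective(baseline, tuned))    # True
    print(demarcate(tuned))                     # ['communication']
```

In this made-up example, reducing idle time fixes the scheduling issue, and the re-profiled step then exposes communication as the next target. This mirrors the "repeat until resolved" loop: each round re-collects data and re-demarcates, rather than assuming the original diagnosis still holds.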