Performance Troubleshooting Process

The basic performance tuning process for an LLM is as follows:

Figure 1 Basic performance tuning flow

The most important part of performance tuning is diagnosing the problem correctly: first demarcate the issue, then apply targeted optimizations.

  1. First, collect profile data. You can use the Ascend PyTorch Profiler interfaces to collect and analyze the data.
  2. Next, use the MindStudio Insight visualization tool to demarcate the performance issues. Issues typically fall into three categories: computation, scheduling, and communication.
  3. In addition, you can use the Advisor tool in mstt to help locate issues. Advisor automatically analyzes the profile data against a built-in case library and provides performance tuning recommendations.
  4. Finally, apply the tuning methods appropriate to each issue. After each tuning round, re-run the training, collect profile data again, and use MindStudio Insight to check whether the tuning was effective. Repeat this process until the performance issues are resolved.
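
The loop described in the steps above (profile, demarcate, tune, re-profile) can be sketched in plain Python. This is only an illustration of the control flow, not part of any Ascend tool: `StepBreakdown`, `demarcate`, `tune_until_resolved`, and the per-step time figures they consume are hypothetical names and inputs. In practice the breakdown would be read from the collected profile data, and picking the category with the largest time share is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepBreakdown:
    """Hypothetical per-step time breakdown, in milliseconds."""
    computation_ms: float
    communication_ms: float  # communication time not hidden behind computation
    scheduling_ms: float     # device idle time spent waiting for work

    @property
    def total_ms(self) -> float:
        return self.computation_ms + self.communication_ms + self.scheduling_ms


def demarcate(breakdown: StepBreakdown) -> str:
    """Return the category with the largest time share (a simplifying assumption)."""
    shares = {
        "computation": breakdown.computation_ms,
        "communication": breakdown.communication_ms,
        "scheduling": breakdown.scheduling_ms,
    }
    return max(shares, key=shares.get)


def tune_until_resolved(
    profile_step: Callable[[], StepBreakdown],
    tunings: Dict[str, Callable[[], None]],
    target_ms: float,
    max_rounds: int = 5,
) -> List[StepBreakdown]:
    """Profile, demarcate, tune, then re-profile until the target step time is met.

    Returns every breakdown collected, so each round can be compared
    with the previous one.
    """
    history = [profile_step()]
    for _ in range(max_rounds):
        if history[-1].total_ms <= target_ms:
            break  # performance issue resolved
        category = demarcate(history[-1])
        tunings[category]()             # apply the tuning for that category
        history.append(profile_step())  # re-run and collect profile data again
    return history
```

Here `profile_step` stands for re-running the training and collecting profile data, and each entry in `tunings` stands for one targeted tuning method. Keeping the full history makes it easy to confirm that a tuning round actually reduced the step time before moving on.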