Overall Guideline

MindIE inference performance can be optimized from two perspectives: large language model (LLM) inference and serving.

Start by checking whether the current LLM inference performance can be optimized:

  1. If your scenario is covered by the version baseline, compare your performance with the baseline and check the configuration.
  2. If your scenario is not covered by the version baseline, or the problem persists after you check the configuration, run an LLM inference test with the same input and output.
  3. If the LLM inference test result does not meet expectations, optimize the LLM inference performance. If it does, optimize the serving performance.
  4. For serving optimization, locate the performance bottlenecks as shown in Figure 1.

Figure 1 Flowchart for locating serving performance bottlenecks
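
The decision flow in steps 1 to 3 can be sketched as a small function. This is only an illustration: `NextStep`, `choose_next_step`, and the three boolean inputs are hypothetical names, not part of MindIE. The `DONE` outcome is an assumption that no further tuning is needed once the configuration check resolves the problem.

```python
from enum import Enum


class NextStep(Enum):
    """Hypothetical outcomes of the triage procedure."""
    DONE = "configuration check resolved the problem"
    OPTIMIZE_LLM_INFERENCE = "optimize LLM inference performance"
    OPTIMIZE_SERVING = "locate serving bottlenecks and optimize serving"


def choose_next_step(
    covered_by_baseline: bool,
    resolved_by_config_check: bool,
    inference_test_meets_expectation: bool,
) -> NextStep:
    """Pick what to optimize next, following steps 1 to 3."""
    # Step 1: the baseline comparison and configuration check only
    # apply when the scenario is covered by the version baseline.
    if covered_by_baseline and resolved_by_config_check:
        return NextStep.DONE  # assumption: nothing further to tune

    # Step 2: otherwise an LLM inference test is run with the same
    # input and output; its result is passed in as a boolean here.
    # Step 3: the test result decides which side to optimize.
    if not inference_test_meets_expectation:
        return NextStep.OPTIMIZE_LLM_INFERENCE
    return NextStep.OPTIMIZE_SERVING


if __name__ == "__main__":
    # Scenario not in the baseline, inference test looks fine:
    # the bottleneck is assumed to be on the serving side.
    step = choose_next_step(
        covered_by_baseline=False,
        resolved_by_config_check=False,
        inference_test_meets_expectation=True,
    )
    print(step.value)
```

Keeping the test result as an explicit input makes the ordering visible: serving optimization (step 4) is only reached after LLM inference has been ruled out as the bottleneck.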