LLM Inference Performance Tuning

For common models and application scenarios, using the same weights, image, and deployment policy should, in theory, reproduce similar performance.

Therefore, when a performance problem occurs, first check the environment variables and performance-related switches. These are the most likely cause of the problem (for example, cores are not bound, the kernel version is too old, or the log level is set incorrectly).
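
The three example checks above can be sketched as a small script. This is only an illustration under stated assumptions: the minimum kernel version (5.10), the log-level variable name (ASCEND_GLOBAL_LOG_LEVEL), and the expected level ("3") are placeholder values that are not given in this document, so replace them with the values required by your deployment. os.sched_getaffinity is available on Linux only.

```python
import os
import platform
import re


def parse_kernel_version(release):
    """Extract (major, minor) from a kernel release string such as '5.10.0-60.aarch64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment(env, kernel_release, affinity_size, cpu_count,
                      min_kernel=(5, 10),                       # assumed threshold
                      log_level_var="ASCEND_GLOBAL_LOG_LEVEL",  # assumed variable name
                      expected_log_level="3"):                  # assumed expected value
    """Return a list of warnings; an empty list means nothing was flagged."""
    warnings = []
    # A process that may run on every CPU is treated as not core-bound.
    if affinity_size >= cpu_count:
        warnings.append("process is not bound to a subset of cores")
    if parse_kernel_version(kernel_release) < min_kernel:
        warnings.append(f"kernel {kernel_release} is older than {min_kernel}")
    level = env.get(log_level_var)
    if level != expected_log_level:
        warnings.append(f"{log_level_var}={level!r}, expected {expected_log_level!r}")
    return warnings


def check_local_machine():
    """Run the checks against the current Linux host."""
    cpu_count = os.cpu_count() or 1
    affinity_size = len(os.sched_getaffinity(0))
    return check_environment(os.environ, platform.release(), affinity_size, cpu_count)
```

Call check_local_machine() inside the inference container and review any returned warnings before starting a deeper analysis.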

You can use the benchmark-mindie plug-in of AISBench to run the LLM inference performance test.

If the LLM inference test result shows degraded performance, perform a detailed profiling analysis. For details about the analysis method, see In-depth Analysis for Model Tuning (MindStudio Insight).

Preliminary LLM Inference Test

The LLM inference test sizes the test request batch from the --batch_size parameter entered by the user and takes the combination of request input length and output length from the --case_pair parameter. During inference, the Prefill of all requests is processed first, and then all requests are grouped into a single Decode Batch to complete the Decode inference. For test scenarios that are not covered by the baseline, especially scenarios with TTFT or TPOT latency requirements, run an LLM inference test as follows:

Set --case_pair to [[input length, 1]] to simulate an LLM inference test of a single Prefill Batch. Record the --batch_size value used in this run as prefill_batchsize, and record Total Time(s) in the test result as prefill_time.
Set --case_pair to the actual input and output lengths and run the test. Record the --batch_size value used in this run as decode_batchsize, and record Non-first token time(ms) in the test result as decode_token_time.
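
The two runs above can be tracked with a small record, sketched below. The helper and class names (prefill_case_pair, decode_case_pair, ProbeResult) are hypothetical and are not part of AISBench. The throughput estimate assumes that every request in the Decode Batch produces one token per decode step, as described for this test.

```python
from dataclasses import dataclass


def prefill_case_pair(input_len):
    """Case pair for the Prefill-only run: output length is fixed to 1."""
    return [[input_len, 1]]


def decode_case_pair(input_len, output_len):
    """Case pair for the run that uses the actual input and output lengths."""
    return [[input_len, output_len]]


@dataclass
class ProbeResult:
    prefill_batchsize: int       # --batch_size used in the Prefill-only run
    prefill_time_s: float        # "Total Time(s)" from the Prefill-only run
    decode_batchsize: int        # --batch_size used in the full run
    decode_token_time_ms: float  # "Non-first token time(ms)" from the full run

    def decode_throughput(self):
        """Estimated output tokens per second during Decode."""
        return self.decode_batchsize * 1000.0 / self.decode_token_time_ms
```

Keeping one ProbeResult per tested configuration makes it easier to compare runs when decode_batchsize and prefill_batchsize are adjusted later.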

Adjust decode_batchsize and prefill_batchsize to find a configuration that meets the requirements, as shown in Table 1.

**Table 1** Configuration
Scenario		Configuration Method
Baseline throughput without latency limit		Increase decode_batchsize until OOM occurs. (Alternatively, select the decode_batchsize at which throughput growth slows down.)
TPOT limited, but TTFT not limited		Adjust decode_batchsize to ensure that decode_token_time meets the TPOT limit.
TPOT and TTFT limited	Ramp-up test (with Request Rate limit)	Adjust decode_batchsize to ensure that decode_token_time meets the TPOT limit. Adjust prefill_batchsize to ensure that prefill_time is close to the TTFT limit.
TPOT and TTFT limited	Fixed concurrency (without Request Rate limit)	Adjust decode_batchsize to ensure that decode_token_time meets the TPOT limit. Set prefill_batchsize=dp, and decrease decode_batchsize until the average TTFT is close to the TTFT limit.
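
The decode_batchsize selection rules in Table 1 can be sketched as follows. This is an illustration, not a tool shipped with MindIE: the function names and the 5% gain threshold are assumptions, and the measurements dictionary (decode_batchsize mapped to decode_token_time in ms) must be filled with your own test results.

```python
def largest_batch_within_tpot(measurements, tpot_limit_ms):
    """Return the largest tested decode_batchsize whose decode_token_time
    meets the TPOT limit, or None if no tested size meets it."""
    within = [b for b, t in measurements.items() if t <= tpot_limit_ms]
    return max(within) if within else None


def throughput_knee(measurements, min_gain=0.05):
    """Return the tested decode_batchsize after which the estimated decode
    throughput grows by less than min_gain (relative) at the next larger size."""
    if not measurements:
        raise ValueError("no measurements given")
    sizes = sorted(measurements)
    throughput = [b * 1000.0 / measurements[b] for b in sizes]
    for i in range(len(sizes) - 1):
        if throughput[i + 1] < throughput[i] * (1 + min_gain):
            return sizes[i]
    # Throughput was still growing at the largest tested size.
    return sizes[-1]
```

largest_batch_within_tpot corresponds to the TPOT-limited rows, and throughput_knee corresponds to the alternative given for the baseline throughput row.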

To optimize the LLM inference performance further (for example, by changing the parallel policy or environment variables), repeat the preceding test steps for each configuration until the optimal configuration is found.
Convert decode_batchsize and prefill_batchsize of the LLM inference test to the serving parameters maxBatchSize and maxPrefillBatchSize.
Set the other serving and test parameters. Then, adjust the maximum concurrency, Request Rate, maxBatchSize, and maxPrefillBatchSize based on the actual test result.
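
The conversion to serving parameters can be sketched as below. The one-to-one mapping (decode_batchsize to maxBatchSize, prefill_batchsize to maxPrefillBatchSize) is an assumption used as a starting point, and to_serving_params is a hypothetical helper. Where these two keys are placed in the serving configuration file depends on your deployment and is not shown here.

```python
import json


def to_serving_params(decode_batchsize, prefill_batchsize):
    """Map the batch sizes found in the LLM inference test to serving parameters.
    Assumes a direct one-to-one mapping as the initial setting."""
    for name, value in (("decode_batchsize", decode_batchsize),
                        ("prefill_batchsize", prefill_batchsize)):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return {
        "maxBatchSize": decode_batchsize,
        "maxPrefillBatchSize": prefill_batchsize,
    }


def to_json_fragment(decode_batchsize, prefill_batchsize):
    """Render the two parameters as a JSON fragment for manual copying."""
    return json.dumps(to_serving_params(decode_batchsize, prefill_batchsize), indent=2)
```

Treat the result as an initial setting and refine it during the serving test described in the last step.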
