Tuning Based on the Objective
After clarifying the optimization objective, adjust the serving parameters to improve the serving performance.
Core Parameters
Table 1 lists the core parameters. For details, see the MindIE LLM Development Guide.
| Optimization Direction | Key Parameter | Recommended Value/Policy |
|---|---|---|
| Low latency | maxPrefillBatchSize | Small batch size (4 to 16), which reduces the computation needed for the first token. |
| | supportSelectBatch | false: Prefill is forcibly scheduled first. |
| | maxQueueDelayMicroseconds | ≤ 50 ms (50,000 µs), which reduces the queuing delay. |
| High throughput | maxBatchSize (in the Decode phase) | As large as possible (limited by device memory). |
| | maxPrefillBatchSize/maxPrefillTokens | Increase the values based on the actual average input length so that maxPrefillTokens is around 10,000. |
| | supportSelectBatch | Enable throughput-first scheduling. |
| | maxQueueDelayMicroseconds | Increase the waiting time so that a large batch can form at the beginning of the test. |
| | RequestRate (set by using a test tool) | Increase the request delivery frequency up to the hardware limit. |
| | Concurrency (set by using a test tool) | Gradually increase the concurrency until the throughput reaches the saturation point. |
| | Sequence length parameters (maxInputTokenLen, maxIterTimes, maxSeqLen) | Set all three based on the actual scenarios and requirements. |
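The recommendations in Table 1 can be captured as two small parameter profiles. The following is a minimal sketch, not an official configuration: it assumes the parameters are held in a flat Python dict (the real configuration file may nest them differently), it picks 8 as one value inside the 4 to 16 range, and it assumes that "enable throughput-first scheduling" corresponds to supportSelectBatch set to true. The helper names `apply_profile` and `check_low_latency` are hypothetical.

```python
# Illustrative profiles derived from Table 1.
# Assumption: a flat dict stands in for the real (possibly nested) config file.
LOW_LATENCY = {
    "maxPrefillBatchSize": 8,             # small batch, within the 4 to 16 range
    "supportSelectBatch": False,          # Prefill is forcibly scheduled first
    "maxQueueDelayMicroseconds": 50_000,  # 50 ms expressed in microseconds
}

HIGH_THROUGHPUT = {
    "supportSelectBatch": True,           # assumed meaning of "throughput-first"
    "maxPrefillTokens": 10_000,           # "around 10,000" per Table 1
}


def apply_profile(schedule_config: dict, profile: dict) -> dict:
    """Return a copy of schedule_config with the profile values applied."""
    merged = dict(schedule_config)
    merged.update(profile)
    return merged


def check_low_latency(config: dict) -> list[str]:
    """List the Table 1 low-latency recommendations that config violates."""
    problems = []
    if not 4 <= config.get("maxPrefillBatchSize", 0) <= 16:
        problems.append("maxPrefillBatchSize outside 4..16")
    if config.get("supportSelectBatch", False):
        problems.append("supportSelectBatch should be false")
    if config.get("maxQueueDelayMicroseconds", 0) > 50_000:
        problems.append("maxQueueDelayMicroseconds above 50 ms")
    return problems
```

For example, `check_low_latency(apply_profile(current, LOW_LATENCY))` returns an empty list, because the profile overrides every value that the check inspects. maxBatchSize is deliberately left out of the high-throughput profile because its upper limit depends on the memory available on the device.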
Manual Tuning
Tune the parameters that affect whichever metric, such as latency or throughput, does not meet the objective.
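When throughput is the metric that falls short, Table 1 recommends raising the concurrency step by step until the throughput saturates. One way to locate that point from benchmark results is sketched below. The function name `find_saturation`, the 5% gain threshold, and the sample numbers are all assumptions made for illustration; they are not produced by any MindIE tool.

```python
def find_saturation(samples, min_gain=0.05):
    """Return the first concurrency after which throughput gains fall below min_gain.

    samples: list of (concurrency, throughput) pairs, sorted by concurrency.
    min_gain: smallest relative throughput increase still counted as progress.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    # Compare each measurement with the next one in the sweep.
    for (conc, tput), (_, next_tput) in zip(samples, samples[1:]):
        if tput > 0 and (next_tput - tput) / tput < min_gain:
            return conc
    # No plateau observed: the sweep should continue at higher concurrency.
    return samples[-1][0]


# Hypothetical sweep results: (concurrency, output tokens per second).
sweep = [(8, 400.0), (16, 760.0), (32, 1300.0), (64, 1340.0), (128, 1330.0)]
print(find_saturation(sweep))  # 32
```

In this made-up sweep, going from 32 to 64 concurrent requests raises throughput by only about 3%, so 32 is reported as the saturation point. If the function returns the last concurrency in the list, no plateau was found and the sweep has not yet reached saturation.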
Before tuning, you can determine the theoretical upper limit and optimal configuration of each parameter, for example, by calculating the optimal value of maxBatchSize. For details about the calculation method, see Evaluating the Upper Limit of the Serving Performance.
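As a rough illustration of what such an upper limit looks like, the sketch below bounds the Decode batch size by the memory left for the KV cache. This is a common back-of-the-envelope estimate, not necessarily the method described in the referenced section. It assumes a standard transformer KV cache layout, ignores tensor parallelism and memory fragmentation, and uses hypothetical model dimensions and function names.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes needed per token (dtype_bytes=2 corresponds to FP16)."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_max_batch_size(free_memory_bytes, tokens_per_request, bytes_per_token):
    """Upper bound on concurrent requests whose KV cache fits in free memory."""
    if tokens_per_request <= 0 or bytes_per_token <= 0:
        raise ValueError("tokens_per_request and bytes_per_token must be positive")
    return free_memory_bytes // (tokens_per_request * bytes_per_token)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
# Hypothetical budget: 20 GiB left after weights, 2048 tokens per request.
limit = estimate_max_batch_size(20 * 1024**3, 2048, per_token)
print(per_token, limit)  # 131072 80
```

With these assumed numbers, each token needs 128 KiB of cache, each 2048-token request needs 256 MiB, and about 80 requests fit in 20 GiB. The result is only a starting point for maxBatchSize; the measured optimum can be lower.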
Tool-based Tuning
Manual tuning requires some basic knowledge of serving. To simplify tuning, you can use the expert suggestion function of the serving tuning tool (msServiceProfiler). For details, see Introduction to msserviceprofiler. Before using the expert suggestion function, test the serving performance with MindIE Benchmark, because the function generates its tuning suggestions from the MindIE Benchmark test result file.
Troubleshooting
The tuning described above is performed in black-box mode: you cannot see how requests are actually scheduled, for example, which requests form a batch, how large each batch is, and when Prefill and Decode are executed.
Some problems cannot be solved by black-box tuning and need further analysis. In such cases, use msServiceProfiler to analyze them, in a way similar to optimizing LLM inference performance.