Tuning Based on the Objective

Once the optimization objective is clear, adjust the serving parameters to improve performance toward that objective.

Core Parameters

Table 1 lists the core parameters. For details, see the MindIE LLM Development Guide.

Table 1 Core parameters

| Optimization Direction | Key Parameter | Recommended Value/Policy |
| --- | --- | --- |
| Low latency | maxPrefillBatchSize | Use a small batch size (4 to 16) to reduce the computation needed for the first token. |
| | supportSelectBatch | false: Prefill is always scheduled first. |
| | maxQueueDelayMicroseconds | ≤ 50 ms (50,000 µs), to reduce the queuing delay. |
| High throughput | maxBatchSize (in the Decode phase) | Maximize it (limited by the device memory). |
| | maxPrefillBatchSize/maxPrefillTokens | Increase the values based on the actual average input length so that maxPrefillTokens is around 10,000. |
| | supportSelectBatch | Enable it to use throughput-first scheduling. |
| | maxQueueDelayMicroseconds | Increase the waiting time so that a large batch can form at the beginning of the test. |
| | RequestRate (set by using a test tool) | Increase the request sending rate up to the hardware limit. |
| | Concurrency (set by using a test tool) | Gradually increase the concurrency until the throughput reaches the saturation point. |
| Sequence length (three related parameters) | maxInputTokenLen, maxIterTimes, maxSeqLen | Set all three based on the actual scenario and requirements. |
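As a rough illustration of how the Table 1 recommendations might be applied, the following Python sketch keeps each optimization direction as a small set of overrides and merges it into an existing parameter set. The flat dictionary layout, the names LOW_LATENCY, HIGH_THROUGHPUT, apply_profile, and check_low_latency, and the specific value 8 for maxPrefillBatchSize are assumptions made for this example. The layout of the real MindIE configuration file is not reproduced here, and treating "enable" as supportSelectBatch = true is also an assumption.

```python
from copy import deepcopy

# Hypothetical flat view of the scheduling parameters named in Table 1.
# The real configuration file layout is not shown in this section.
LOW_LATENCY = {
    "maxPrefillBatchSize": 8,             # small batch, inside the 4-16 range
    "supportSelectBatch": False,          # Prefill is scheduled first
    "maxQueueDelayMicroseconds": 50_000,  # 50 ms expressed in microseconds
}

HIGH_THROUGHPUT = {
    "maxPrefillTokens": 10_000,           # around 10,000, per Table 1
    "supportSelectBatch": True,           # assumption: "enable" means true
}


def apply_profile(base: dict, profile: dict) -> dict:
    """Return a copy of base with the profile's values overriding it."""
    tuned = deepcopy(base)
    tuned.update(profile)
    return tuned


def check_low_latency(config: dict) -> list[str]:
    """List the Table 1 low-latency recommendations that config violates."""
    problems = []
    if not 4 <= config.get("maxPrefillBatchSize", 0) <= 16:
        problems.append("maxPrefillBatchSize outside 4-16")
    if config.get("supportSelectBatch", False):
        problems.append("supportSelectBatch should be false")
    if config.get("maxQueueDelayMicroseconds", 0) > 50_000:
        problems.append("maxQueueDelayMicroseconds above 50 ms")
    return problems
```

Keeping the overrides separate from the base parameters makes it easy to switch between the two directions and to check a deployed parameter set against the recommendations before running a benchmark.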

Manual Tuning

Tune the parameters that govern whichever metric does not meet its objective, for example, latency or throughput.
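When throughput is the metric that falls short, Table 1 suggests raising the concurrency step by step until the throughput saturates. The sketch below shows one way such a sweep could be automated. The function find_saturation, the callback measure_throughput (which would run one benchmark at a given concurrency and return the measured throughput), the doubling step, and the 5% gain threshold are all assumptions made for this example, not part of any MindIE tool.

```python
from typing import Callable


def find_saturation(
    measure_throughput: Callable[[int], float],
    start: int = 1,
    max_concurrency: int = 1024,
    min_gain: float = 0.05,
) -> int:
    """Double the concurrency until throughput gains fall below min_gain.

    measure_throughput is a hypothetical callback that runs one benchmark
    at the given concurrency and returns the throughput (e.g. tokens/s).
    Returns the last concurrency that still gave a meaningful gain.
    """
    if start < 1:
        raise ValueError("start must be at least 1")
    best_concurrency = start
    best_throughput = measure_throughput(start)
    concurrency = start * 2
    while concurrency <= max_concurrency:
        throughput = measure_throughput(concurrency)
        # Stop once the relative improvement is smaller than min_gain.
        if throughput < best_throughput * (1 + min_gain):
            break
        best_concurrency, best_throughput = concurrency, throughput
        concurrency *= 2
    return best_concurrency
```

Doubling keeps the number of benchmark runs small. A finer linear sweep between the last two concurrency levels can then narrow down the saturation point if needed.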

Before tuning, you can determine the theoretical upper limit and the optimal configuration of each parameter, for example, by calculating the optimal value of maxBatchSize. For details about the calculation method, see Evaluating the Upper Limit of the Serving Performance.

Tool-based Tuning

Manual tuning requires some basic knowledge of serving. To make tuning easier, you can use the expert suggestion function of the serving tuning tool (msServiceProfiler). For details, see Introduction to msserviceprofiler. Before using the expert suggestion function, test the serving performance with MindIE Benchmark, because the function generates its tuning suggestions from the MindIE Benchmark test result file.

Troubleshooting

The preceding serving tuning is performed in black-box mode: you do not see how requests are actually scheduled, for example, which requests form a batch, how large each batch is, and when Prefill and Decode are executed.

Some problems cannot be solved by black-box tuning and need further analysis. In such cases, use msServiceProfiler to analyze them, in a way similar to optimizing LLM inference performance.