Optimizing the Serving Parameters

Symptom

Qwen3-32B is deployed as a service on a single server with four cards. In this application scenario, concurrency is low and requests arrive slowly: the maximum number of concurrent requests is 40, and the actual number of concurrent requests is close to the request rate of 9.4. With the default serving configuration for Prefill-Decode hybrid deployment, performance was tested with an input length of 128 and an output length of 100. The average non-first token latency is 48 ms, but the P99 of the non-first token latency (TPOT SLO P99) is 157 ms. As a result, the streaming output of some requests stalls, and the requirement of 50 ms latency for non-first tokens is not met.
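
The gap between a 48 ms average and a 157 ms P99 is typical of a distribution in which a small share of tokens stalls. The following minimal sketch uses synthetic samples (not the measured data from this test) and a hypothetical helper `tpot_summary` to show how a mean below the 50 ms target can coexist with a P99 far above it.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def tpot_summary(samples_ms, slo_ms):
    """Summarize non-first token latencies (ms) against a latency target."""
    return {
        "mean_ms": sum(samples_ms) / len(samples_ms),
        "p75_ms": percentile(samples_ms, 75),
        "p99_ms": percentile(samples_ms, 99),
        "share_over_slo": sum(s > slo_ms for s in samples_ms) / len(samples_ms),
    }


# Synthetic data (an assumption for illustration): 95 normal tokens and
# 5 tokens that stalled, for example while waiting behind a Prefill.
samples = [42.0] * 95 + [160.0] * 5
summary = tpot_summary(samples, slo_ms=50.0)
print(summary)
```

In this synthetic case the mean and P75 stay under 50 ms while the P99 does not, which is why the tail percentiles, not the average, should be checked against the latency requirement.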

Solution

  1. Run a preliminary performance test with the default serving configuration. The result is shown in Figure 1 and leads to three observations: (1) With the default Prefill-priority scheduling, the maximum first token latency is only 158 ms. This indicates that Prefill performance is normal and has not reached a computing power bottleneck. (2) The P75 non-first token latency is less than 50 ms, so the latency of most non-first tokens is normal, indicating that the actual Decode performance meets expectations. (3) The P90, P99, and maximum non-first token latencies are high. The likely cause is that requests arrive slowly, so the Prefill of each new request interrupts the Decode of requests that are already in inference. As a result, some non-first tokens must wait for the Prefill of new requests, and their latency increases sharply.
    Figure 1 Preliminary performance test result
  2. In this scenario, enable supportSelectBatch and set prefillTimeMsPerReq and decodeTimeMsPerReq to adjust the relative priorities of Prefill and Decode, so that the scheduler lets Decode take precedence in certain cases. This policy reduces the number of times that a new request interrupts the Decode of requests already in inference, increases the proportion of continuous Decode, and shortens the non-first token latency, as shown in Figure 2. Although this policy makes some requests wait for Prefill, the model has sufficient Prefill performance headroom in this scenario. Even if the first token latency increases slightly, it can be kept within an acceptable range by adjusting the priority parameters.
    Figure 2 Principle of non-first token latency deterioration
  3. Adjust the priority parameters. A larger value of prefillTimeMsPerReq and a smaller value of decodeTimeMsPerReq give Decode a higher priority. Conversely, a smaller value of prefillTimeMsPerReq and a larger value of decodeTimeMsPerReq give Prefill a higher priority. Set supportSelectBatch to true, prefillTimeMsPerReq to 1000, and decodeTimeMsPerReq to 1. The performance test result is shown in Figure 3. With these settings, the first token latency increases to 1562 ms, the average non-first token latency decreases to 31.7 ms, and the P99 non-first token latency decreases to 36 ms. The overall streaming output is smooth and meets user requirements.
    Figure 3 Performance test result after optimization
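
The settings from step 3 can be sketched as a small script that writes the three parameters into a serving configuration. The parameter names and values come from the steps above. The helper `apply_decode_priority` is hypothetical, and the section path `BackendConfig` / `ScheduleConfig` is an assumption about where these keys live; check the configuration file of your own deployment before applying anything like this.

```python
import json


def apply_decode_priority(config,
                          section_path=("BackendConfig", "ScheduleConfig"),
                          prefill_ms=1000,
                          decode_ms=1):
    """Set the Decode-priority scheduling parameters in a config dict.

    section_path is an ASSUMED location for the scheduling parameters;
    adjust it to match the actual layout of your configuration file.
    """
    section = config
    for key in section_path:
        # Create missing intermediate sections so the sketch runs on an empty dict.
        section = section.setdefault(key, {})
    section["supportSelectBatch"] = True
    # Larger prefillTimeMsPerReq and smaller decodeTimeMsPerReq favor Decode.
    section["prefillTimeMsPerReq"] = prefill_ms
    section["decodeTimeMsPerReq"] = decode_ms
    return config


# In practice the dict would be loaded from the service configuration file;
# an empty dict is used here to keep the sketch self-contained.
config = apply_decode_priority({})
print(json.dumps(config, indent=2))
```

Because a higher Decode priority trades first token latency for non-first token latency, it is reasonable to rerun the performance test after each change and confirm that both metrics remain within their acceptable ranges.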