Buffer Response
This feature is designed for scenarios that demand high inference throughput and low latency from LLMs, where meeting SLO latency targets is critical.
Mainstream LLM inference systems, such as vLLM and TGI, schedule prefill and decode requests independently, sharing computing resources in a time-multiplexed manner. The scheduling policy, specifically whether to prioritize prefill or decode requests, directly affects throughput and latency. However, in prefill-decode hybrid deployments, mutual interference between the prefill and decode phases may cause latency fluctuations, making it difficult to meet the SLO. Therefore, stricter scheduling policies and latency control are required.
This feature monitors SLO latency and strategically delays responses to prevent time to first token (TTFT) and time per output token (TPOT) timeouts. Based on the expected SLO latency configured for the prefill and decode phases, it balances the latency of the two phases and maximizes the benefit without causing timeouts.
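The idea of balancing the two phases against their latency targets can be illustrated with a small slack calculation. This is only a conceptual sketch, not MindIE's scheduler: the `Pending` class, `slack_ms`, and `next_phase` are hypothetical names, and the sketch assumes that the expected times are expressed in milliseconds.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    """Hypothetical record of a request waiting for its next response."""
    phase: str        # "prefill" or "decode"
    waited_ms: float  # time already spent waiting in this phase (assumed ms)


def slack_ms(p: Pending, prefill_expected_ms: float, decode_expected_ms: float) -> float:
    """Remaining time before the request exceeds the target for its phase."""
    target = prefill_expected_ms if p.phase == "prefill" else decode_expected_ms
    return target - p.waited_ms


def next_phase(pending, prefill_expected_ms=1500, decode_expected_ms=50):
    """Return the phase of the request that is closest to its deadline."""
    most_urgent = min(
        pending,
        key=lambda p: slack_ms(p, prefill_expected_ms, decode_expected_ms),
    )
    return most_urgent.phase
```

In this sketch, a decode request that has waited 45 ms against a 50 ms target is served before a prefill request that has waited 200 ms against a 1500 ms target, because it has less slack left.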
Constraints
- This feature is supported by the Atlas 800I A2 inference server.
- Only Qwen2 models support this feature.
Parameters
Table 1 describes the parameters required for enabling buffer response.
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| bufferResponseEnabled | Bool | | Whether to enable the buffer response feature. Optional. The default value is false. |
| prefillExpectedTime | uint32_t | ≥ 1 | Expected SLO latency during token generation in the prefill phase. Optional. The default value is 1500. Recommended value: Set this parameter based on the customer's SLO latency restrictions. |
| decodeExpectedTime | uint32_t | ≥ 1 | Expected SLO latency during token generation in the decode phase. Optional. The default value is 50. Recommended value: Set this parameter based on the customer's SLO latency restrictions. |
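The constraints in Table 1 can be checked before the service is started. The following is a minimal sketch that applies the documented defaults and value ranges. The function name `resolve_buffer_response` is hypothetical and not part of MindIE, and the upper bound is assumed from the uint32_t type.

```python
UINT32_MAX = 2**32 - 1  # assumed upper bound, derived from the uint32_t type

# Defaults as listed in Table 1.
DEFAULTS = {
    "bufferResponseEnabled": False,
    "prefillExpectedTime": 1500,
    "decodeExpectedTime": 50,
}


def resolve_buffer_response(cfg: dict) -> dict:
    """Fill in the defaults and validate the three buffer response fields."""
    merged = {**DEFAULTS, **{k: v for k, v in cfg.items() if k in DEFAULTS}}

    if not isinstance(merged["bufferResponseEnabled"], bool):
        raise ValueError("bufferResponseEnabled must be true or false")

    for key in ("prefillExpectedTime", "decodeExpectedTime"):
        value = merged[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{key} must be an integer")
        if not 1 <= value <= UINT32_MAX:
            raise ValueError(f"{key} must be in the range 1 to {UINT32_MAX}")

    return merged
```

For example, `resolve_buffer_response({"bufferResponseEnabled": True, "prefillExpectedTime": 1000})` keeps the two supplied values and fills in the default of 50 for decodeExpectedTime.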
Running Inference
- Open the config.json file of the Server.
cd {MindIE installation directory}/latest/mindie-service/
vi conf/config.json
- Set serving parameters. Add the bufferResponseEnabled, prefillExpectedTime, and decodeExpectedTime fields to the config.json file of the Server. For details about the fields, see Table 1. For details about the serving parameters, see Configuration Parameters (Service-Specific). The following is a parameter configuration example:
"bufferResponseEnabled" : true,
"prefillExpectedTime" : 1000,
"decodeExpectedTime" : 50
- Start the service.
./bin/mindieservice_daemon
- Use the AISBench tool to test the performance. For details, see "Performance Test" in MindIE Motor Development Guide.
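Once per-token timestamps are available from a test run, the measured latencies can be compared with the configured targets. The following sketch does not use AISBench or its output format. The functions `ttft_and_tpot` and `meets_slo` are hypothetical helpers, and the sketch assumes that timestamps and expected times are in milliseconds, that prefillExpectedTime corresponds to the TTFT target, and that decodeExpectedTime corresponds to the TPOT target.

```python
def ttft_and_tpot(request_sent_ms, token_times_ms):
    """Compute TTFT and the mean TPOT from the arrival time of each token."""
    if not token_times_ms:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times_ms[0] - request_sent_ms
    n = len(token_times_ms)
    # TPOT is averaged over the tokens that follow the first one.
    tpot = (token_times_ms[-1] - token_times_ms[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


def meets_slo(request_sent_ms, token_times_ms,
              prefill_expected_ms=1000, decode_expected_ms=50):
    """Return True if both measured latencies stay within the targets."""
    ttft, tpot = ttft_and_tpot(request_sent_ms, token_times_ms)
    return ttft <= prefill_expected_ms and tpot <= decode_expected_ms
```

With the example configuration above (1000 and 50), a request whose first token arrives after 900 ms and whose later tokens arrive 40 ms apart satisfies both targets in this sketch.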