Buffer Response

This feature is designed for scenarios that demand high inference throughput and low latency from LLMs, where meeting SLO latency targets is critical.

Mainstream LLM inference systems, such as vLLM and TGI, schedule prefill and decode requests independently, sharing computing resources in a time-multiplexed manner. The scheduling policy, specifically whether to prioritize prefill or decode requests, directly affects throughput and latency. However, in prefill-decode hybrid deployments, mutual interference between the prefill and decode phases may cause latency fluctuations, making it difficult to meet the SLO. Therefore, stricter scheduling policies and latency control are required.

This feature monitors SLO latency and strategically delays responses to prevent time to first token (TTFT) and time per output token (TPOT) timeouts. Once the expected SLO latency is configured for both the prefill and decode phases, the feature balances latency between the two phases and maximizes throughput without exceeding either timeout.
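One way to picture the balancing described above is in terms of slack: the time each pending response has left before it exceeds its expected latency. The following is a minimal sketch of that idea only. It is not MindIE's scheduling algorithm, and `PendingResponse`, `slack`, and `most_urgent_first` are hypothetical names introduced for illustration. The only assumption about units is that elapsed time is measured in the same unit as the expected latencies.

```python
from dataclasses import dataclass


@dataclass
class PendingResponse:
    """Hypothetical record of a response waiting to be produced."""
    request_id: str
    phase: str    # "prefill" or "decode"
    elapsed: int  # time already spent, in the same unit as the expected latencies


def slack(resp: PendingResponse, prefill_expected: int, decode_expected: int) -> int:
    """Time left before the response exceeds the expected latency of its phase."""
    budget = prefill_expected if resp.phase == "prefill" else decode_expected
    return budget - resp.elapsed


def most_urgent_first(pending, prefill_expected: int, decode_expected: int):
    """Order responses so that the one closest to its timeout comes first."""
    return sorted(pending, key=lambda r: slack(r, prefill_expected, decode_expected))


pending = [
    PendingResponse("a", "prefill", 400),
    PendingResponse("b", "decode", 45),
    PendingResponse("c", "decode", 10),
]
ordered = most_urgent_first(pending, prefill_expected=1000, decode_expected=50)
print([r.request_id for r in ordered])
```

In this sketch a response with plenty of slack can be held back while work with little slack goes first, which is the intuition behind delaying responses to keep both phases within their targets.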

Constraints

This feature is supported by the Atlas 800I A2 inference server.
Only Qwen2 models support this feature.

Parameters

Table 1 describes the parameters required for enabling buffer response.

**Table 1** Supplementary parameters of buffer response: ScheduleConfig

| Parameter | Value Type | Value Range | Description |
| --- | --- | --- | --- |
| bufferResponseEnabled | Bool | true, false | Whether to enable the buffer response feature. true: enables buffer response. false: disables buffer response. Optional. The default value is false. |
| prefillExpectedTime | uint32_t | ≥ 1 | Expected SLO latency during token generation in the prefill phase. Optional. The default value is 1500. Recommended value: set this parameter based on the customer's SLO latency restrictions. |
| decodeExpectedTime | uint32_t | ≥ 1 | Expected SLO latency during token generation in the decode phase. Optional. The default value is 50. Recommended value: set this parameter based on the customer's SLO latency restrictions. |
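The types, ranges, and defaults in Table 1 can be checked before the service is started. The following is a minimal sketch, assuming the three fields have already been read into a Python dict. `validate_buffer_response` is a hypothetical helper, not part of MindIE, and the upper bound is simply the largest value a uint32_t can hold.

```python
UINT32_MAX = 2**32 - 1

# Defaults as listed in Table 1.
DEFAULTS = {
    "bufferResponseEnabled": False,
    "prefillExpectedTime": 1500,
    "decodeExpectedTime": 50,
}


def validate_buffer_response(schedule_config: dict) -> dict:
    """Return the three buffer response fields with defaults applied.

    Raises ValueError if a field has the wrong type or is out of range.
    """
    merged = dict(DEFAULTS)
    merged.update({k: v for k, v in schedule_config.items() if k in DEFAULTS})

    if not isinstance(merged["bufferResponseEnabled"], bool):
        raise ValueError("bufferResponseEnabled must be true or false")

    for name in ("prefillExpectedTime", "decodeExpectedTime"):
        value = merged[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if not 1 <= value <= UINT32_MAX:
            raise ValueError(f"{name} must be in the range 1 to {UINT32_MAX}")

    return merged


print(validate_buffer_response({"bufferResponseEnabled": True, "prefillExpectedTime": 1000}))
```

Fields that are omitted fall back to the defaults from Table 1, so an empty dict validates and leaves the feature disabled.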

Running Inference

Open the config.json file of the Server.

```
cd {MindIE installation directory}/latest/mindie-service/
vi conf/config.json
```

Set the serving parameters. Add the bufferResponseEnabled, prefillExpectedTime, and decodeExpectedTime fields to ScheduleConfig in the config.json file of the Server. For details about the fields, see Table 1. For details about the other serving parameters, see Configuration Parameters (Service-Specific). The following is a parameter configuration example:
```
"bufferResponseEnabled" : true,
"prefillExpectedTime" : 1000,
"decodeExpectedTime" : 50
```
Start the service.
```
./bin/mindieservice_daemon
```
Use the AISBench tool to test the performance. For details, see "Performance Test" in MindIE Motor Development Guide.
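The manual edit of config.json in the procedure above can also be scripted. The following is a minimal sketch, assuming the file is plain JSON and contains an object named ScheduleConfig somewhere in its structure. The exact nesting is not assumed, so the hypothetical helper `find_section` searches for it. Neither function is part of MindIE.

```python
import json
from pathlib import Path


def find_section(node, key):
    """Depth-first search for the first JSON object stored under `key`."""
    if isinstance(node, dict):
        if isinstance(node.get(key), dict):
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_section(child, key)
        if found is not None:
            return found
    return None


def enable_buffer_response(config_path, prefill_expected=1000, decode_expected=50):
    """Add the three buffer response fields to ScheduleConfig and save the file."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    section = find_section(config, "ScheduleConfig")
    if section is None:
        raise KeyError(f"no ScheduleConfig object found in {path}")

    section.update(
        {
            "bufferResponseEnabled": True,
            "prefillExpectedTime": prefill_expected,
            "decodeExpectedTime": decode_expected,
        }
    )
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return section


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        # Usage: python this_script.py conf/config.json
        print(enable_buffer_response(sys.argv[1]))
```

The default values mirror the configuration example above. Because the script rewrites the whole file, keeping a backup copy of config.json before running it is a sensible precaution.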
