SLO Scheduling Optimization

A service level objective (SLO) defines a target value for a specific metric over a given period of time. To handle high-concurrency client requests and improve system throughput while still meeting SLO requirements, the following two approaches are provided:

1. Prefill/Decode phase selection algorithm based on TTFT/TPOT latency prediction and the Least Laxity First (LLF) algorithm

This algorithm collects time to first token (TTFT) and time per output token (TPOT) latency data and fits a model to that data to predict the execution time of each prefill or decode phase. It then uses the LLF algorithm to decide whether the next batch runs prefill or decode. It is suitable for scenarios with strict requirements on both TTFT and TPOT, and enables higher throughput under high-concurrency workloads while meeting SLO requirements.
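The idea of least-laxity selection can be illustrated with a minimal sketch. It is not the MindIE implementation: the linear latency model, the `StageCandidate` type, and all function names are hypothetical, and times are assumed to be in milliseconds. Laxity is taken to be the time left before the SLO deadline minus the predicted execution time, and the stage with the smallest laxity runs next.

```python
from dataclasses import dataclass


def fit_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept (assumed latency model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    return slope, mean_y - slope * mean_x


@dataclass
class StageCandidate:
    name: str            # "prefill" or "decode"
    deadline_ms: float   # absolute time by which the stage must finish
    predicted_ms: float  # predicted execution time from the fitted model


def laxity(candidate, now_ms):
    """Slack left if the stage started right now."""
    return candidate.deadline_ms - now_ms - candidate.predicted_ms


def select_stage(candidates, now_ms):
    """Least Laxity First: run the stage with the least slack."""
    return min(candidates, key=lambda c: laxity(c, now_ms)).name


# Hypothetical samples: (input tokens, measured prefill latency in ms).
slope, intercept = fit_linear([100, 200, 300], [110, 210, 310])
prefill = StageCandidate("prefill", deadline_ms=1500, predicted_ms=slope * 400 + intercept)
decode = StageCandidate("decode", deadline_ms=50, predicted_ms=30)
next_stage = select_stage([prefill, decode], now_ms=0)
```

In this example the decode stage has far less slack than the prefill stage, so it is the one selected.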

2. Dynamic batch size adjustment algorithm based on real-time TPOT awareness

This algorithm continuously observes the system TPOT latency and compares it with the decode latency target defined by the SLO. Depending on the comparison result, maxPrefillBatchSize and maxBatchSize are dynamically adjusted to prevent all requests from being loaded into on-chip memory, which could cause system congestion and degrade throughput. This algorithm is suitable for scenarios with strict requirements on TPOT, because under high-concurrency workloads it prioritizes responses to requests that are already loaded into on-chip memory. Because the collected TPOT data fluctuates in real time, the actual latency may deviate from the configured target by roughly 10%.
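The feedback loop described above can be sketched as follows. The adjustment rule shown here is an assumption for illustration only; the source does not specify the step size, the headroom factor, or the bounds, and `adjust_batch_size` is a hypothetical helper rather than a MindIE API.

```python
def adjust_batch_size(current, observed_tpot_ms, target_tpot_ms,
                      lower, upper, step=8, headroom=0.9):
    """Return a new batch size limit based on observed versus target TPOT.

    Shrinks the limit when TPOT exceeds the target, grows it when TPOT is
    comfortably below the target, and otherwise leaves it unchanged.
    The result is clamped to [lower, upper].
    """
    if observed_tpot_ms > target_tpot_ms:
        current -= step          # decode is too slow: admit fewer requests
    elif observed_tpot_ms < headroom * target_tpot_ms:
        current += step          # spare latency budget: admit more requests
    return max(lower, min(upper, current))


# Example with a 50 ms decode target (the default decodeExpectedTime value).
limit = 64
for observed in (60, 58, 44, 48):
    limit = adjust_batch_size(limit, observed, 50, lower=1, upper=200)
```

The headroom band between 90% and 100% of the target is one way to avoid oscillation when the measured TPOT fluctuates around the target.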

Constraints

This feature is supported only by the Atlas 800I A2 inference server.
Only Qwen models support this feature.
This feature is applicable only to the prefill-decode hybrid deployment scenario and cannot be enabled together with SplitFuse.
This feature provides significant benefits for short outputs (fewer than 256 tokens). As the output length increases, the throughput gain decreases.

Parameters

To enable SLO scheduling optimization, set the parameters described in Table 1.

**Table 1** Parameters for SLO scheduling optimization

| Parameter | Value Type | Value Range | Description |
| --- | --- | --- | --- |
| stageSelectPolicy | uint32_t | [0,2] | Prefill/Decode selection policy. 0: prioritizes prefill. 1: prioritizes throughput. 2: determines whether to execute prefill or decode based on TTFT/TPOT latency prediction and the LLF algorithm. Optional. The default value is 0. |
| dynamicBatchSizeEnable | Bool | true, false | Specifies whether to enable the dynamic batch size adjustment algorithm. Optional. The default value is false. |
| prefillExpectedTime | uint32_t | [0,10000] | Expected SLO latency during token generation in the prefill phase. Optional. The default value is 1500. |
| decodeExpectedTime | uint32_t | [0,10000] | Expected SLO latency during token generation in the decode phase. Optional. The default value is 50. |
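
Before editing the configuration file, it can help to check candidate values against the ranges in Table 1. The following is a minimal sketch of such a check. `check_slo_params` is a hypothetical helper and not part of MindIE; only the field names, ranges, and defaults are taken from Table 1.

```python
# Ranges and defaults copied from Table 1: (minimum, maximum, default).
UINT_FIELDS = {
    "stageSelectPolicy": (0, 2, 0),
    "prefillExpectedTime": (0, 10000, 1500),
    "decodeExpectedTime": (0, 10000, 50),
}
BOOL_FIELDS = {"dynamicBatchSizeEnable": False}


def check_slo_params(params):
    """Return (resolved, errors): values with defaults applied, plus any errors."""
    resolved, errors = {}, []
    for name, (low, high, default) in UINT_FIELDS.items():
        value = params.get(name, default)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{name}: expected an integer, got {value!r}")
        elif not low <= value <= high:
            errors.append(f"{name}: {value} is outside [{low}, {high}]")
        else:
            resolved[name] = value
    for name, default in BOOL_FIELDS.items():
        value = params.get(name, default)
        if isinstance(value, bool):
            resolved[name] = value
        else:
            errors.append(f"{name}: expected true or false, got {value!r}")
    return resolved, errors


resolved, errors = check_slo_params({"stageSelectPolicy": 2, "prefillExpectedTime": 1000})
```

All four parameters are optional, so omitted fields fall back to their defaults instead of being reported as errors.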

Running Inference

This section describes how to use the SLO scheduling optimization function.

Open the config.json file of the Server.

cd {MindIE installation directory}/latest/mindie-service/
vi conf/config.json

Set the serving parameters. Add the stageSelectPolicy, dynamicBatchSizeEnable, prefillExpectedTime, and decodeExpectedTime fields to the config.json file of the Server. For details about these fields, see Table 1. For details about the other serving parameters, see Configuration Parameters (Service-Specific). The following is a parameter configuration example:
```
"stageSelectPolicy" : 2,
"dynamicBatchSizeEnable" : true,
"prefillExpectedTime" : 1000,
"decodeExpectedTime" : 50
```
Start the service.
```
./bin/mindieservice_daemon
```

Start tuning. This example uses the AISBench tool and the GSM8K dataset, with the concurrency set to 500. The AISBench configuration is shown below. For details, see "Performance Test" in the MindIE Motor Development Guide.

from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="$ModelPath",
        model="$ModelName",
        request_rate = $1,
        retry = 2,
        host_ip = "{ipAddress}",
        host_port = "{port}",
        max_out_len = 64,
        batch_size= 500,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0,
            ignore_eos = True
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
