SplitFuse

SplitFuse divides a long prompt into smaller chunks and schedules those chunks across multiple forward steps. Tokens for that request are generated only after the last chunk has been processed. Short prompts are combined to fill the remaining capacity of each step. As a result, every step carries roughly the same computational workload, which makes the average latency across requests more stable.

Under MindIE's default prefill-decode hybrid deployment policy, prefill and decode requests are never placed in the same batch. With SplitFuse enabled, MindIE schedules decode requests first and, if the batch size is still below maxBatchSize, adds prefill requests to the same batch.

When the number of tokens in a forward pass exceeds splitChunkTokens, SplitFuse splits the request as follows:

In each inference round: , where:
In the prefill phase, tokens indicates the number of input tokens, and in the decode phase, each request has one token:

Two key behaviors:

- Long prompts are decomposed into smaller chunks and scheduled across multiple iterations. Tokens are generated only after the last iteration.
- Short prompts may also be divided into small chunks to maximize computing efficiency.
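
The two behaviors above can be illustrated with a small scheduling sketch. It is not MindIE's scheduler: the `Request` class, the `plan_step` function, and the rule that every forward step has a fixed token budget (`chunk_tokens`, standing in for splitChunkTokens) are assumptions made for illustration. Decode requests take one token each, and waiting prompts fill whatever budget remains.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int      # prompt tokens still waiting for prefill
    decoding: bool = False  # True once the whole prompt has been prefilled


def plan_step(decoding, waiting, chunk_tokens):
    """Plan one forward step as a list of (request name, token count) pairs.

    decoding: list of requests already in the decode phase (one token each).
    waiting: deque of requests whose prompts are not fully prefilled yet.
    chunk_tokens: hypothetical per-step token budget.
    """
    # Decode requests are scheduled first, one token per request.
    plan = [(req.name, 1) for req in decoding]
    budget = chunk_tokens - len(plan)
    # Fill the remaining budget with prompt chunks, splitting where needed.
    while waiting and budget > 0:
        req = waiting[0]
        take = min(req.prompt_tokens, budget)
        plan.append((req.name, take))
        req.prompt_tokens -= take
        budget -= take
        if req.prompt_tokens == 0:
            # Last chunk processed: the request can now generate tokens.
            req.decoding = True
            decoding.append(waiting.popleft())
    return plan


if __name__ == "__main__":
    waiting = deque([Request("long", 600), Request("short", 40)])
    decoding = []
    while waiting:
        print(plan_step(decoding, waiting, chunk_tokens=256))
```

In this sketch the 600-token prompt is spread over three steps, and the short prompt shares the third step with the long prompt's final chunk, so no step exceeds the budget.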

Advantages:

- Faster response: Long prompts are processed with lower latency, which improves user experience.
- Higher efficiency: Combining short prompts appropriately keeps the model running at high throughput.
- Better consistency: Uniformly sized forward passes reduce latency fluctuation and stabilize the generation frequency.

Constraints

- This feature is supported on the Atlas 800I A2 inference server.
- The Llama3.1-70B and Qwen2-72B floating-point models, Qwen3-32B, and Qwen3-30B support this feature.
- This feature supports only W8A8 quantization.
- This feature cannot be used with prefill-decode disaggregation, Multi-LoRA, function call, parallel decoding, multi-server inference, MTP, asynchronous scheduling, or long sequence.
- This feature supports the n, best_of, and use_beam_search postprocessing parameters.

Parameters

Table 1 and Table 2 list the parameters to be configured to enable the SplitFuse feature.

**Table 1** Supplementary parameters 1 of SplitFuse: ModelConfig in ModelDeployConfig

| Parameter | Value Type | Value Range | Description |
| --- | --- | --- | --- |
| plugin_params | std::string | "{\"plugin_type\":\"splitfuse\"}" | Set to "{\"plugin_type\":\"splitfuse\"}" to enable SplitFuse. If no plugin function is required, remove this field from the configuration. Restriction: this parameter must be set if enableSplit is enabled or templateType is set to Mix. It is optional when SplitFuse is disabled. |

**Table 2** Supplementary parameters 2 of SplitFuse: ScheduleConfig

| Parameter | Value Type | Value Range | Description |
| --- | --- | --- | --- |
| templateType | std::string | "Standard" or "Mix" | Mix: hybrid inference scenario, in which prefill and decode requests can be processed in the same batch. Standard: default value (required when the feature is disabled), in which prefill and decode requests are batched separately. |
| splitStartType | bool | true or false | true: resets the splitting status each time a batch is created and checks whether the splitStartBatchSize condition is met. false: does not reset the splitting status after the splitStartBatchSize condition is met for the first time. The default value is false. |
| splitStartBatchSize | uint32_t | [0, maxBatchSize] | Splitting starts when the batch size reaches this value. The default value is 16. |
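
The difference between the two splitStartType values can be sketched as a small state machine. This is one reading of the descriptions in Table 2, not MindIE's implementation: the `SplitGate` class and its `on_new_batch` method are hypothetical names, and the sketch assumes the condition is "batch size is at least splitStartBatchSize".

```python
class SplitGate:
    """Decides whether splitting is active for the batch being created."""

    def __init__(self, split_start_type: bool, split_start_batch_size: int):
        self.reset_each_batch = split_start_type
        self.threshold = split_start_batch_size
        self.active = False

    def on_new_batch(self, batch_size: int) -> bool:
        if self.reset_each_batch:
            # true: forget the previous state and re-check the condition.
            self.active = False
        if batch_size >= self.threshold:
            # false: once this is reached, the state is never reset.
            self.active = True
        return self.active


if __name__ == "__main__":
    for start_type in (False, True):
        gate = SplitGate(start_type, split_start_batch_size=16)
        print(start_type, [gate.on_new_batch(n) for n in (4, 16, 4)])
```

Under this reading, false behaves like a latch (splitting stays on after the first batch that reaches the threshold), while true re-evaluates the threshold for every batch.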

Running Inference

Open the config.json file of the Server.

cd {MindIE installation directory}/latest/mindie-service/
vi conf/config.json

Set serving parameters. Add the plugin_params, templateType, splitStartType, splitChunkTokens, and splitStartBatchSize parameters to the config.json file of the Server. For performance tuning, edit the ScheduleConfig section of config.json. Setting maxBatchSize and splitChunkTokens to the same value is recommended; adjust the two values together to keep decode latency within the SLO.

In the following configuration example, the SplitFuse settings are plugin_params in ModelConfig and templateType, splitStartType, splitChunkTokens, and splitStartBatchSize in ScheduleConfig. For details about the SplitFuse parameters, see Table 1 and Table 2. For details about the serving parameters, see Configuration Parameters (Service-Specific).

        "ModelDeployConfig":
        {
            "maxSeqLen" : 65536,
            "maxInputTokenLen" : 65536,
            "truncation" : false,
            "ModelConfig" : [
                {
                    "modelInstanceType": "Standard",
                    "modelName" : "llama3-70b",
                    "modelWeightPath" : "/home/models/llama3-70b/",
                    "worldSize" : 8,
                    "cpuMemSize" : 5,
                    "npuMemSize" : -1,
                    "backendType": "atb",
                    "plugin_params": "{\"plugin_type\":\"splitfuse\"}"
                }
            ]
        },
        "ScheduleConfig":
        {
            "templateType": "Mix",
            "templateName" : "Standard_LLM",
            "cacheBlockSize" : 128,

            "maxPrefillBatchSize" : 40,
            "maxPrefillTokens" : 65536,
            "prefillTimeMsPerReq" : 600,
            "prefillPolicyType" : 0,

            "decodeTimeMsPerReq" : 50,
            "decodePolicyType" : 0,         
            "maxBatchSize" : 256,
            "maxIterTimes" : 512,
            "maxPreemptCount" : 0,
            "supportSelectBatch" : false,
            "maxQueueDelayMicroseconds" : 5000,

            "splitStartType": false,
            "splitChunkTokens": 256,
            "splitStartBatchSize": 16
        }
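
As an alternative to editing the file by hand, the same settings can be applied with a short script. The following is a minimal sketch: `enable_splitfuse` is a hypothetical helper, and it assumes it is given the dict that directly contains the ModelDeployConfig and ScheduleConfig sections shown above. Where those sections sit inside config.json is not shown here, so locating that dict is left to the caller.

```python
import json


def enable_splitfuse(cfg, chunk_tokens=256, start_batch_size=16):
    """Apply the SplitFuse settings to cfg in place and return it.

    cfg is assumed to be the dict that directly holds the
    "ModelDeployConfig" and "ScheduleConfig" sections.
    """
    schedule = cfg["ScheduleConfig"]
    if not 0 <= start_batch_size <= schedule["maxBatchSize"]:
        raise ValueError("splitStartBatchSize must be in [0, maxBatchSize]")

    # plugin_params is a JSON document stored as a string, so serialize it
    # instead of hand-writing the escaped quotes.
    plugin = json.dumps({"plugin_type": "splitfuse"}, separators=(",", ":"))
    for model in cfg["ModelDeployConfig"]["ModelConfig"]:
        model["plugin_params"] = plugin

    schedule.update(
        templateType="Mix",
        splitStartType=False,
        splitChunkTokens=chunk_tokens,
        splitStartBatchSize=start_batch_size,
    )
    return cfg


if __name__ == "__main__":
    example = {
        "ModelDeployConfig": {"ModelConfig": [{"modelName": "llama3-70b"}]},
        "ScheduleConfig": {"templateType": "Standard", "maxBatchSize": 256},
    }
    print(json.dumps(enable_splitfuse(example), indent=4))
```

Serializing plugin_params with `json.dumps` produces the escaped form `"{\"plugin_type\":\"splitfuse\"}"` automatically when the outer document is written back to disk.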

Start the service.
```
./bin/mindieservice_daemon
```
Use the AISBench tool to test the performance. For details, see "Performance Test" in MindIE Motor Development Guide.
Adjust parameters based on the actual values of the TTFT and decode latency.
- If TTFT and decode latency (both average and P90) meet their thresholds, increase the value of RequestRate.
- If the average decode latency meets its threshold but the average TTFT does not, RequestRate exceeds the system throughput. In this case, decrease the value of RequestRate.
- If the average TTFT and average decode latency meet their thresholds but the P90 decode latency does not, reduce the chunk size. Note that this may lower the overall throughput.
- When input prompts vary in length, the prefill-decode hybrid deployment policy tends to produce more scheduling bubbles. SplitFuse is less affected by such bubbles and therefore performs better in this scenario.
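
The tuning rules above can be written down as a small decision function. This is only a sketch of the rules as stated: `tuning_advice` and its parameter names are hypothetical, all values are assumed to be in the same unit (for example milliseconds), and a single decode threshold is assumed to apply to both the average and the P90 decode latency.

```python
def tuning_advice(avg_ttft, avg_decode, p90_decode, ttft_limit, decode_limit):
    """Map measured latencies to the next tuning action."""
    ttft_ok = avg_ttft <= ttft_limit
    avg_decode_ok = avg_decode <= decode_limit
    p90_decode_ok = p90_decode <= decode_limit

    if ttft_ok and avg_decode_ok and p90_decode_ok:
        # Everything is within limits: the system has headroom.
        return "increase RequestRate"
    if avg_decode_ok and not ttft_ok:
        # Requests arrive faster than the system can start serving them.
        return "decrease RequestRate"
    if ttft_ok and avg_decode_ok and not p90_decode_ok:
        # Only the decode tail is too slow: smaller chunks, lower throughput.
        return "reduce chunk size"
    return "no rule applies"


if __name__ == "__main__":
    print(tuning_advice(avg_ttft=400, avg_decode=40, p90_decode=65,
                        ttft_limit=500, decode_limit=50))
```

The final branch covers combinations that the rules do not address, such as an average decode latency that already exceeds its threshold.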
