Parallel Decoding
In LLM inference, conventional auto-regressive decoding is inherently slow because it generates one token per step, which restricts concurrency. The inference phase is bound by memory bandwidth, so computing resources are often left underused. To address this imbalance, parallel decoding introduces speculative execution, an optimization technique commonly used in processor architectures, and uses the surplus computing resources to improve concurrency. However, enabling parallel decoding requires a trie-tree and a draft token map to be maintained for the prompt input, which adds overhead to the time to first token (TTFT).
Advantages of parallel decoding:
In small-batch inference scenarios, such as code generation or requests with sufficiently long inputs and outputs, parallel decoding compensates for limited memory bandwidth by using surplus computing resources, which improves computing efficiency. The benefit depends directly on the proportion of candidate tokens that pass verification. As a result, greedy decoding gains the most, while sampling and penalty mechanisms can reduce the gain.
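The link between verified tokens and speedup can be illustrated with a minimal sketch of one draft-and-verify step under greedy decoding. This is not the MindIE implementation: `verify_greedy`, `next_token`, and `toy_model` are hypothetical names, and the loop stands in for what a real engine would score in one batched forward pass.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    context: Sequence[int],
    candidates: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens emitted by one draft-and-verify step.

    next_token is a stand-in (assumption) for the model's greedy prediction.
    Only the acceptance rule is reproduced here, not the batched execution.
    """
    emitted: List[int] = []
    for cand in candidates:
        target = next_token(list(context) + emitted)
        emitted.append(target)  # the model's own token is always kept
        if cand != target:      # the first mismatch ends the step
            break
    else:
        # Every candidate matched (or none was proposed), so the step
        # still yields one more token predicted by the model itself.
        emitted.append(next_token(list(context) + emitted))
    return emitted


def toy_model(tokens: Sequence[int]) -> int:
    """Hypothetical deterministic 'model': predicts last token + 1 (mod 10)."""
    return (tokens[-1] + 1) % 10


if __name__ == "__main__":
    # Two of the three candidates are accepted; the third is corrected to 6.
    print(verify_greedy([1, 2, 3], [4, 5, 9], toy_model))
```

With no candidates the function emits exactly one token, which is ordinary auto-regressive decoding. Every accepted candidate is a token obtained without an extra step, which is why a high acceptance ratio matters.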
To fully leverage parallel decoding, the following conditions should be met:
- A low number of concurrent requests, constrained memory bandwidth, and surplus computational resources.
- Sufficiently long input to provide an initial source of candidate tokens.
- Extended output length, allowing parallel decoding to reduce inference steps and deliver performance gains.
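As a rough illustration of why long outputs benefit, the number of decoding steps can be estimated from the average number of candidate tokens accepted per step. The function name and the sample figures below are illustrative assumptions, not measurements from any device.

```python
import math


def estimated_decode_steps(output_len: int, mean_accepted: float) -> int:
    """Estimate decoding steps for output_len tokens.

    Assumes each step emits one model token plus, on average,
    mean_accepted verified candidate tokens.
    """
    if output_len < 0 or mean_accepted < 0:
        raise ValueError("output_len and mean_accepted must be non-negative")
    return math.ceil(output_len / (1.0 + mean_accepted))


if __name__ == "__main__":
    for accepted in (0.0, 1.0, 1.5):
        print(accepted, estimated_decode_steps(512, accepted))
```

With no accepted candidates the estimate equals the output length, so the saving grows only with the acceptance ratio and the number of tokens still to be generated.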
Two parallel decoding algorithms are supported, distinguished by their respective methods of candidate token generation, as illustrated in Table 1.
| Parallel Decoding Algorithm | Candidate Token Generation | Applicable Scenario |
|---|---|---|
| memory_decoding | Uses a trie-tree to cache historical inputs and outputs of a model and obtain candidate tokens. | Code generation or retrieval |
| lookahead | Generates candidate tokens based on Jacobi iteration, prompts, and output results. | Text generation, dialog systems, and diversified query answering |
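The trie-based idea behind memory_decoding can be sketched as follows. This is a simplified illustration under stated assumptions, not the data structure MindIE uses: `TokenTrie`, `insert`, and `propose` are hypothetical names, and choosing the most frequent continuation is one possible policy.

```python
from typing import Dict, List, Optional, Sequence


class TokenTrie:
    """One trie node; children are keyed by token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "TokenTrie"] = {}
        self.count = 0  # how often this path has been seen


def insert(root: TokenTrie, tokens: Sequence[int], max_depth: int) -> None:
    """Index every window of up to max_depth tokens from a past sequence."""
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:start + max_depth]:
            node = node.children.setdefault(tok, TokenTrie())
            node.count += 1


def propose(root: TokenTrie, suffix: Sequence[int], length: int) -> List[int]:
    """Follow the recent suffix, then return the most frequent continuation."""
    node: Optional[TokenTrie] = root
    for tok in suffix:
        node = node.children.get(tok)
        if node is None:
            return []  # the suffix has not been seen before
    out: List[int] = []
    while node.children and len(out) < length:
        tok, node = max(node.children.items(), key=lambda kv: kv[1].count)
        out.append(tok)
    return out


if __name__ == "__main__":
    cache = TokenTrie()
    insert(cache, [1, 2, 3, 4, 5], max_depth=4)
    print(propose(cache, [2, 3], length=2))
```

Repetitive workloads such as code generation tend to revisit earlier token sequences, so a cache of this kind is more likely to return candidates that the model then accepts.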
Constraints
- This feature is supported by the Atlas 800I A2 inference server and Atlas 300I Duo inference card.
- Only the Llama3 series, Qwen2 series, Qwen2.5 series, Qwen3-14B, and Qwen3-32B models support this feature.
- Parallel decoding supports only W8A8 quantization and sparse quantization.
- This feature cannot be used with prefill-decode disaggregation, Multi-LoRA, SplitFuse, long sequence, MTP, asynchronous scheduling, or multi-server inference.
- This feature does not support postprocessing parameters related to multi-sequence inference, such as n, best_of, use_beam_search, logprobs, and top_logprobs.
- Streaming inference is not supported in parallel decoding scenarios.
- Among the penalty postprocessing parameters, parallel decoding supports only repetition penalty.
- The lookahead and memory_decoding algorithms cannot be enabled at the same time.
Parameters
To enable the parallel decoding feature, set required parameters based on Table 2 to Table 6.
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| speculationGamma | uint32_t | Related to plugin parameters | In memory_decoding mode, the value of this field must be greater than or equal to that of decoding_length. It is recommended that the value be equal to decoding_length. |

| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| maxIterTimes | uint32_t | Related to plugin parameters | If dynamic_algo is set to true, the value must be greater than or equal to the expected output length + value of speculationGamma. For example, if the expected maximum output length is 512, the value must be greater than or equal to 512 + speculationGamma. |

| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| plugin_params | std::string | plugin_type: la; level: [3, 16]; window: [1, 16]; guess_set_size: [1, 16] | If plugin_type is set to la, lookahead is used for parallel decoding. level, window, and guess_set_size correspond to the N, W, and G parameters in the lookahead algorithm. Their default values are 4, 5, and 5, respectively. None of these parameters can exceed 16. Configuration example: `"{\"plugin_type\":\"la\",\"level\": 4,\"window\": 5,\"guess_set_size\": 5}"` |
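The rules stated in the parameter tables can be checked before the service is started. The sketch below is a hypothetical helper, not part of MindIE: `check_parallel_decoding` and `expected_output_len` are illustrative names, and only the constraints written in the tables above are encoded.

```python
import json
from typing import List

# Ranges and defaults as listed for the lookahead plugin parameters.
LOOKAHEAD_RANGES = {"level": (3, 16), "window": (1, 16), "guess_set_size": (1, 16)}
LOOKAHEAD_DEFAULTS = {"level": 4, "window": 5, "guess_set_size": 5}


def check_parallel_decoding(
    speculation_gamma: int,
    max_iter_times: int,
    plugin_params: str,
    expected_output_len: int,
) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems: List[str] = []
    params = json.loads(plugin_params)  # plugin_params is a JSON string
    plugin_type = params.get("plugin_type")

    if plugin_type == "la":
        for name, (low, high) in LOOKAHEAD_RANGES.items():
            value = params.get(name, LOOKAHEAD_DEFAULTS[name])
            if not low <= value <= high:
                problems.append(f"{name}={value} is outside [{low}, {high}]")
    elif plugin_type == "memory_decoding":
        decoding_length = params.get("decoding_length")
        if decoding_length is not None and speculation_gamma < decoding_length:
            problems.append("speculationGamma must be >= decoding_length")
    else:
        problems.append(f"unknown plugin_type: {plugin_type!r}")

    if params.get("dynamic_algo") is True:
        needed = expected_output_len + speculation_gamma
        if max_iter_times < needed:
            problems.append(f"maxIterTimes must be >= {needed}")
    return problems


if __name__ == "__main__":
    md = '{"plugin_type":"memory_decoding","decoding_length":16,"dynamic_algo":true}'
    print(check_parallel_decoding(16, 528, md, expected_output_len=512))
```

Returning a list of messages rather than raising on the first failure lets all configuration problems be reported in one pass.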
Running Inference
- Open the config.json file of the Server.

      cd {MindIE installation directory}/latest/mindie-service/
      vi conf/config.json

- Set serving parameters. Add the corresponding parameters to the config.json file of the Server based on Table 2 to Table 6. For details about the serving parameters, see Configuration Parameters (Service-Specific). The parameter configuration examples are as follows.
Configuration example of the memory_decoding algorithm:

    "ModelDeployConfig" :
    {
        "maxSeqLen" : 2560,
        "maxInputTokenLen" : 2048,
        "truncation" : false,
        "speculationGamma": 16,
        "ModelConfig" : [
            {
                "plugin_params":"{\"plugin_type\":\"memory_decoding\",\"decoding_length\":16,\"dynamic_algo\":true}",
                "modelInstanceType" : "Standard",
                "modelName" : "llama3-70b",
                "modelWeightPath" : "/data/weights/llama3-70b",
                "worldSize" : 4,
                "cpuMemSize" : 5,
                "npuMemSize" : -1,
                "backendType" : "atb",
                "trustRemoteCode" : false
            }
        ]
    }

Configuration example of the lookahead algorithm:

    "ModelDeployConfig" :
    {
        "maxSeqLen" : 2560,
        "maxInputTokenLen" : 2048,
        "truncation" : false,
        "speculationGamma": 30,
        "ModelConfig" : [
            {
                "plugin_params":"{\"plugin_type\":\"la\",\"level\":4,\"window\":5,\"guess_set_size\":5}",
                "modelInstanceType" : "Standard",
                "modelName" : "Qwen2.5-7B-Instruct",
                "modelWeightPath" : "/data/weights/Qwen2.5-7B-Instruct",
                "worldSize" : 1,
                "cpuMemSize" : 5,
                "npuMemSize" : -1,
                "backendType" : "atb",
                "trustRemoteCode" : false
            }
        ]
    }

- Start the service.

      ./bin/mindieservice_daemon
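Because plugin_params is a JSON document stored inside a JSON string, its quotes must be escaped, which is easy to get wrong by hand. The sketch below shows one way to generate the escaped value with Python's standard json module. `make_plugin_params` is a hypothetical helper, and only two ModelConfig fields are included for brevity.

```python
import json


def make_plugin_params(**fields) -> str:
    """Serialize plugin fields into the compact string stored in plugin_params."""
    return json.dumps(fields, separators=(",", ":"))


# A partial ModelConfig entry; the remaining fields are omitted here.
model_config = {
    "plugin_params": make_plugin_params(
        plugin_type="la", level=4, window=5, guess_set_size=5
    ),
    "modelName": "Qwen2.5-7B-Instruct",
}

# Serializing the outer object escapes the inner quotes automatically.
fragment = json.dumps(model_config, indent=4)

if __name__ == "__main__":
    print(fragment)
```

Parsing the fragment and then parsing its plugin_params value again recovers the original fields, which is a quick way to confirm that the escaping is correct before editing config.json.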