Scenario

Introduction to Prefill-Decode Disaggregation Deployment

Prefill-decode disaggregation deployment decouples the prefill and decode phases of inference so that they can be deployed flexibly across single-node or multi-node environments. It is well suited to latency-sensitive scenarios and can improve NPU utilization for LLM inference. Prefill instances and decode instances are deployed separately, which reduces the latency interference that arises when the two phases time-share the same instance and improves throughput at the same latency. Figure 1 shows the principle of prefill-decode disaggregation.

Currently, the following prefill-decode disaggregation deployment modes are supported:

Single-node deployment: Controller, Coordinator, and Server run in a single pod. This mode is applicable to the scenario where services are deployed on one server.
Multi-node deployment: Controller, Coordinator, and Server run in independent pods. This mode is applicable to the scenario where services are deployed on multiple servers.

Figure 1 Principles of prefill-decode disaggregation

LLM inference can be divided into the prefill and decode phases.

Prefill phase: In a generative language model, the prefill phase involves processing initial prompts and generating initial hidden states. This phase typically includes a forward propagation of the entire model and is computationally intensive. A prefill operation is required for each new input sequence.
Decode phase: Following the prefill phase, the model incrementally generates subsequent text based on the initial hidden states. Each step involves far less computation than prefill, but the step is repeated until sufficient text is generated or a termination condition is met. During generation, only the most recently generated token is computed at each step, and the attention mechanism is used to determine the next predicted token.
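
The two phases can be illustrated with a minimal Python sketch. The toy "model" here is purely hypothetical: the cache is a plain list, and the next token is the sum of the cached values modulo a made-up vocabulary size. It is not how MindIE computes anything; it only shows the control flow, namely one pass over the whole prompt in prefill, followed by repeated single-token steps in decode until a termination condition is met.

```python
from typing import List, Tuple

VOCAB_SIZE = 100  # hypothetical vocabulary size for the toy model

Cache = List[int]  # stand-in for the cached per-token state


def prefill(prompt: List[int]) -> Tuple[Cache, int]:
    """Process the entire prompt once; return the cache and the first generated token."""
    cache = list(prompt)                   # every prompt token is processed here
    first_token = sum(cache) % VOCAB_SIZE  # toy next-token rule (assumption)
    return cache, first_token


def decode_step(cache: Cache, last_token: int) -> int:
    """Process only the most recent token, reusing the state left by earlier steps."""
    cache.append(last_token)
    return sum(cache) % VOCAB_SIZE


def generate(prompt: List[int], max_new_tokens: int, eos: int) -> List[int]:
    """Run prefill once, then decode until the length limit or the end token is reached."""
    cache, token = prefill(prompt)         # prefill always yields the first token
    output = [token]
    while len(output) < max_new_tokens and token != eos:
        token = decode_step(cache, token)
        output.append(token)
    return output


if __name__ == "__main__":
    print(generate([1, 2, 3], max_new_tokens=4, eos=-1))
```

In a disaggregated deployment, `prefill` and `decode_step` would run on different instances, with the cache handed over between them; the sketch keeps them in one process for brevity.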

Deployment Solution

Single-node deployment:
The inference entry of the prefill-decode cluster is exposed through a Kubernetes Service. A single Kubernetes Deployment creates one pod, in which Controller (single-process replica), Coordinator (single-process replica), and Server (multi-process replica) all run.

Figure 2 Single-node deployment solution

Multi-node deployment:
The inference entry of the prefill-decode cluster is exposed through a Kubernetes Service that routes to the Coordinator pod. Three Kubernetes Deployments are created, one each for Controller (single-pod replica), Coordinator (single-pod replica), and Server (multi-pod replica).

Figure 3 Multi-node deployment solution
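
The multi-node topology can be sketched as Kubernetes manifests built as plain Python dictionaries. This is only an illustration of the structure described above: the resource names (`pd-inference`, `controller`, `coordinator`, `server`), the `app` label, the image, and the port are all hypothetical placeholders, and the actual MindIE deployment files may be organized differently.

```python
from typing import Any, Dict, List


def deployment(name: str, replicas: int, image: str) -> Dict[str, Any]:
    """Build a minimal apps/v1 Deployment manifest (names and labels are placeholders)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


def multi_node_manifests(server_replicas: int, image: str, port: int) -> List[Dict[str, Any]]:
    """One Service in front of Coordinator, plus three Deployments."""
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "pd-inference"},        # hypothetical name
        "spec": {
            "selector": {"app": "coordinator"},      # inference entry goes to Coordinator
            "ports": [{"port": port, "targetPort": port}],
        },
    }
    return [
        service,
        deployment("controller", 1, image),          # single-pod replica
        deployment("coordinator", 1, image),         # single-pod replica
        deployment("server", server_replicas, image),  # multi-pod replica
    ]


if __name__ == "__main__":
    for manifest in multi_node_manifests(4, "example.invalid/image:tag", 8080):
        print(manifest["kind"], manifest["metadata"]["name"])
```

The single-node mode would instead use one Deployment whose single pod runs all three components.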

Advantages of Prefill-Decode Disaggregation

Resource utilization optimization: The prefill phase is compute-intensive, whereas the decode phase performs comparatively little computation per step. Separating the two phases allows the computing resources of NPUs to be used more effectively.
Improved throughput: The prefill and decode phases can handle different requests simultaneously. This means that while a new request is being processed in the prefill phase, the decode phase can continue decoding the previous request, enhancing the overall processing capability.
Reduced latency: Implementing the prefill and decode phases separately can reduce waiting time, especially when multiple requests arrive concurrently.
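
The throughput point, that one request can be in prefill while another is being decoded, can be sketched with two cooperating workers connected by a queue. Everything in this sketch is an assumption made for illustration: the worker functions, the use of `asyncio`, the prompt length as the handed-over "state", and the toy token rule are not part of MindIE.

```python
import asyncio
from typing import Dict, List, Optional, Tuple

Request = Tuple[str, List[int]]  # (request id, prompt tokens)


async def prefill_worker(inbox: asyncio.Queue, handoff: asyncio.Queue) -> None:
    """Consume new requests, run the (toy) prefill, and hand the state to decode."""
    while True:
        request: Optional[Request] = await inbox.get()
        if request is None:                # sentinel: no more requests
            await handoff.put(None)
            return
        request_id, prompt = request
        await asyncio.sleep(0)             # placeholder for prefill compute
        await handoff.put((request_id, len(prompt)))  # toy state (assumption)


async def decode_worker(handoff: asyncio.Queue, results: Dict[str, List[int]], steps: int) -> None:
    """Decode handed-over requests step by step while prefill keeps accepting new ones."""
    while True:
        item = await handoff.get()
        if item is None:
            return
        request_id, state = item
        tokens = []
        for step in range(steps):
            await asyncio.sleep(0)         # placeholder for one decode step
            tokens.append(state + step)    # toy token rule (assumption)
        results[request_id] = tokens


async def serve(requests: List[Request], steps: int = 3) -> Dict[str, List[int]]:
    inbox: asyncio.Queue = asyncio.Queue()
    handoff: asyncio.Queue = asyncio.Queue()
    results: Dict[str, List[int]] = {}
    for request in requests:
        inbox.put_nowait(request)
    inbox.put_nowait(None)
    # Both workers run concurrently, so prefill of request B overlaps decode of request A.
    await asyncio.gather(prefill_worker(inbox, handoff), decode_worker(handoff, results, steps))
    return results


if __name__ == "__main__":
    print(asyncio.run(serve([("a", [1, 2]), ("b", [1, 2, 3])])))
```

The sketch models only the scheduling overlap; it does not model the cost of transferring state between instances.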

Restrictions

Single-node deployment
- This feature is supported only by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
- The number of NPUs used by different prefill and decode nodes must be the same.
- LLaMA3-8B, Qwen2.5-7B, and Qwen3-8B support this feature.
- This feature cannot be used together with the prefix cache feature.
- This feature cannot be used together with sparse quantization and KV cache INT8 quantization.
- This feature does not support postprocessing parameters related to multi-sequence inference, such as n, best_of, use_beam_search, logprobs, and top_logprobs.
Multi-node deployment
- This feature is supported only by the Atlas 800I A2 inference server.
- The devices running on both the prefill and decode nodes must be of the same type.
- The number of NPUs used by prefill and decode nodes must be the same.
- NPU network ports are interconnected at 200 Gbit/s of bandwidth.
- This feature cannot be used together with Multi-LoRA, parallel decoding, SplitFuse, prefix cache, function call, multi-node inference, and long sequence features.
- The LLaMA3 series, Qwen2 series, Qwen3 series, and DeepSeek series models support this feature.
- This feature cannot be used together with sparse quantization and KV cache INT8 quantization.
- This feature does not support postprocessing parameters related to multi-sequence inference, such as n, best_of, use_beam_search, logprobs, and top_logprobs.
Currently, a stop string cannot be used to stop text inference (that is, the stop and include_stop_str_in_output parameters are not supported). For details about the parameters, see "API Description" > "RESTful API Reference" > "Compatible with OpenAI APIs" > "Inference APIs" in the MindIE LLM Development Guide.
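
Some of these restrictions can be checked before a request or a cluster layout is submitted. The following is a hedged sketch of such a pre-check, not a MindIE API: the `Node` class, both function names, and the device-type label are hypothetical. It treats any listed parameter that is explicitly set to a non-None value as unsupported, which is an assumption, and it applies the device-type rule listed above for multi-node deployment.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Parameter names taken from the restrictions listed above.
UNSUPPORTED_PARAMS = (
    "n", "best_of", "use_beam_search", "logprobs", "top_logprobs",
    "stop", "include_stop_str_in_output",
)


def find_unsupported_params(request: Dict[str, Any]) -> List[str]:
    """Return the unsupported parameters that the request explicitly sets."""
    return [name for name in UNSUPPORTED_PARAMS if request.get(name) is not None]


@dataclass
class Node:
    role: str         # "prefill" or "decode"
    device_type: str  # hypothetical label, for example "A2"
    npu_count: int


def check_nodes(nodes: List[Node]) -> List[str]:
    """Return human-readable problems with the prefill/decode node layout."""
    problems = []
    if len({node.npu_count for node in nodes}) > 1:
        problems.append("prefill and decode nodes must use the same number of NPUs")
    if len({node.device_type for node in nodes}) > 1:
        problems.append("prefill and decode nodes must run on the same device type")
    return problems


if __name__ == "__main__":
    print(find_unsupported_params({"prompt": "hi", "n": 2, "stop": ["\n"]}))
    print(check_nodes([Node("prefill", "A2", 8), Node("decode", "A2", 4)]))
```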

Hardware Environment

Table 1 lists the hardware environment supported by prefill-decode disaggregation deployment.

Table 1 Supported hardware
Type	Model	Memory
Server	Atlas 800I A2 inference server	32 GB or 64 GB
Server	Atlas 800I A3 SuperPoD Server	64 GB

A cluster must support parameter plane interconnection. This means that the server NPU ports must be in the same VLAN and capable of communication through RoCE.
To maintain service stability, users should strictly control the permissions of custom pods to prevent high-privilege pods from modifying internal parameters of MindIE, which may cause exceptions.
