Overview
This chapter applies only to Kubernetes-based cluster service deployment. For details about the deployment diagram, see Figure 1.
Depending on how Server inference service instances are distributed across cluster compute nodes (inference servers) and on the inference mode they use, the following deployment modes are available:
| Deployment Mode | Description |
|---|---|
| Single-node service (non-distributed) | A single Server can be independently used as an inference service instance to deliver inference services to external systems. Depending on the compute node resources of a cluster, it can support one or multiple compute nodes, and one or more Servers can be deployed on a single compute node. For more details, see Single-Node (Non-Distributed) Service Deployment. |
| Single-container prefill-decode disaggregation | The Controller, Coordinator, Prefill, and Decode are deployed in one container in the prefill-decode disaggregation scenario. The RESTful interfaces for external services are the same as those in prefill-decode hybrid deployment. |
| Prefill-decode disaggregation | Multiple Servers are jointly deployed on one or more compute nodes, incorporating both prefill instances (prefill compute instances) and decode instances (decode compute instances). These instances are deployed separately, collaborating as a group to deliver inference services. For more details, see Prefill-Decode Disaggregation. |
| Large-scale EP and prefill-decode disaggregation | The large-scale EP and prefill-decode disaggregation service of the MoE model is different from the multi-container prefill-decode disaggregation service: Each prefill-decode group of a decode instance has an independent MindIE LLM RESTful interface exposed to the Coordinator. |
If the inference appliance is deployed as a single instance (single-node distillation version or dual-node full-sized version), a hardware fault interrupts services, and the recovery time is unpredictable. It is therefore recommended that the inference appliance be deployed as multiple instances. If a hardware fault then occurs on one instance, load balancing across the remaining instances keeps the service available.
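
The multi-instance recommendation can be illustrated with a minimal Python sketch of round-robin selection that skips unhealthy instances. It is not part of MindIE: the `InstancePool` class, the endpoint names, and the injected `is_healthy` callback are all hypothetical, and in a Kubernetes deployment this role is typically played by a Service or an external load balancer rather than client code.

```python
from typing import Callable, Iterable, Optional


class InstancePool:
    """Round-robin selection over inference instances, skipping unhealthy ones.

    Hypothetical helper for illustration only; endpoints are opaque strings
    and the health check is supplied by the caller.
    """

    def __init__(self, endpoints: Iterable[str], is_healthy: Callable[[str], bool]):
        self._endpoints = list(endpoints)
        if not self._endpoints:
            raise ValueError("at least one endpoint is required")
        self._is_healthy = is_healthy
        self._next = 0  # index of the instance to try first on the next pick

    def pick(self) -> Optional[str]:
        """Return the next healthy endpoint, or None if every instance is down."""
        n = len(self._endpoints)
        for offset in range(n):
            idx = (self._next + offset) % n
            candidate = self._endpoints[idx]
            if self._is_healthy(candidate):
                # Advance past the chosen instance so load is spread evenly.
                self._next = (idx + 1) % n
                return candidate
        return None
```

With a single endpoint, `pick()` returns `None` as soon as that instance fails, which corresponds to the service interruption described above. With two or more endpoints, requests continue to be routed to the instances that remain healthy.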
