Overview

This chapter applies only to Kubernetes-based cluster service deployment. Figure 1 shows the overall deployment view.

Figure 1 Overall deployment view of a Kubernetes cluster

Depending on how Server inference service instances are distributed across the cluster's compute nodes (inference servers) and on the inference mode they use, the following deployment modes are available:

**Table 1** Deployment modes
Deployment Mode	Description
Single-node service (non-distributed)	A single Server independently acts as an inference service instance and delivers inference services to external systems. Depending on the cluster's compute node resources, this mode can span one or more compute nodes, and one or more Servers can be deployed on a single compute node. For more details, see Single-Node (Non-Distributed) Service Deployment.
Single-container prefill-decode disaggregation	In the prefill-decode disaggregation scenario, the Controller, Coordinator, Prefill, and Decode components are all deployed in one container. The RESTful interfaces for external services are the same as those in prefill-decode hybrid deployment.
Prefill-decode disaggregation	Multiple Servers are deployed together on one or more compute nodes as prefill instances (prefill compute instances) and decode instances (decode compute instances). The two kinds of instances are deployed separately and work together as a group to deliver inference services. For more details, see Prefill-Decode Disaggregation.
Large-scale EP and prefill-decode disaggregation	The large-scale EP (expert parallelism) and prefill-decode disaggregation service for MoE models differs from the multi-container prefill-decode disaggregation service: each prefill-decode group of a decode instance exposes an independent MindIE LLM RESTful interface to the Coordinator.

If the inference appliance is deployed as a single instance (the single-node distillation version or the dual-node full-sized version), a hardware fault interrupts services, and the recovery time cannot be controlled. Deploying the inference appliance as multiple instances is therefore recommended: if a hardware fault occurs on one instance, load balancing across the remaining instances keeps the service available.
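
The multi-instance recommendation can be illustrated with a minimal Python sketch of round-robin load balancing that skips a faulty instance. It is not how MindIE or Kubernetes implements load balancing. The `Instance` and `RoundRobinBalancer` classes, the instance names, and the `healthy` flag are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for one inference service instance."""
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Illustrative balancer: rotates over instances and skips unhealthy ones."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = list(instances)
        self._next = 0  # index of the instance to try first on the next pick

    def pick(self):
        n = len(self._instances)
        # Try each instance at most once, starting from the rotation pointer.
        for offset in range(n):
            idx = (self._next + offset) % n
            candidate = self._instances[idx]
            if candidate.healthy:
                self._next = (idx + 1) % n
                return candidate
        raise RuntimeError("no healthy inference instance available")


instances = [Instance("instance-a"), Instance("instance-b")]
balancer = RoundRobinBalancer(instances)

# With both instances healthy, requests alternate between them.
before_fault = [balancer.pick().name for _ in range(4)]

# Simulate a hardware fault on one instance: traffic shifts to the other.
instances[0].healthy = False
after_fault = [balancer.pick().name for _ in range(3)]

print(before_fault)
print(after_fault)
```

With a single instance, the same fault would leave `pick()` with nothing to return, which corresponds to the service interruption described above.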
