Deploying a Multi-Node Prefill-Decode Disaggregation Service Using kubectl
Restrictions
- This mode supports the deployment of Server, Coordinator, and Controller in prefill-decode disaggregation scenarios.
- For Atlas 800I A2 inference server, the NPU IP address needs to be configured and Ascend Operator needs to be installed.
- The current deployment script does not support rescheduling upon NPU faults.
- In heterogeneous scenarios, the prefill instances must be deployed on Atlas 800I A2 inference server (32 GB), and the decode instances must be deployed on Atlas 800I A2 inference server (64 GB).
Script Description
This section describes how to use the scripts in the MindIE Motor installation directory (examples/kubernetes_deploy_scripts) to deploy and uninstall a prefill-decode disaggregation cluster on MindIE in one-click mode. The cluster administrator can use Kubernetes kubectl to operate the cluster offline by referring to these scripts.
The cluster administrator only needs to prepare the startup script and configure the services and Kubernetes on the management node. Calling the deployment script then automatically delivers the service configuration and startup script, generates a global ranktable, and schedules pods to compute nodes.
- The MindIE installation and deployment scripts must be executed by the Kubernetes administrator to prevent malicious tampering with scripts or configurations, which may cause arbitrary command execution or container escape risks.
- The Kubernetes administrator must strictly control the write, update, and delete permissions of the MindIE ConfigMap. It is recommended that the permission on the installation directory be set to 750 and the permission on the file be set to 640. To prevent security risks caused by malicious modification and mounting to the pod, you are advised to use namespaces and RBAC to restrict permissions.
- If the TLS certificate is enabled for the MindIE service, you need to remove the three Kubernetes health probe configurations (readinessProbe, livenessProbe, and startupProbe, including their subfields) from the controller_init.yaml and coordinator_init.yaml files. Otherwise, the service may be frequently restarted by Kubernetes. The MindIE service ensures high reliability through mechanisms such as active/standby switchover and pod rescheduling. Therefore, removing these health probes does not compromise service reliability.
- When requests are sent faster than they are processed, the Coordinator caches the unprocessed requests, leading to increased memory usage. As a result, request sending may be terminated because the memory usage reaches the upper limit. In this case, you need to appropriately increase the values of memory under the requests and limits fields in the coordinator_init.yaml file.
- The memory parameter under the requests field indicates the minimum memory required for Coordinator running.
- The memory parameter under the limits field indicates the upper limit of the memory available to the Coordinator.
To ensure that the Coordinator can reliably obtain the memory, the values of memory under the requests and limits fields should be the same (if possible). By default, the recommended memory specifications are as follows:
- 12Gi for a maximum of 10,000 concurrent requests
- 24Gi for a maximum of 20,000 concurrent requests
- 48Gi for a maximum of 40,000 concurrent requests
- 108Gi for a maximum of 90,000 concurrent requests
Maximum memory that can be occupied by the Coordinator = Value of body_limit x Value of max_requests x 120%
- body_limit: maximum size of a single request message body, in MB. You can configure and view the value in the ms_coordinator.json configuration file. For parameter details, see Parameters in the ms_coordinator.json Startup Configuration File.
- max_requests: maximum number of concurrent requests. You can configure and view the value in the ms_coordinator.json configuration file. For parameter details, see Parameters in the ms_coordinator.json Startup Configuration File.
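The formula above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and the input values (a 1 MB body limit and 10,000 concurrent requests) are assumptions chosen for illustration, not values read from ms_coordinator.json.

```python
def coordinator_max_memory_mb(body_limit_mb: int, max_requests: int) -> int:
    """Upper bound on Coordinator memory in MB: body_limit x max_requests x 120%."""
    if body_limit_mb <= 0 or max_requests <= 0:
        raise ValueError("body_limit_mb and max_requests must be positive")
    # Integer arithmetic, rounded up, so the estimate never falls short.
    return (body_limit_mb * max_requests * 120 + 99) // 100


# Hypothetical inputs: 1 MB body limit, 10,000 concurrent requests.
estimate_mb = coordinator_max_memory_mb(body_limit_mb=1, max_requests=10_000)
print(f"{estimate_mb} MB (about {estimate_mb / 1024:.1f} GiB)")
```

With these example inputs the estimate is 12000 MB (about 11.7 GiB), which is in line with the 12Gi recommendation for 10,000 concurrent requests.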
Directory structure of scripts:
├── boot_helper
│ ├── boot.sh
│ ├── gen_config_single_container.py
│ ├── get_group_id.py
│ ├── mindie_cpu_binding.py
│ ├── server_prestop.sh
│ └── update_mindie_server_config.py
├── chat.sh
├── conf
├── delete.sh
├── deploy.sh
├── deploy_ac_job.py
├── deployment
│ ├── controller_init.yaml
│ ├── coordinator_init.yaml
│ ├── mindie_ms_controller.yaml
│ ├── mindie_ms_coordinator.yaml
│ ├── mindie_server.yaml
│ ├── mindie_server_heterogeneous.yaml
│ ├── mindie_service_single_container.yaml
│ ├── mindie_service_single_container_base_A3.yaml
│ ├── server_init.yaml
│ └── single_container_init.yaml
├── gen_ranktable_helper
│ ├── gen_global_ranktable.py
│ └── global_ranktable.json
├── generate_stream.sh
├── log.sh
├── user_config.json
├── user_config_base_A3.json
└── utils
    ├── validate_config.py
    └── validate_utils.py
The following table describes the key directories and files for multi-node prefill-decode disaggregation deployment.
| Directory/File | Description |
|---|---|
| conf | Main service configuration file of the cluster management component and Server, which is used to manage scheduling policies and model configurations in the prefill-decode disaggregation scenario. |
| boot_helper | Contains the container startup script boot.sh. It is used to obtain group IDs, update environment variables to the configuration file, and set environment variables of the startup program. You can adjust the log level as required. NOTE: MindIE depends on the jemalloc.so library file. Do not install invalid SO files with the same name in the /usr/ directory, as this may introduce security risks such as arbitrary command execution. |
| deployment | Defines a Kubernetes deployment task and configures the NPU resource usage, number of instances, and image. |
| gen_ranktable_helper | Tool for generating the global_ranktable.json file. |
| chat.sh | Simple dialog example of using curl to send HTTPS requests to the inference service, which is applicable to the prefix cache scenario. |
| generate_stream.sh | Streaming response example for sending an HTTPS request to the inference service using curl. |
| deploy.sh | Deployment script, which is used to start all MindIE components in one-click mode. |
| delete.sh | Uninstallation script, which is used to uninstall all MindIE components in one-click mode. |
| log.sh | Queries the printed logs of all deployed pods. |
Procedure
This procedure uses LLaMA3-8B as an example, deployed as four instances with two NPUs each. Perform the following operations in the deployment script directory.
The following steps do not distinguish homogeneous deployment from heterogeneous deployment. For heterogeneous deployment, the corresponding configuration is described separately.
- Log in to the host of the cluster's management node and create a MindIE namespace for the first deployment. The default value is mindie. Replace it as required.
kubectl create namespace mindie
- Configure the startup configuration file ms_controller.json of Controller. For details about the configuration file, see Configuration Description.
For prefill-decode disaggregation deployment, set deploy_mode to pd_separate.
"deploy_mode": "pd_separate"
If heterogeneous deployment is required, set is_heterogeneous to true.
"is_heterogeneous": true
- The request_coordinator_tls_enable, request_server_tls_enable, http_server_tls_enable, and cluster_tls_enable parameters in the ms_controller.json file specify whether to enable HTTPS. You are advised to enable HTTPS (by setting these parameters to true) to ensure communication security. If HTTPS is disabled, high network security risks exist.
- true: HTTPS enabled for MindIE components in the cluster. You need to import certificates to the container and configure the corresponding certificate paths.
- false: HTTP enabled for MindIE components in the cluster. You do not need to prepare certificates.
- default_p_rate and default_d_rate in the ms_controller.json file control the ratio of prefill nodes to decode nodes in the cluster. The default values are both 0, in which case the optimal ratio is automatically determined based on the model, hardware, and service information. You can also set them to the actual numbers of prefill nodes and decode nodes for your scenario. If the environment variables MINDIE_MS_P_RATE and MINDIE_MS_D_RATE are set, their values take precedence. For details, see Environment Variable.
- Configure the startup configuration file ms_coordinator.json of Coordinator. For details about the configuration file, see Configuration Description.
For prefill-decode disaggregation deployment, set deploy_mode to pd_separate.
"deploy_mode": "pd_separate"
The controller_server_tls_enable, request_server_tls_enable, mindie_client_tls_enable, and mindie_mangment_tls_enable parameters in the ms_coordinator.json file specify whether to enable HTTPS. You are advised to enable HTTPS (by setting these parameters to true) to ensure communication security. If HTTPS is disabled, high network security risks exist.
- true: HTTPS enabled for MindIE components in the cluster. You need to import certificates to the container and configure the corresponding certificate paths.
- false: HTTP enabled for MindIE components in the cluster. You do not need to prepare certificates.
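Before deployment, the deploy_mode and TLS settings described for ms_controller.json and ms_coordinator.json can be reviewed with a small script. The following is a minimal sketch, not part of the MindIE scripts: the function names are made up for illustration, and because the nesting of these parameters inside the files is not described here, the sketch searches for each key at any depth. The `tls_config` wrapper in the sample is a hypothetical layout.

```python
from typing import Any, Iterator

CONTROLLER_TLS_KEYS = (
    "request_coordinator_tls_enable",
    "request_server_tls_enable",
    "http_server_tls_enable",
    "cluster_tls_enable",
)
COORDINATOR_TLS_KEYS = (
    "controller_server_tls_enable",
    "request_server_tls_enable",
    "mindie_client_tls_enable",
    "mindie_mangment_tls_enable",
)


def find_values(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def review_config(config: dict, tls_keys: tuple) -> list:
    """Return human-readable findings for one parsed configuration file."""
    findings = []
    if "pd_separate" not in list(find_values(config, "deploy_mode")):
        findings.append("deploy_mode is not set to pd_separate")
    for key in tls_keys:
        values = list(find_values(config, key))
        if not values:
            findings.append(f"{key} not found")
        elif not all(value is True for value in values):
            findings.append(f"{key} is disabled (HTTP)")
    return findings


# Hypothetical sample; real files would be loaded with json.load().
sample = {
    "deploy_mode": "pd_separate",
    "tls_config": {  # assumed nesting, for illustration only
        "request_coordinator_tls_enable": True,
        "request_server_tls_enable": True,
        "http_server_tls_enable": False,
        "cluster_tls_enable": True,
    },
}
print(review_config(sample, CONTROLLER_TLS_KEYS))
```

An empty list means that deploy_mode is pd_separate and every listed switch is set to true. The same function can be called with COORDINATOR_TLS_KEYS on the parsed ms_coordinator.json.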
- Configure the config.json file for starting the Server service. Table 2 lists the parameters that need to be configured for the prefill-decode disaggregation deployment mode. For details about the parameters, see "Core Concepts and Configurations" > "Configuration Parameters (Serving)" in MindIE LLM Development Guide.
Table 2 Key parameters in config.json
| Parameter | Description |
|---|---|
| modelName | Model name, which is associated with the model weight file, for example, llama3-8b. |
| modelWeightPath | Path of the model weight file. By default, the script mounts the /data directory of the physical machine. modelWeightPath must point to the model weights under /data so that the model files are available on the Ascend compute node to which the pod is scheduled. |
| worldSize | Number of NPUs occupied by a prefill/decode instance. For example, if this parameter is set to 2, two NPUs are used. |
| npuDeviceIds | NPU IDs, starting from 0. The total number of IDs must be the same as the value of worldSize, for example, [[0,1]]. |
| inferMode | Set this parameter to dmi. |
| tp | Number of tensor parallel processes on the entire network. The value must equal worldSize. This is a supplementary parameter and needs to be configured under the ModelConfig field. |
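The constraints in Table 2 are easy to get wrong when worldSize is changed. The sketch below checks them on values you pass in by hand. The function name is an assumption, and reading the values out of config.json is deliberately left out because the file's nesting is not described in this section.

```python
def check_server_params(world_size: int, npu_device_ids: list,
                        tp: int, infer_mode: str) -> list:
    """Return a list of violations of the Table 2 constraints (empty if none)."""
    problems = []
    # npuDeviceIds is a list of lists, for example [[0, 1]].
    device_count = sum(len(group) for group in npu_device_ids)
    if device_count != world_size:
        problems.append(
            f"npuDeviceIds lists {device_count} IDs but worldSize is {world_size}"
        )
    if tp != world_size:
        problems.append(f"tp is {tp} but worldSize is {world_size}")
    if infer_mode != "dmi":
        problems.append(f"inferMode is {infer_mode!r}, expected 'dmi'")
    return problems


# Values matching the two-NPU example used in this procedure.
print(check_server_params(world_size=2, npu_device_ids=[[0, 1]], tp=2, infer_mode="dmi"))
```

An empty list indicates that the four values agree with each other. The same worldSize value must also be used for huawei.com/Ascend910 in the Deployment YAML.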
- Configure the http_client_ctl.json configuration file, which is used to configure the HTTP client tool for cluster startup, liveness, and readiness probes. For details about the parameters, see Table 4.
tls_enable specifies whether to enable HTTPS. If the MindIE components in the cluster use the HTTPS interface, set tls_enable to true, import certificates to the container, and configure the corresponding certificate paths. To use the HTTP interface, set tls_enable to false. No certificate file is required.
You are advised to enable tls_enable to ensure communication security. If tls_enable is disabled, high network security risks exist.
- Configure the Kubernetes Deployment.
In the deployment directory of the deployment scripts, find the mindie_server.yaml, mindie_ms_coordinator.yaml, and mindie_ms_controller.yaml files, as well as mindie_server_heterogeneous.yaml (configured only in heterogeneous scenarios).
- The script is for reference only. You need to ensure the security of the pod container. In the actual production environment, harden the security of the image and pod.
- When using kubectl to deploy a Deployment, you can modify the YAML configuration file of the Deployment. Do not use dangerous configurations, and ensure that a secure image (non-root user) is used to configure secure pod contexts.
- You must mount secure paths (non-soft links, non-dangerous system paths, and non-service sensitive paths) and set proper directory permissions to prevent public directories such as /home from being mounted and prevent container escape caused by tampering by unauthorized users.
- Fields to be configured in the mindie_server.yaml file:
- replicas:
- Homogeneous scenario: total number of prefill and decode instances. If there are three prefill instances and one decode instance, set this parameter to 4.
- Heterogeneous scenario: number of prefill instances. If there are three prefill instances, set this parameter to 3.
- huawei.com/Ascend910: number of NPUs occupied by a prefill/decode instance. The value must be the same as that of worldSize in the config.json file of MindIE LLM.
- image: image name.
- MindIE and ATB Models used together: see MindIE and ATB Models Used Together.
- MindIE and MindSpore used together: see MindIE and MindSpore Used Together.
- startupProbe: startup probe. The default startup time is 500 seconds. If the service fails to be started within the time, the pod automatically restarts. Set a proper startup time as required.
- affinity: anti-affinity deployment configuration. By default, it is disabled. After the comment is deleted, it is enabled. After this function is enabled, each Server pod is deployed on different compute nodes in anti-affinity mode. Multiple pods are not deployed on the same node.
- nodeSelector: selects the node to be scheduled through node parameters.
- hardware_type: required only for heterogeneous deployment, which is used to enable Controller to determine the heterogeneous inference role.
- hardware-type: required only for heterogeneous deployment. It is the node label managed by the Kubernetes cluster and is used to allocate heterogeneous device resources. Before heterogeneous deployment, run the following command to label heterogeneous compute nodes:
kubectl label node xx_node hardware-type=xx_device
In heterogeneous scenarios, the parameter values must meet the following requirements:
- hardware_type: 800I A2(32G) for a prefill node, and 800I A2(64G) for a decode node.
- hardware-type: The value of this parameter cannot contain spaces.
- MINDIE_LOG_TO_FILE: whether to write logs of each MindIE component to a file.
The default value is 1, indicating that logs are written into a file. The value range is [false, true] or [0, 1].
- MINDIE_LOG_TO_STDOUT: whether to print logs of each MindIE component.
The default value is 1, indicating that logs are printed. The value range is [false, true] or [0, 1].
- MINDIE_LOG_LEVEL: log level of each MindIE component.
The default value is INFO. The log levels are [CRITICAL, ERROR, WARN, INFO, DEBUG].
- Pay special attention to the image name and log settings in the mindie_ms_coordinator.yaml and mindie_ms_controller.yaml files.
- MindIE and ATB Models used together: see MindIE and ATB Models Used Together.
- MindIE and MindSpore used together: see MindIE and MindSpore Used Together.
- MINDIE_LOG_TO_FILE: whether to write logs of each MindIE component to a file.
The default value is 1, indicating that logs are written into a file. The value range is [false, true] or [0, 1].
- MINDIE_LOG_TO_STDOUT: whether to print logs of each MindIE component.
The default value is 1, indicating that logs are printed. The value range is [false, true] or [0, 1].
- MINDIE_LOG_LEVEL: log level of each MindIE component.
The default value is INFO. The log levels are [CRITICAL, ERROR, WARN, INFO, DEBUG].
- Pay special attention to the livenessProbe parameter in the mindie_ms_coordinator.yaml file. For high-concurrency inference requests, the probe may time out and Kubernetes may identify that the pod is not alive. As a result, Kubernetes restarts the Coordinator container. Exercise caution when enabling livenessProbe.
- To enable the fault recovery function of MindIE Controller, the directory specified by name: status-data under the volumes parameter in the mindie_ms_controller.yaml file must exist on the compute node to be deployed (specified by nodeSelector). The mount path of name: status-data under volumeMounts must be /MindIE Motor installation path/logs.
- Fields to be configured in the mindie_server_heterogeneous.yaml file:
- replicas: number of decode instances. If there are four decode instances, set this parameter to 4.
- huawei.com/Ascend910: number of NPUs occupied by a prefill/decode instance. If there are two NPUs, set this parameter to 2.
- image: image name.
- MindIE and ATB Models used together: see MindIE and ATB Models Used Together.
- MindIE and MindSpore used together: see MindIE and MindSpore Used Together.
- startupProbe: startup probe. The default startup time is 500 seconds. If the service fails to be started within the time, the pod automatically restarts. Set a proper startup time as required.
- nodeSelector: selects the node to be scheduled through node parameters.
- hardware_type: required only for heterogeneous deployment, which is used to enable Controller to determine the heterogeneous inference role. The value of this parameter must be different from that of hardware_type in the mindie_server.yaml file.
- hardware-type: required only for heterogeneous deployment. It is the node label managed by the Kubernetes cluster and is used to allocate heterogeneous device resources. Before heterogeneous deployment, run the following command to label heterogeneous compute nodes. The value of this parameter must be different from that of hardware-type in the mindie_server.yaml file.
kubectl label node xx_node hardware-type=xx_device2
In heterogeneous scenarios, the parameter values must meet the following requirements:
- hardware_type: 800I A2(32G) for a prefill node, and 800I A2(64G) for a decode node.
- hardware-type: The value of this parameter cannot contain spaces.
- MINDIE_LOG_TO_FILE: whether to write logs of each MindIE component to a file.
The default value is 1, indicating that logs are written into a file. The value range is [false, true] or [0, 1].
- MINDIE_LOG_TO_STDOUT: whether to print logs of each MindIE component.
The default value is 1, indicating that logs are printed. The value range is [false, true] or [0, 1].
- MINDIE_LOG_LEVEL: log level of each MindIE component.
The default value is INFO. The log levels are [CRITICAL, ERROR, WARN, INFO, DEBUG].
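The replicas rules for the two Server YAML files can be summarized in code. This is an illustrative sketch only: the function names are hypothetical, and requiring at least one prefill and one decode instance is an assumption made here, not a documented limit.

```python
def plan_replicas(prefill: int, decode: int, heterogeneous: bool) -> dict:
    """Map each Server YAML file to the replicas value it should carry."""
    # Assumption: a prefill-decode cluster needs at least one instance of each role.
    if prefill < 1 or decode < 1:
        raise ValueError("at least one prefill and one decode instance are expected")
    if heterogeneous:
        # Prefill instances go to mindie_server.yaml, decode instances to the
        # heterogeneous file.
        return {
            "mindie_server.yaml": prefill,
            "mindie_server_heterogeneous.yaml": decode,
        }
    # Homogeneous: a single Deployment holds all prefill and decode instances.
    return {"mindie_server.yaml": prefill + decode}


def valid_node_label(value: str) -> bool:
    """The hardware-type node label value must be non-empty and contain no spaces."""
    return value != "" and not any(ch.isspace() for ch in value)


print(plan_replicas(prefill=3, decode=1, heterogeneous=False))
print(plan_replicas(prefill=3, decode=4, heterogeneous=True))
print(valid_node_label("xx_device"), valid_node_label("800I A2"))
```

The last line illustrates why the hardware_type value (for example, 800I A2(32G)) cannot be reused directly as the hardware-type node label: the label value cannot contain spaces.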
- Configure the boot.sh script. For details about the environment variables that can be configured, see Table 3.
Table 3 Environment variables
| Environment Variable | Type | Description |
|---|---|---|
| MINDIE_INFER_MODE | Prefill-decode disaggregation | Inference mode, indicating whether prefill-decode disaggregation is enabled. standard: prefill-decode hybrid deployment. dmi: prefill-decode disaggregation. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_DECODE_BATCH_SIZE | Public variable | Maximum decode batch size. Value range: [1, 5000]. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_PREFILL_BATCH_SIZE | Public variable | Maximum prefill batch size. Value range: [1, MINDIE_DECODE_BATCH_SIZE - 1]. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_MAX_SEQ_LEN | Public variable | Maximum sequence length. Integer in the range of (0, 4294967295]. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_MAX_ITER_TIMES | Public variable | Maximum output length. Integer in the range of [1, MINDIE_MAX_SEQ_LEN - 1]. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_MODEL_NAME | Public variable | Model name. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_MODEL_WEIGHT_PATH | Public variable | Path of the model weight file. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_ENDPOINT_HTTPS_ENABLED | Public variable | Whether to enable HTTPS on the prefill/decode instance. true: enabled. false: disabled. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_INTER_COMM_TLS_ENABLED | Public variable | Whether to enable TLS for communication between inference instances. true: enabled. false: disabled. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| HCCL_RDMA_RETRY_CNT | Public variable | Number of retransmission times of the RDMA NIC. The value must be an integer ranging from 1 to 7. The default value is 7. |
| HCCL_RDMA_TIMEOUT | Public variable | Retransmission timeout of the RDMA NIC. The minimum retransmission timeout of the RDMA NIC is calculated as 4.096 μs * 2 ^ timeout, where timeout is the value of this environment variable. The actual retransmission timeout depends on the user network status. Set this environment variable to an integer in the range [5, 20]. The default value is 18. |
| HCCL_EXEC_TIMEOUT | Public variable | Synchronization wait time during task execution between devices. Within this configured time, each device process waits for other devices to perform communication synchronization. This variable is used to set the timeout interval of the first token. The value range is [0, 2147483647], in seconds. The default value is 60. The value 0 indicates that there is no timeout. |
| HSECEASY_PATH | Public variable | Path of the dependency library of the KMC decryption tool. |
| MINDIE_MS_CONTROLLER_CONFIG_FILE_PATH | Public variable | Configuration file path of the Controller component. |
| MINDIE_MS_COORDINATOR_CONFIG_FILE_PATH | Public variable | Configuration file path of the Coordinator component. |
Note: For details about log-related environment variables, see Log Configuration.
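Two items in Table 3 involve arithmetic that is worth checking before editing boot.sh: the HCCL_RDMA_TIMEOUT formula and the dependent value ranges of the batch-size and length variables. The sketch below reproduces both. The function names and the sample values passed to check_limits are assumptions for illustration only.

```python
def rdma_min_retransmission_timeout_s(timeout: int) -> float:
    """Minimum RDMA retransmission timeout in seconds: 4.096 us * 2 ** timeout."""
    if not 5 <= timeout <= 20:
        raise ValueError("HCCL_RDMA_TIMEOUT must be an integer in [5, 20]")
    return 4.096e-6 * 2 ** timeout


def check_limits(prefill_batch: int, decode_batch: int,
                 max_iter_times: int, max_seq_len: int) -> list:
    """Return violations of the value ranges listed in Table 3 (empty if none)."""
    problems = []
    if not 1 <= decode_batch <= 5000:
        problems.append("MINDIE_DECODE_BATCH_SIZE must be in [1, 5000]")
    if not 1 <= prefill_batch <= decode_batch - 1:
        problems.append(
            "MINDIE_PREFILL_BATCH_SIZE must be in [1, MINDIE_DECODE_BATCH_SIZE - 1]"
        )
    if not 0 < max_seq_len <= 4294967295:
        problems.append("MINDIE_MAX_SEQ_LEN must be in (0, 4294967295]")
    if not 1 <= max_iter_times <= max_seq_len - 1:
        problems.append(
            "MINDIE_MAX_ITER_TIMES must be in [1, MINDIE_MAX_SEQ_LEN - 1]"
        )
    return problems


for value in (5, 18, 20):
    print(value, f"{rdma_min_retransmission_timeout_s(value):.6f} s")

# Hypothetical values, chosen only to exercise the checks.
print(check_limits(prefill_batch=50, decode_batch=200,
                   max_iter_times=512, max_seq_len=2560))
```

With the default value 18, the minimum retransmission timeout works out to about 1.07 seconds. The minimum and maximum settings (5 and 20) give about 131 microseconds and about 4.29 seconds respectively.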
- Start the prefill-decode cluster.
Configure the MindIE installation directory in the container. Change the value of MINDIE_USER_HOME_PATH based on the actual installation path during image creation. For example, if the installation path is /xxx/Ascend/mindie, set the value to /xxx.
export MINDIE_USER_HOME_PATH={Image installation path}
Run the following command to start the cluster:
bash deploy.sh
After the command is executed, wait until global_ranktable.json is generated. If the generation is blocked for a long time, press Ctrl+C to interrupt the process, and check the pod status of the cluster for debugging.
The following is an example of global_ranktable.json. For details about the parameters, see Table 4.
{
  "version": "1.0",
  "server_group_list": [
    {
      "group_id": "2",
      "server_count": "2",
      "server_list": [
        {
          "server_id": "xxx.xxx.xxx.1",
          "server_ip": "xxx.xxx.xxx.1",
          "device": [
            {
              "device_id": "0",
              "device_ip": "xxx.xxx.xxx.1",
              "device_logical_id": "0"
            }
          ],
          "hardware_type": "800I A2(32G)"
        },
        {
          "server_id": "xxx.xxx.xxx.2",
          "server_ip": "xxx.xxx.xxx.2",
          "device": [
            {
              "device_id": "1",
              "device_ip": "xxx.xxx.xxx.2",
              "device_logical_id": "1"
            }
          ],
          "hardware_type": "800I A2(64G)"
        }
      ]
    },
    {
      "group_id": "1",
      "server_count": "1",
      "server_list": [
        {
          "server_id": "xxx.xxx.xxx.1",
          "server_ip": "xxx.xxx.xxx.1"
        }
      ]
    },
    {
      "group_id": "0",
      "server_count": "1",
      "server_list": [
        {
          "server_id": "xxx.xxx.xxx.1",
          "server_ip": "xxx.xxx.xxx.1"
        }
      ]
    }
  ],
  "status": "completed"
}
Table 4 Parameters in the global_ranktable.json file
| Parameter | Type | Description |
|---|---|---|
| version | String | Version number of Ascend Operator. |
| status | String | Cluster information status. completed: deployment completed. initializing. |
| group_id | String | ID of each component. 0: Coordinator deployment information. 1: Controller deployment information. 2: Server deployment information. |
| server_count | String | Total number of nodes of each component. |
| server_list | JSON object array | Node deployment information of each component. A maximum of one Controller instance can be contained; the valid list length is [0, 1]. A maximum of one Coordinator instance can be contained; the valid list length is [0, 1]. A maximum of 96 Server instances can be included; the valid list length is [0, 96]. |
| server_id | String | Node host IP address of a component. |
| server_ip | String | Node IP address of a component. |
| device | JSON object array | NPU device information. Only Server has this attribute. The valid list length is [1, 128]. |
| device_id | String | NPU device ID. |
| device_ip | String | NPU IP address. |
| device_logical_id | String | Logical ID of the NPU, that is, the sequence ID of the visible device in the pod where Server is located. |
| hardware_type | String | Hardware type. This attribute is available only in heterogeneous mode. 800I A2(32G) for a prefill node. 800I A2(64G) for a decode node. |
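Once global_ranktable.json has been generated, its contents can be sanity-checked against the limits in Table 4. The following is a minimal sketch under stated assumptions: the function name and error messages are made up for illustration, and the inline sample is a shortened stand-in for the real file, which would normally be loaded with json.load().

```python
# group_id -> (component name, maximum server_list length), taken from Table 4.
COMPONENTS = {"0": ("Coordinator", 1), "1": ("Controller", 1), "2": ("Server", 96)}


def summarize_ranktable(ranktable: dict) -> dict:
    """Return {component name: number of servers}, raising ValueError on violations."""
    if ranktable.get("status") != "completed":
        raise ValueError("ranktable status is not completed")
    summary = {}
    for group in ranktable.get("server_group_list", []):
        group_id = group.get("group_id")
        if group_id not in COMPONENTS:
            raise ValueError(f"unknown group_id: {group_id!r}")
        name, max_servers = COMPONENTS[group_id]
        servers = group.get("server_list", [])
        if len(servers) > max_servers:
            raise ValueError(f"{name} has {len(servers)} servers, limit is {max_servers}")
        if name == "Server":
            # Only Server entries carry NPU device information.
            for server in servers:
                if not 1 <= len(server.get("device", [])) <= 128:
                    raise ValueError(f"invalid device list for {server.get('server_id')}")
        summary[name] = len(servers)
    return summary


# Shortened, hypothetical sample with placeholder addresses.
sample = {
    "version": "1.0",
    "status": "completed",
    "server_group_list": [
        {"group_id": "2", "server_count": "1", "server_list": [
            {"server_id": "xxx.xxx.xxx.1", "server_ip": "xxx.xxx.xxx.1",
             "device": [{"device_id": "0", "device_ip": "xxx.xxx.xxx.1",
                         "device_logical_id": "0"}]}]},
        {"group_id": "1", "server_count": "1", "server_list": [
            {"server_id": "xxx.xxx.xxx.1", "server_ip": "xxx.xxx.xxx.1"}]},
        {"group_id": "0", "server_count": "1", "server_list": [
            {"server_id": "xxx.xxx.xxx.1", "server_ip": "xxx.xxx.xxx.1"}]},
    ],
}
print(summarize_ranktable(sample))
```

A ValueError with the message "ranktable status is not completed" corresponds to the initializing state, in which case the file should be read again later.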
- If the deployment fails, delete the cluster by referring to step 11 and deploy it again.
- By default, the cluster updates the ConfigMap mounted to the container every 60 seconds. If it takes long to print the message "status of ranktable is not completed" in the container, you can change the interval for synchronizing the ConfigMap by kubelet on each compute node to be scheduled. That is, modify the syncFrequency parameter in the /var/lib/kubelet/config.yaml file to reduce the period to 5 seconds. Note that this modification may affect the cluster performance.
syncFrequency: 5s
Restart kubelet:
swapoff -a
systemctl restart kubelet.service
systemctl status kubelet
- Ensure that Docker is configured with the maximum specifications for writing standard output streams to files to prevent pods from being evicted due to full drive space.
After modifying the Docker configuration file on the compute node where the service is to be deployed, restart Docker.
- Open the daemon.json file.
vim /etc/docker/daemon.json
Add log-opts to the daemon.json file as follows.
"log-opts":{"max-size":"500m", "max-file":"3"}
Parameters:
max-size=500m indicates that the maximum size of a container log file is 500 MB.
max-file=3 indicates that a container has a maximum of three log files. If the number of log files exceeds three, the log files are automatically rotated.
- Restart Docker.
systemctl daemon-reload
systemctl restart docker
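If daemon.json already contains other settings, the log-opts entry has to be merged in rather than pasted over the file. The sketch below shows one way to do that merge on a parsed configuration. The function name and the existing data-root entry are hypothetical, and writing the result back to /etc/docker/daemon.json is left to the administrator.

```python
import json


def with_log_opts(daemon_config: dict, max_size: str = "500m", max_file: int = 3) -> dict:
    """Return a copy of the Docker daemon config with the log rotation limits set."""
    updated = dict(daemon_config)
    log_opts = dict(updated.get("log-opts", {}))
    log_opts["max-size"] = max_size
    # Values under log-opts are written as strings, as in the example above.
    log_opts["max-file"] = str(max_file)
    updated["log-opts"] = log_opts
    return updated


# Hypothetical existing configuration; other keys are preserved by the merge.
existing = {"data-root": "/var/lib/docker"}
print(json.dumps(with_log_opts(existing), indent=2))
```

With max-size set to 500m and max-file set to 3, each container keeps at most three log files of up to 500 MB each.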
To enable heterogeneous cluster inference, run the following command:
bash deploy.sh heter
- Run the kubectl command to check the prefill-decode cluster status.
kubectl get pods -n mindie -o wide
For example, if four Server instances are started, the following information is displayed:
NAME                                     READY   STATUS    RESTARTS   AGE    IP            NODE       NOMINATED NODE   READINESS GATES
mindie-ms-controller-7845dcd697-h4gw7    1/1     Running   0          145m   xx.xx.xx.xx   ubuntu10   <none>           <none>
mindie-ms-coordinator-6bff995ff8-l6fwz   1/1     Running   0          145m   xx.xx.xx.xx   ubuntu10   <none>           <none>
mindie-server-7b795f8df9-2xvh4           1/1     Running   0          145m   xx.xx.xx.xx   ubuntu     <none>           <none>
mindie-server-7b795f8df9-j4z7d           1/1     Running   0          145m   xx.xx.xx.xx   ubuntu     <none>           <none>
mindie-server-7b795f8df9-v2tcz           1/1     Running   0          145m   xx.xx.xx.xx   ubuntu     <none>           <none>
mindie-server-7b795f8df9-vl9hv           1/1     Running   0          145m   xx.xx.xx.xx   ubuntu     <none>           <none>
- mindie-ms-controller: Controller
- mindie-ms-coordinator: Coordinator
- mindie-server: Server
If the pod is in the Running status, the pod container has been successfully scheduled to a node and started. However, you need to further check whether the service program is started successfully.
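When many Server pods are deployed, scanning the status table by eye becomes error-prone. The following sketch parses the text printed by kubectl get pods and lists the pods that are not yet Running with all containers ready. It is an illustrative helper, not part of the MindIE scripts: the function name is an assumption, the sample text is made up, and the parser relies only on the first three columns (NAME, READY, STATUS).

```python
def pods_not_ready(kubectl_output: str) -> list:
    """Return names of pods that are not Running or whose READY count is incomplete."""
    not_ready = []
    # Skip the header row, then read NAME, READY, and STATUS from each line.
    for line in kubectl_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 3:
            continue
        name, ready, status = fields[:3]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            not_ready.append(name)
    return not_ready


# Hypothetical output in which one Server pod has not passed its probes yet.
sample_output = """\
NAME                                     READY   STATUS    RESTARTS   AGE
mindie-ms-controller-7845dcd697-h4gw7    1/1     Running   0          145m
mindie-ms-coordinator-6bff995ff8-l6fwz   1/1     Running   0          145m
mindie-server-7b795f8df9-2xvh4           0/1     Running   0          2m
"""
print(pods_not_ready(sample_output))
```

An empty list only shows that Kubernetes considers the pods ready. The service logs still need to be checked to confirm that the programs started correctly.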
- You can use the provided log.sh script to query the standard output logs of pods and check whether a program exception occurs.
bash log.sh
If heterogeneous cluster inference is configured, run the following command:
bash log.sh heter
- To query the logs of a specific pod (for example, mindie-server-7b795f8df9-vl9hv), run the following command:
kubectl logs mindie-server-7b795f8df9-vl9hv -n mindie
- To access the container to search for more information, run the following command:
kubectl exec -it mindie-server-7b795f8df9-vl9hv -n mindie -- bash
- To confirm the pods corresponding to prefill and decode nodes, run the following command after Controller is started (for example, mindie-ms-controller-7845dcd697-h4gw7 is in the READY 1/1 state):
kubectl logs mindie-ms-controller-7845dcd697-h4gw7 -n mindie | grep UpdateServerInfo
After the pod IP addresses of the prefill and decode nodes are queried, you can find the corresponding pods based on the IP addresses in the command output of the preceding command for querying the pod status.
- Use the provided generate_stream.sh script to initiate a streaming inference request. After the deployment is successful, port 31015 is opened on the node for the inference service interface. Change the IP address in generate_stream.sh to the host IP address of the cluster management node. If HTTPS is enabled for Coordinator, configure the related certificates. If HTTP is used, change HTTPS in the script to HTTP and delete the certificate configuration.
bash generate_stream.sh
- Delete the prefill-decode cluster. To stop the prefill-decode service, or to modify the service configuration and redeploy, run the following command to delete the deployed instances. Then redeploy by referring to step 8.
bash delete.sh mindie ./
- mindie indicates the namespace created in step 1. Replace it as required.
- The uninstallation script delete.sh must be executed in the examples/kubernetes_deploy_scripts directory. Otherwise, the service cannot be stopped and an error is reported.