Deploying a Service Using kubectl

This mode supports the deployment of Server, Coordinator, and Controller in single-node scenarios.
For the Atlas 800I A2 inference server, you need to configure the NPU IP address and install Ascend Operator. And for Atlas 300I Duo inference card+Atlas 800 inference server (model 3000), skip these operations.
The current deployment script does not support rescheduling upon NPU faults.

This section uses the script in the installation directory (./mindie-service/latest/examples/kubernetes_deploy_scripts) of the MindIE Motor RUN package to call kubectl to deploy services. The script provides one-click deployment and uninstallation of single-node clusters. The cluster administrator can use the Kubernetes kubectl to deploy services offline by referring to related scripts.

The cluster administrator only needs to compile the startup script, configure services and Kubernetes on the management node, and then call the deployment script to automatically deliver the service configuration and startup script, to automatically generate a ranktable containing node information and schedule pods to compute nodes.

Directory structure of scripts:

├── boot_helper
│   ├── boot.sh
│   ├── gen_config_single_container.py
│   ├── get_group_id.py
│   └── update_mindie_server_config.py
├── chat.sh
├── conf
├── delete.sh
├── deployment
│   ├── mindie_ms_controller.yaml
│   ├── mindie_ms_coordinator.yaml
│   ├── mindie_server_heterogeneous.yaml
│   ├── mindie_server.yaml
│   └── mindie_service_single_container.yaml
├── deploy.sh
├── generate_stream.sh
├── gen_ranktable_helper
│   ├── gen_global_ranktable.py
│   └── global_ranktable.json
└── log.sh

Description of the key directories and files:

boot_helper: contains the container startup script boot.sh. It is used to obtain group IDs, update environment variables to the configuration file, and set environment variables of the startup program. You can adjust the log level as required.
chat.sh: simple dialog example for using curl to send HTTP requests to the inference service.
conf: main service configuration file of the cluster management component and Server, which is used to manage scheduling policies and model configurations in the prefill-decode disaggregation scenario.
delete.sh: uninstallation script, which is used to uninstall all MindIE components in one-click mode.
deployment: defines a Kubernetes deployment task and configures the NPU resource usage, number of instances, and image.
deploy.sh: deployment script, which is used to start all MindIE components in one-click mode.
generate_stream.sh: streaming response example for sending an HTTPS request to the inference service using curl.
gen_ranktable_helper: tool for generating a global ranktable, which does not need to be perceived by users.
log.sh: queries the printed logs of all deployed pods.

Prerequisites

You have installed and configured Kubernetes, installed MindCluster components, and created a MindIE image by referring to Kubernetes Installation and Configuration, MindCluster Installation, and MindIE Image Preparation.

Procedure

The following is a deployment example. Perform the following operations in the deployment script path.

Create a MindIE namespace. The default value is mindie. Replace it as required.
```
kubectl create namespace mindie
```
Configure Controller and Coordinator of the cluster management component and set the deployment mode of the two components to single-node (non-distributed) service deployment. The required parameters are as follows.
- Configure the ms_controller.json file. For details about the parameters, see Configuration Description.
  To set the single-node (non-distributed) service deployment mode, set the following parameter:
```
"deploy_mode"= "single_node"
```
- Configure the ms_coordinator.json file. For details about the parameters, see Configuration Description.
  - To set the single-node (non-distributed) service deployment mode, set the following parameter:
```
"deploy_mode"= "single_node"
```
  - To configure the cache affinity scheduling for OpenAI multi-round sessions, set the following parameters:
```
"scheduler_type": "default_scheduler",
"algorithm_type": "cache_affinity",
```
Configure the config.json file for starting the Server service. The following parameters need to be configured for the single-node (non-distributed) service deployment mode. For details about the parameters, see "Core Concepts and Configurations" > "Configuration Parameters (Serving)" in MindIE LLM Development Guide.
- modelWeightPath: directory of the model weight file. By default, the script is mounted to the /data directory of the physical machine. This parameter must be set to the model weight path in the /data directory to ensure that the model file exists in this path of the compute node that can be scheduled in the cluster.
- worldSize: number of NPUs occupied by an instance. For example, if the value is 2, two NPUs are used.
- npuDeviceIds: NPU ID, starting from 0. The total number of IDs is the same as the value of worldSize, for example, [[0,1]].
- inferMode: Set it to standard.
- Enabling prefix cache:
  - Add the following configuration under ModelConfig in ModelDeployConfig:
```
"plugin_params": "{\"plugin_type\":\"prefix_cache\"}"
```
  - Add the following configuration to ScheduleConfig:
```
"enablePrefixCache": true
```
Configure the HTTP client tool configuration file http_client_ctl.json for cluster startup, liveness, and readiness probes.
tls_enable specifies whether to use HTTPS.
- true: HTTPS enabled for MindIE components in the cluster. You need to import certificates to the container and configure the corresponding certificate paths.
- false: HTTP enabled for MindIE components in the cluster. You do not need to prepare certificates.
You are advised to enable tls_enable to ensure communication security. If tls_enable is disabled, high network security risks exist.
Configure the Kubernetes Deployment.
Find the mindie_server.yaml, mindie_ms_coordinator.yaml, and mindie_ms_controller.yaml files under the deployment directory of the deployment script directory.
- The script is for reference only. You need to ensure the security of the pod container. In the actual production environment, harden the security of the image and pod.
- When using kubectl to deploy a Deployment, you need to modify the YAML configuration file of the Deployment. Do not use dangerous configurations, and ensure that a secure image (non-root user) is used to configure secure pod contexts.
- You must mount secure paths (non-soft links, non-dangerous system paths, and non-service sensitive paths) and set proper directory permissions to prevent public directories such as /home from being mounted and prevent container escape caused by tampering by unauthorized users.
- Fields to be configured in the mindie_server.yaml file:
  - replicas: total number of instances.
  - huawei.com/Ascend910(310P): resource request, which specifies the number of 910 (310P) NPUs occupied by an instance. The value must be the same as that specified by worldSize in the config.json file of MindIE Server.
  - image: image name.
  - nodeSelector: selects a user-expected node, which is implemented by node labels.
  - ring-controller.atlas: ascend-910b or ascend-310p, which is determined by the actual device model.
  - startupProbe: startup probe. The default startup time is 500 seconds. If the service fails to be started within the time, the pod automatically restarts. Set a proper startup time as required.
- Fields to be configured in the mindie_ms_coordinator.yaml and mindie_ms_controller.yaml files:
  - image: image name.
  - nodeSelector: selects a user-expected node, which is implemented by node labels.
  Pay special attention to the livenessProbe parameter in the mindie_ms_coordinator.yaml file. For high-concurrency inference requests, the probe may time out and Kubernetes may identify that the pod is not alive. As a result, Kubernetes restarts the Coordinator container. Exercise caution when enabling livenessProbe.
  To enable the fault recovery function of MindIE Controller, the directory specified by name: status-data under the volumes parameter in the mindie_ms_controller.yaml file must exist on the compute node to be deployed (specified by nodeSelector). The mount path of name: status-data under volumeMounts must be /MindIE Motor installation path/logs.
Start the cluster.
Configure the MindIE installation directory in the container. Change the value of MINDIE_USER_HOME_PATH based on the actual installation path during image creation. For example, if the installation path is /xxx/Ascend/mindie, set the value to /xxx.
```
export MINDIE_USER_HOME_PATH={Image installation path}
```
Run the following command to start the cluster:
```
bash deploy.sh
```
After the command is executed, wait until the global ranktable is generated. If the generation is blocked for a long time, press Ctrl+C to interrupt the process, and check the pod status of the cluster for debugging.
- By default, the cluster updates the ConfigMap mounted to the container every 60 seconds. If it takes long to print the message "status of ranktable is not completed" in the container, you can change the interval for synchronizing the ConfigMap by kubelet on each compute node to be scheduled. That is, modify the syncFrequency parameter in the /var/lib/kubelet/config.yaml file to reduce the period to 5 seconds. Note that this modification may affect the cluster performance.
  syncFrequency: 5s
  
  Restart kubelet:
  
  swapoff -a systemctl restart kubelet.service systemctl status kubelet
- Ensure that Docker is configured with the maximum specifications for writing standard output streams to files to prevent pods from being evicted due to full drive space.
  After modifying the Docker configuration file on the compute node where the service is to be deployed, restart Docker.
  
  Open the daemon.json file.
  vim /etc/docker/daemon.json
  
  Add log-opts to the daemon.json file as follows.
  "log-opts":{"max-size":"500m", "max-file":"3"}
  
  Parameters:
  
  max-size=500m indicates that the maximum size of a container log file is 500 MB.
  
  max-file=3 indicates that a container has a maximum of three log files. If the number of log files exceeds three, the log files are automatically rotated.
  
  Restart Docker.
  systemctl daemon-reload systemctl restart docker

Check the cluster status.

Run the kubectl command to check the pod status.

kubectl get pods -n mindie

For example, if four Server instances are started, the following information is displayed:

NAME                                     READY   STATUS    RESTARTS   AGE    IP               NODE       NOMINATED NODE   READINESS GATES
mindie-ms-controller-7845dcd697-h4gw7    1/1     Running   0          145m   xxx.xxx.xxx    ubuntu10   <none>           <none>
mindie-ms-coordinator-6bff995ff8-l6fwz   1/1     Running   0          145m   xxx.xxx.xxx    ubuntu10   <none>           <none>
mindie-server-7b795f8df9-2xvh4           1/1     Running   0          145m   xxx.xxx.xxx   ubuntu     <none>           <none>
mindie-server-7b795f8df9-j4z7d           1/1     Running   0          145m   xxx.xxx.xxx   ubuntu     <none>           <none>
mindie-server-7b795f8df9-v2tcz           1/1     Running   0          145m   xxx.xxx.xxx   ubuntu     <none>           <none>
mindie-server-7b795f8df9-vl9hv           1/1     Running   0          145m   xxx.xxx.xxx   ubuntu     <none>           <none>

mindie-ms-controller: Controller of the cluster management component
mindie-ms-coordinator: Coordinator of the cluster management component
mindie-llm: Server inference service

If the pod is in the Running status, the pod container has been successfully scheduled to a node and started. However, you need to further check whether the service program is started successfully.

You can use the provided log.sh script to query the standard output logs of pods and check whether an exception occurs.

bash log.sh

To query the logs of a specific pod (for example, mindie-server-7b795f8df9-2xvh4), run the following command:
```
kubectl logs mindie-server-7b795f8df9-2xvh4 -n mindie
```

To access the container to search for more information, run the following command:
```
kubectl exec -it mindie-server-7b795f8df9-2xvh4 -n mindie -- bash
```

Use the provided chat.sh script to initiate an inference request.
Change the IP address in chat.sh to the IP address of the host of the cluster management node and set role to user.
```
bash chat.sh
```
Delete the cluster.
To stop the single-node service or modify the service configuration for instance redeployment, run the following command to delete the deployed instance. Then, redeploy an instance by referring to 6.
```
bash delete.sh
```
mindie indicates the namespace created in 1. Replace it as required.

Parent topic: Installation and Deployment