Deploying Inference Jobs Using a Script in One-Click Mode

When multiple associated inference jobs are deployed in a Kubernetes cluster, manually writing and maintaining a large number of Kubernetes YAML files is inefficient and error-prone. To solve this problem, MindCluster provides an automation script that replaces these manual operations. You provide only basic information, such as the application name, image version, and number of replicas, and the script generates all required specification-compliant Kubernetes YAML files and deploys them to the specified cluster. The script can also remove all associated resources at once when you specify their common application name.

Currently, the script supports only prefill-decode disaggregated deployment. A single run can start multiple prefill and decode instances, routers, and Memfabric_Store servers.

Prerequisites

- Python has been installed in the environment, and the environment has network access for downloading dependency packages.
- A kubeconfig file exists and can be used to communicate with the Kubernetes cluster.
- MindCluster and OME have been deployed.
- The required Base Model and Serving Runtime resources have been deployed.
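
The first two prerequisites can be checked from the shell before you start. The following is a minimal sketch, not part of the MindCluster tooling: the helper name check_tools is hypothetical, and the tool list (git, kubectl, curl) is only inferred from the commands used in the procedure below.

```shell
# Print one line per tool and return non-zero if any tool is missing from PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools that the procedure relies on (assumed list).
check_tools git kubectl curl || echo "Install the missing tools before continuing."

# If kubectl is available, confirm that the current kubeconfig can reach the cluster.
if command -v kubectl >/dev/null 2>&1; then
    if kubectl cluster-info --request-timeout=5s >/dev/null 2>&1; then
        echo "ok: cluster reachable"
    else
        echo "cluster not reachable with the current kubeconfig"
    fi
fi
```

This sketch does not verify that MindCluster, OME, the Base Model, or the Serving Runtime resources are deployed. Check those as described in their own deployment guides.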

Procedure

Clone the mindcluster-deploy repository and go to the k8s-deploy-tool directory.
```
git clone https://gitcode.com/Ascend/mindcluster-deploy.git && cd mindcluster-deploy/k8s-deploy-tool
```
(Optional) Create and activate a Python virtual environment. A virtual environment allows different Python projects to use different library versions without interfering with each other.
```
python -m venv venv && source venv/bin/activate
```
Use the python or python3 command, depending on which one is available in your environment.
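If you are unsure which interpreter command is available, a small helper can pick one for you. This is a sketch only. The function name pick_interpreter and the PYTHON variable are hypothetical conveniences, not something the deployment tool defines.

```shell
# Print the first candidate command found on PATH; return non-zero if none exists.
pick_interpreter() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer python3 and fall back to python.
if PYTHON=$(pick_interpreter python3 python); then
    echo "Using interpreter: $PYTHON"
else
    echo "No Python interpreter found on PATH" >&2
fi
```

You can then substitute "$PYTHON" for python in the commands of this procedure, for example "$PYTHON" -m venv venv.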
Install the dependencies.
```
pip install -r requirements.txt
```
(Optional) Deploy the example Serving Runtime resources (for testing only). For other scenarios, deploy the Serving Runtime that matches your job requirements.
```
kubectl apply -f example/ome-runtimes/
```
Edit the config/isvc-config.yaml file.
1. Open config/isvc-config.yaml.
```
vi config/isvc-config.yaml
```
2. Press i to enter the insert mode and modify the fields in the file as required.
3. Press Esc, type :wq!, and press Enter to save the changes and exit.
(Optional) Create a namespace for the job. Replace xxx with the value of app_namespace in config/isvc-config.yaml. If app_namespace is set to default or is not set, skip this step.
```
kubectl create ns xxx
```
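The decision of whether a namespace is needed can also be scripted. The sketch below assumes that app_namespace appears in config/isvc-config.yaml as a simple "key: value" line. The actual layout of the file may differ, and the helper names yaml_value and needs_namespace are hypothetical.

```shell
# Print the value of the first simple "key: value" line for the given key.
# Trailing comments and surrounding quotes are removed. Not a full YAML parser.
yaml_value() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*\([^#]*\).*/\1/p" "$2" \
        | sed -e 's/[[:space:]]*$//' -e "s/^[\"']//" -e "s/[\"']\$//" \
        | head -n 1
}

# Succeed only when the namespace is set and is not "default".
needs_namespace() {
    [ -n "$1" ] && [ "$1" != "default" ]
}

CONFIG=config/isvc-config.yaml
if [ -f "$CONFIG" ]; then
    ns=$(yaml_value app_namespace "$CONFIG")
    if needs_namespace "$ns"; then
        kubectl create ns "$ns"
    else
        echo "Using the default namespace; nothing to create."
    fi
fi
```

Because the sed expressions only handle flat "key: value" lines, verify the extracted value against the file before relying on it.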
(Optional) Set the serving framework through the SERVING_FRAMEWORK environment variable. Currently, ome and aibrix are supported. If the variable is not set, ome is used by default.
```
export SERVING_FRAMEWORK=ome
```
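To catch a mistyped framework name before deployment, you can validate the variable in the shell first. This is a shell-side sketch under the assumption that the values are the lowercase names listed above. The helper name resolve_framework is hypothetical, and the check does not reflect how the script itself validates the variable.

```shell
# Print the framework to use: the given value, or ome when the value is empty.
# Return non-zero for anything other than the two documented values.
resolve_framework() {
    framework=${1:-ome}
    case "$framework" in
        ome|aibrix) echo "$framework" ;;
        *) echo "unsupported serving framework: $framework" >&2; return 1 ;;
    esac
}

if SERVING_FRAMEWORK=$(resolve_framework "${SERVING_FRAMEWORK:-}"); then
    export SERVING_FRAMEWORK
    echo "SERVING_FRAMEWORK=$SERVING_FRAMEWORK"
fi
```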
Deploy inference jobs.
```
python main.py deploy -c config/isvc-config.yaml
```
Use the python or python3 command, depending on your environment. The parameters are described as follows:
- -c, --config: (Mandatory) configuration file path.
- -k, --kubeconfig: (Optional) KubeConfig file path. The default value is ~/.kube/config.
- --dry-run: (Optional) dry run. The script displays the generated YAML files without actually deploying them.
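A common pattern is to review the generated YAML with --dry-run and deploy only after confirming it. The following sketch wraps the two documented commands. The function name preview_then_deploy, the PYTHON variable, and the confirmation prompt are illustrative assumptions.

```shell
# Interpreter command; override with PYTHON=python3 if needed.
PYTHON=${PYTHON:-python}

# Run a dry run first, then deploy only if the user answers y or Y.
preview_then_deploy() {
    "$PYTHON" main.py deploy -c "$1" --dry-run || return 1
    printf 'Deploy with this configuration? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) "$PYTHON" main.py deploy -c "$1" ;;
        *) echo "Deployment skipped." ;;
    esac
}
```

Call it from the k8s-deploy-tool directory, for example preview_then_deploy config/isvc-config.yaml.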
Check the job running status.
```
python main.py status -n my-test -ns default
```
The parameters are described as follows:
- -n, --app-name: (Mandatory) application name. my-test is app_name set in config/isvc-config.yaml.
- -ns, --namespace: (Optional) application namespace. The default value is default.
- -k, --kubeconfig: (Optional) KubeConfig file path. The default value is ~/.kube/config.
You can also use the kubectl CLI to view the job running status.
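As one way to do this with kubectl, the sketch below lists the pods in a namespace whose names contain the application name. It assumes that the generated pod names include app_name, which may not hold for every resource. The function name app_pods is hypothetical.

```shell
# List name, phase, and IP of pods in a namespace whose names contain the app name.
# $1: application name, $2: namespace (defaults to "default").
app_pods() {
    kubectl get pods -n "${2:-default}" --no-headers \
        -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,IP:.status.podIP \
        | awk -v app="$1" 'index($1, app) > 0'
}
```

For example, app_pods my-test default prints one line per matching pod.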
Open a new terminal window on a node in the Kubernetes cluster and run the following command to access the inference service. If the request returns a response, the inference service has been deployed successfully.
```
curl --location 'http://<router-podip>:<router-port>/generate' --header 'Content-Type: application/json' --data '{
"text": "Who are you",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 20
},
"stream": true
}'
```
- <router-podip> indicates the Router pod's IP address. You can run the following command to view the IP address:
```
kubectl get pod -A -o wide
```
- <router-port> indicates the service port set for the Router in Serving Runtime.
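The lookup and the request can be combined into two small helpers. This is a sketch under stated assumptions: it assumes the Router pod's name contains a recognizable pattern such as "router", which you should confirm against your own pod list. The function names pod_ip and send_generate are hypothetical. The request body is the same as in the command above, and the curl -N option only disables output buffering so that streamed chunks are printed as they arrive.

```shell
# Print the IP of the first pod (in any namespace) whose name contains the pattern.
pod_ip() {
    kubectl get pods -A --no-headers \
        -o custom-columns=NAME:.metadata.name,IP:.status.podIP \
        | awk -v pat="$1" 'index($1, pat) > 0 && $2 != "<none>" { print $2; exit }'
}

# Send the sample generate request. $1: Router pod IP, $2: Router service port.
send_generate() {
    curl -N --location "http://$1:$2/generate" \
        --header 'Content-Type: application/json' \
        --data '{"text": "Who are you", "sampling_params": {"temperature": 0, "max_new_tokens": 20}, "stream": true}'
}
```

For example, run send_generate "$(pod_ip router)" <router-port>, replacing <router-port> with the port set for the Router in Serving Runtime.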
(Optional) Delete inference jobs by running the following command.
```
python main.py delete -n my-test
```
Use the python or python3 command, depending on your environment. The parameters are described as follows:
- -n, --app-name: (Mandatory) application name.
- -ns, --namespace: (Optional) application namespace. The default value is default.
- -k, --kubeconfig: (Optional) KubeConfig file path. The default value is ~/.kube/config.
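After deletion, you may want to confirm that no pods of the application remain. The sketch below runs the documented delete command and then lists leftover pods with kubectl. It assumes that pod names contain the application name. The function name delete_and_check and the PYTHON variable are hypothetical.

```shell
# Interpreter command; override with PYTHON=python3 if needed.
PYTHON=${PYTHON:-python}

# Delete the application, then report pods whose names still contain its name.
# $1: application name, $2: namespace (defaults to "default").
delete_and_check() {
    app=$1
    ns=${2:-default}
    "$PYTHON" main.py delete -n "$app" -ns "$ns" || return 1
    remaining=$(kubectl get pods -n "$ns" --no-headers \
        -o custom-columns=NAME:.metadata.name \
        | awk -v app="$app" 'index($1, app) > 0')
    if [ -z "$remaining" ]; then
        echo "No pods left for $app in namespace $ns."
    else
        echo "Pods still terminating or remaining:"
        echo "$remaining"
    fi
}
```

Pods can take some time to terminate, so a non-empty list immediately after deletion does not necessarily indicate a failure.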
