Deploying Inference Jobs Using a Script in One-Click Mode

When multiple associated inference jobs are deployed in a Kubernetes cluster, manually writing and maintaining a large number of Kubernetes YAML files is inefficient and error-prone. To solve this problem, MindCluster provides an automation script that replaces these manual operations. You provide only basic information, such as the application name, image version, and number of replicas, and the script generates all the required Kubernetes YAML files in compliance with specifications and deploys them to the specified cluster. MindCluster also lets you remove all associated resources at once, for example, by specifying their common application name.

The script currently supports only the prefill-decode disaggregation architecture.

Prerequisites

MindCluster and AIBrix components have been installed.
Python has been installed in the environment, and dependency packages can be downloaded online.
The KubeConfig file exists and can be used to communicate with the Kubernetes cluster.

Procedure

Clone the source code from the mindcluster-deploy repository and go to the k8s-deploy-tool directory.
```
git clone https://gitcode.com/Ascend/mindcluster-deploy.git && cd mindcluster-deploy/k8s-deploy-tool
```
(Optional) Create and activate a virtual Python environment. This operation allows different Python projects to use different library versions without interference.
```
python -m venv k8s-deploy-tool && source k8s-deploy-tool/bin/activate
```
Use the python or python3 command, depending on your environment.
Install the dependencies.
```
pip install -r requirements.txt
```
(Optional) Modify the instance startup script according to the specific requirements of your model.
1. Open the example/scripts/start_server.sh file.
```
vi example/scripts/start_server.sh
```
2. Press i to enter the insert mode and modify the parameters of the vLLM startup command, for example, max-model-len and max-num-batched-tokens, according to your model's requirements.
3. Press Esc, type :wq!, and press Enter to save the changes and exit.
(Optional) Copy the startup script to another directory on the host or another node in the cluster. This step can be skipped in single-node environments. If your system utilizes shared storage, place the script in a shared volume and mount it directly to the inference service.
The default proxy script in the scripts folder enables fault isolation. If you do not need this feature, replace it with the native proxy script.
```
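# Copy the scripts to another directory on the local host.
cp example/scripts/* <target_dir>
# Copy the scripts to another node in the cluster.
scp example/scripts/* <user>@<IP>:<target_dir>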
```

(Optional) Edit the YAML template to configure the mount paths of the model and script as required.

1. Open the src/templates/aibrix/stormservice.yaml.j2 file.
```
vi src/templates/aibrix/stormservice.yaml.j2
```
2. Press i to enter the insert mode and change the model storage directory in the container.

    volumeMounts:
      - name: model
        mountPath: /mnt/models
    volumes:                   # Modify the mounted volume.
      - name: model            # Use the actual model storage directory.
        hostPath:
          path: /mnt/models
      - name: scripts          # Use the actual startup script storage directory.
        hostPath:
          path: /scripts

3. Press Esc, type :wq!, and press Enter to save the changes and exit.

Edit the config/stormservice-config.yaml file.
1. Open config/stormservice-config.yaml.
```
vi config/stormservice-config.yaml
```
2. Press i to enter the insert mode and modify the fields in the file as required.
3. Press Esc, type :wq!, and press Enter to save the changes and exit.
Observe the following constraints when modifying the fields:
- The value of dp_size must be an integer multiple of the value of podGroupSize.
- If dp_size is set to 1, distributed_dp must be false. distributed_dp can be set to true only if dp_size is greater than 1.
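The two constraints above can be checked before deployment. The following is a minimal sketch, not part of the deployment tool: the function name check_dp_settings is hypothetical, the values are passed in as arguments because the exact layout of the configuration file is not described here, and the requirement that both sizes be positive is an assumption.

```python
from typing import List


def check_dp_settings(dp_size: int, pod_group_size: int, distributed_dp: bool) -> List[str]:
    """Return a list of constraint violations (empty if the values are consistent).

    Hypothetical helper: the arguments mirror the dp_size, podGroupSize, and
    distributed_dp fields described above.
    """
    errors = []
    # Assumption: both sizes are expected to be positive integers.
    if dp_size <= 0 or pod_group_size <= 0:
        errors.append("dp_size and podGroupSize must be positive integers")
        return errors
    # dp_size must be an integer multiple of podGroupSize.
    if dp_size % pod_group_size != 0:
        errors.append(
            f"dp_size ({dp_size}) must be an integer multiple of "
            f"podGroupSize ({pod_group_size})"
        )
    # distributed_dp may be true only when dp_size is greater than 1.
    if dp_size == 1 and distributed_dp:
        errors.append("distributed_dp must be false when dp_size is 1")
    return errors
```

For example, check_dp_settings(4, 2, True) returns an empty list, whereas check_dp_settings(3, 2, False) reports that dp_size is not a multiple of podGroupSize.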
(Optional) Create a job namespace. In the following example, vllm-test is the value of app_namespace set in config/stormservice-config.yaml. If app_namespace is set to default or is not set, you do not need to create a namespace.
```
kubectl create ns vllm-test
```
Set the service framework to aibrix.
```
export SERVING_FRAMEWORK=aibrix
```
Deploy inference jobs.
```
python main.py deploy -c config/stormservice-config.yaml
```
Use the python or python3 command, depending on your environment. The parameters are described as follows:
- -c, --config: (Mandatory) configuration file path.
- -k, --kubeconfig: (Optional) KubeConfig file path. The default value is ~/.kube/config.
- --dry-run: (Optional) trial run. The generated YAML files are displayed, but nothing is deployed to the cluster.
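One way to combine these parameters is to preview the generated YAML with --dry-run and deploy only after confirming it. The sketch below is an illustration under assumptions: the helper names are hypothetical, it must be run from the k8s-deploy-tool directory, and it assumes main.py exits with a non-zero status on failure. Only the -c, -k, and --dry-run flags listed above are used.

```python
import subprocess
import sys
from typing import List, Optional


def build_deploy_command(config: str,
                         kubeconfig: Optional[str] = None,
                         dry_run: bool = False) -> List[str]:
    """Build the argument list for 'main.py deploy' (hypothetical helper)."""
    # sys.executable is the interpreter running this script, which avoids
    # having to choose between python and python3.
    cmd = [sys.executable, "main.py", "deploy", "-c", config]
    if kubeconfig:
        cmd += ["-k", kubeconfig]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def preview_then_deploy(config: str, kubeconfig: Optional[str] = None) -> None:
    """Show the generated YAML first, then deploy after confirmation."""
    # Assumption: main.py returns a non-zero exit code on failure,
    # so check=True stops the sequence if the trial run fails.
    subprocess.run(build_deploy_command(config, kubeconfig, dry_run=True), check=True)
    answer = input("Deploy these resources? [y/N] ")
    if answer.strip().lower() == "y":
        subprocess.run(build_deploy_command(config, kubeconfig), check=True)
```

Calling preview_then_deploy("config/stormservice-config.yaml") would run the trial first and then ask before deploying.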
Check job running status.
```
python main.py status -n my-test -ns default
```
The parameters are described as follows:
- -n, --app-name: (Mandatory) application name.
- -ns, --namespace: (Optional) application namespace. The default value is default.
- -k, --kubeconfig: (Optional) KubeConfig file path. The default value is ~/.kube/config.
You can also use the kubectl CLI to view the job running status.
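If you check the status with kubectl, the JSON output of a command such as kubectl get pods -n <namespace> -o json can be summarized with a few lines of Python. This is a minimal sketch with a hypothetical helper name; it relies only on the standard items and status.phase fields of a Kubernetes pod list.

```python
import json
from collections import Counter


def count_pod_phases(pod_list_json: str) -> Counter:
    """Count pods by phase (Pending, Running, Succeeded, Failed, Unknown).

    pod_list_json is the text printed by 'kubectl get pods ... -o json'.
    """
    pods = json.loads(pod_list_json).get("items", [])
    # Pods without a reported phase are counted as "Unknown".
    return Counter(pod.get("status", {}).get("phase", "Unknown") for pod in pods)
```

A job can be considered up when every pod of the application is counted under Running.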
Open a terminal window on a node in the Kubernetes cluster and run the following command to access the inference service. If a response is returned successfully, the inference service has been deployed.
```
curl http://<routing-podip>:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<Model name>",
"prompt": "Who are you?",
"max_tokens": 10,
"temperature": 0
}'
```
<routing-podip> indicates the Routing pod's IP address. You can run the following command to view the IP address:
```
kubectl get pod -A -o wide
```
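The lookup and the request can also be scripted. The sketch below is an illustration under assumptions: the helper names are hypothetical, the pod list is expected in JSON form (kubectl get pod -A -o json), and the name fragment used to recognize the Routing pod must be supplied by you, because the actual pod naming is not specified here. The request body mirrors the curl example above.

```python
import json
import urllib.request
from typing import List


def find_pod_ips(pod_list_json: str, name_fragment: str) -> List[str]:
    """Return the IP addresses of pods whose name contains name_fragment."""
    ips = []
    for pod in json.loads(pod_list_json).get("items", []):
        name = pod.get("metadata", {}).get("name", "")
        ip = pod.get("status", {}).get("podIP")  # absent until the pod is scheduled
        if name_fragment in name and ip:
            ips.append(ip)
    return ips


def build_completion_request(pod_ip: str, model: str,
                             prompt: str = "Who are you?") -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 10, "temperature": 0}
    return urllib.request.Request(
        url=f"http://{pod_ip}:8080/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Pass one of the addresses returned by find_pod_ips, together with your model name, to build_completion_request, and then call send_request from a node that can reach the pod network.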
(Optional) Delete inference jobs by running the following command.
```
python main.py delete -n my-test -ns default
```
The parameters are described as follows:
- -n, --app-name: (Mandatory) application name.
- -ns, --namespace: (Optional) application namespace. The default value is default.
- -k, --kubeconfig: (Optional) KubeConfig file path. The default value is ~/.kube/config.
