Preparing a Job YAML File
Prepare the inference image, select a YAML example, and modify it to match your environment.
Prerequisites
The inference image has been prepared. For details about how to obtain the vLLM inference image, see the official vLLM-Ascend documentation.
YAML Selection
Currently, AIBrix-based vLLM-Ascend inference jobs are deployed through StormService, a custom resource defined by a CRD. For details about how to use and deploy StormService, and for the StormService YAML example, see the AIBrix StormService documentation.
All AIBrix examples are written for GPU environments and must be adapted before they can run on NPUs. The following example is a reference for NPU adaptation; tailor it to your specific requirements.
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: "my-test"
  namespace: "default"
spec:
  replicas: 1                        # Must be 1.
  updateStrategy:
    type: "InPlaceUpdate"
  stateful: true
  selector:
    matchLabels:
      app: "my-test"
  template:
    metadata:
      labels:
        app: "my-test"
    spec:
      roles:
        - name: "prefill"            # Prefill definition
          replicas: 1                # Number of prefill replicas
          podGroupSize: 1            # Number of pods in each prefill replica
          stateful: true             # Must be true.
          template:
            metadata:
              labels:
                model.aibrix.ai/name: "qwen3-moe"   # Label required by AIBrix. Set it as required.
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: "vllm"
                fault-scheduling: "force"           # Enable rescheduling.
                # If podGroupSize is 1, pod-rescheduling must be set to "on".
                # If podGroupSize is greater than 1, remove this label.
                pod-rescheduling: "on"
              annotations:
                # Minimum number of replicas that must be schedulable under the gang scheduling policy.
                # Scheduling in StormService follows the PodGroup logic:
                # - Instances with podGroupSize = 1 form one PodGroup for scheduling. The value ranges
                #   from 1 to the sum of all instance replicas (the upper bound is recommended).
                # - Each instance with podGroupSize > 1 forms an independent PodGroup. The value ranges
                #   from 1 to podGroupSize (the upper bound is recommended).
                # Example: for a prefill instance with podGroupSize = 1 and a decode instance with
                # podGroupSize = 2, the minimum for the prefill instance is its number of replicas,
                # and the minimum for the decode instance is its podGroupSize.
                huawei.com/schedule_minAvailable: "1"
                # Recovery path used when pod-rescheduling is set to "on". With "pod", job-level
                # rescheduling is not triggered when pod-level rescheduling fails. Each pod in the
                # current PodGroup is an independent instance, so fault handling cannot be
                # propagated to other instances.
                huawei.com/recover_policy_path: "pod"
            spec:
              schedulerName: volcano                # Set the scheduler to Volcano.
              nodeSelector:
                accelerator-type: "module-a3-16-super-pod"   # Set it based on the hardware form.
              containers:
                - name: prefill
                  image: vllm-ascend:xxx            # Image name
                  ...
                  resources:
                    limits:
                      "huawei.com/Ascend910": 16    # Number of NPUs
                    requests:
                      "huawei.com/Ascend910": 16
                  ...
        - name: decode               # Decode definition
          replicas: 1                # Number of decode replicas
          podGroupSize: 2            # Number of pods in each decode replica
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: "qwen3-moe"
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: "vllm"
                fault-scheduling: "force"           # Enable rescheduling.
              annotations:
                huawei.com/schedule_minAvailable: "2"   # For details, see the prefill instance parameter description.
            spec:
              schedulerName: volcano
              nodeSelector:
                accelerator-type: "module-a3-16-super-pod"
              containers:
                - name: decode
                  image: vllm-ascend:xxx
                  ...
                  resources:
                    limits:
                      "huawei.com/Ascend910": 16    # Number of NPUs
                    requests:
                      "huawei.com/Ascend910": 16
                  ...
        - name: routing              # Routing definition
          replicas: 1                # Number of routing replicas
          stateful: true
          template:
            spec:
              containers:
                - name: router
                  image: xxx:yyy     # Routing image
                  ...
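The rules stated in the comments for huawei.com/schedule_minAvailable and pod-rescheduling can be checked before the YAML file is submitted. The following Python sketch is only an illustration of those rules, not part of AIBrix or of the scheduler: the function names and the flattened role dictionaries (keys name, replicas, podGroupSize, labels) are hypothetical, and the sketch assumes that the recommended value is the upper bound of the range described above.

```python
from typing import Dict, List


def recommended_min_available(replicas: int, pod_group_size: int) -> int:
    """Recommended huawei.com/schedule_minAvailable for one role.

    Assumption taken from the comments in the example: with podGroupSize = 1
    the recommended value is the number of replicas; with podGroupSize > 1 it
    is podGroupSize.
    """
    if replicas < 1 or pod_group_size < 1:
        raise ValueError("replicas and podGroupSize must be at least 1")
    return replicas if pod_group_size == 1 else pod_group_size


def check_pod_rescheduling(role: Dict) -> List[str]:
    """Return the label problems found in one role (hypothetical helper)."""
    problems = []
    size = role.get("podGroupSize", 1)
    labels = role.get("labels", {})
    if size == 1 and labels.get("pod-rescheduling") != "on":
        problems.append(f"{role['name']}: podGroupSize is 1, set pod-rescheduling to on")
    if size > 1 and "pod-rescheduling" in labels:
        problems.append(f"{role['name']}: podGroupSize > 1, remove pod-rescheduling")
    return problems


# Flattened view of the prefill and decode roles from the example above.
roles = [
    {"name": "prefill", "replicas": 1, "podGroupSize": 1,
     "labels": {"fault-scheduling": "force", "pod-rescheduling": "on"}},
    {"name": "decode", "replicas": 1, "podGroupSize": 2,
     "labels": {"fault-scheduling": "force"}},
]

for role in roles:
    value = recommended_min_available(role["replicas"], role["podGroupSize"])
    print(role["name"], "schedule_minAvailable:", value, check_pod_rescheduling(role))
```

For the roles in the example, the sketch yields 1 for prefill and 2 for decode, which matches the annotation values in the YAML file, and it reports no label problems.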