Preparing a Job YAML File

Complete the image preparation as required, then select a YAML example and modify it for your environment.

Prerequisites

The inference image has been prepared. For details about how to obtain the vLLM inference image, see the official vLLM-Ascend documentation.

YAML Selection

Currently, AIBrix-based vLLM-Ascend inference jobs are deployed through StormService, a custom resource defined by a CRD. For details about how to use and deploy StormService, see the AIBrix StormService documentation. For a StormService YAML example, see the examples provided with AIBrix.

All AIBrix examples are natively configured for GPU environments. If you use NPUs, these examples must be adapted accordingly. The following provides a reference for NPU adaptation, which can be tailored to your specific requirements.

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: "my-test"
  namespace: "default"
spec:
  replicas: 1               # Fixed to 1.
  updateStrategy:
    type: "InPlaceUpdate"
  stateful: true
  selector:
    matchLabels:
      app: "my-test"
  template:
    metadata:
      labels:
        app: "my-test"
    spec:
      roles:
        - name: "prefill"          # Prefill definition
          replicas: 1           # Number of prefill instances
          podGroupSize: 1       # Number of pods in each prefill instance
          stateful: true        # Fixed to true.
          template:
            metadata:
              labels:
                model.aibrix.ai/name: "qwen3-moe"  # Label required by AIBrix. Set the value as needed.
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: "vllm"
                fault-scheduling: "force"          # Enable rescheduling.
                pod-rescheduling: "on"          # If podGroupSize is set to 1, pod-rescheduling must be set to on. If podGroupSize is greater than 1, remove this parameter.
              annotations:
                # Minimum number of schedulable replicas in the gang scheduling policy.
                # Scheduling in StormService is governed by the PodGroup logic:
                # - Instances with podGroupSize = 1 form one PodGroup for scheduling. The value ranges
                #   from 1 to the sum of all instance replicas (the upper bound is the recommended value).
                # - Each instance with podGroupSize > 1 forms an independent PodGroup. The value ranges
                #   from 1 to podGroupSize (the upper bound is the recommended value).
                # For example, for a prefill instance with podGroupSize = 1 and a decode instance with
                # podGroupSize = 2, the minimum number of schedulable replicas of the prefill instance
                # is its number of replicas, and that of the decode instance is its podGroupSize.
                huawei.com/schedule_minAvailable: "1"
                huawei.com/recover_policy_path: "pod"  # Path for job execution recovery when pod-rescheduling is set to on. If this parameter is set to "pod", job-level rescheduling is not triggered when pod-level rescheduling fails. Because each pod in the current PodGroup is an independent instance, fault handling cannot be propagated to other instances.
            spec:
              schedulerName: volcano          # Set the scheduler to Volcano.
              nodeSelector:
                accelerator-type: "module-a3-16-super-pod" # Set it based on the hardware form.
              containers:
                - name: prefill
                  image: vllm-ascend:xxx       # Image name
                  ...
                  resources:
                    limits:
                      "huawei.com/Ascend910": 16 # Number of NPUs
                    requests:
                      "huawei.com/Ascend910": 16
        ...
        - name: decode      # Decode definition
          replicas: 1           # Number of decode instances
          podGroupSize: 2       # Number of pods in each decode instance
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: "qwen3-moe"
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: "vllm"
                fault-scheduling: "force"    # Enable rescheduling.
              annotations:
                huawei.com/schedule_minAvailable: "2" # For details, see the prefill instance parameter description.
            spec:
              schedulerName: volcano
              nodeSelector:
                accelerator-type: "module-a3-16-super-pod"
              containers:
                - name: decode
                  image: vllm-ascend:xxx
                  ...
                  resources:
                    limits:
                      "huawei.com/Ascend910": 16 # Number of NPUs
                    requests:
                      "huawei.com/Ascend910": 16
        ...
        - name: routing    # Routing definition
          replicas: 1     # Number of routing replicas
          stateful: true
          template:
            spec:
              containers:
                - name: router
                  image: xxx:yyy   # Routing image
                  ...
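
The prefill comments above describe a different label and annotation setup when podGroupSize is greater than 1. As a minimal sketch of that case, the fragment below shows how the prefill role's metadata might look if podGroupSize were raised to 2: the pod-rescheduling label is removed and huawei.com/schedule_minAvailable is set to podGroupSize. The value 2 is illustrative only. Leaving out huawei.com/recover_policy_path is an assumption, made because that annotation is described only for the case where pod-rescheduling is set to on. The spec part of the role is not shown and stays as in the example above.

```yaml
# Hypothetical variant of the prefill role with more than one pod per instance.
- name: "prefill"
  replicas: 1
  podGroupSize: 2                  # Illustrative value greater than 1
  stateful: true
  template:
    metadata:
      labels:
        model.aibrix.ai/name: "qwen3-moe"
        model.aibrix.ai/port: "8000"
        model.aibrix.ai/engine: "vllm"
        fault-scheduling: "force"  # Enable rescheduling.
        # pod-rescheduling is removed because podGroupSize is greater than 1.
      annotations:
        # Upper bound of the range 1 to podGroupSize.
        huawei.com/schedule_minAvailable: "2"
        # Assumption: huawei.com/recover_policy_path is omitted here because it is
        # described only for the case where pod-rescheduling is set to on.
```

This mirrors the decode role in the reference example, which also uses podGroupSize: 2 with huawei.com/schedule_minAvailable: "2" and no pod-rescheduling label.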