Rescheduling an Inference Job Instance

Prerequisites

The AIBrix service has been deployed. For details, see the AIBrix documentation.

Supported Fault Types

Processor and software faults

Rescheduling Principles

AIBrix generates pod groups based on the role instances defined in the job YAML file. When an instance fails, all pods in that instance's pod group are rescheduled. If podGroupSize is set to 1 for all instances, only one pod group is generated; in that case, when a fault occurs, the failed pods of the corresponding instance are rescheduled.

Configuring Instance-Level Rescheduling

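The following uses a StormService YAML file as an example to describe how to configure instance-level rescheduling. In particular, add the rescheduling-related labels shown in each role's pod template (fault-scheduling, fault-retry-times, and, when podGroupSize is 1, pod-rescheduling):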

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: vllm-1p1d
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: vllm-1p1d
  template:
    metadata:
      labels:
        app: vllm-1p1d
    spec:
      roles:
        - name: prefill
          replicas: 1
          stateful: true
          podGroupSize: 2
          template:
            metadata:
              labels:
                model.aibrix.ai/name: qwen3-8B
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
                fault-scheduling: "force"
                #pod-rescheduling: "on"   # Required only when podGroupSize = 1.
                fault-retry-times: "10"
            spec:
              schedulerName: volcano # Specifies the scheduler.
              restartPolicy: Never
              nodeSelector:
                accelerator-type: module-910b-8
              containers:
                - name: prefill
...
                  resources:
                    limits:
                      huawei.com/Ascend910: 8 # Number of NPUs required.
                    requests:
                      huawei.com/Ascend910: 8
                  securityContext:
...
        - name: decode
          replicas: 1
          podGroupSize: 2
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: qwen3-8B
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
                fault-scheduling: "force"
                #pod-rescheduling: "on"   # Required only when podGroupSize = 1.
                fault-retry-times: "10"
            spec:
              nodeSelector:
                accelerator-type: module-910b-8
              schedulerName: volcano
              restartPolicy: Never
              containers:
                - name: decode
...
                  resources:
                    limits:
                      huawei.com/Ascend910: 8
                    requests:
                      huawei.com/Ascend910: 8
                  securityContext:
...
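
The comments in the example above state that the pod-rescheduling label is required only when podGroupSize is 1. As a minimal sketch of that case, the fragment below shows the prefill role with podGroupSize set to 1 and the label uncommented. It assumes that the other labels and the rest of the role definition stay as in the example above. It is a fragment of the roles list, not a complete StormService manifest.

```yaml
        - name: prefill
          replicas: 1
          stateful: true
          podGroupSize: 1              # Assumed here to illustrate the podGroupSize = 1 case.
          template:
            metadata:
              labels:
                fault-scheduling: "force"
                pod-rescheduling: "on"   # Uncommented because podGroupSize is 1.
                fault-retry-times: "10"
```

According to the rescheduling principles above, podGroupSize must be 1 for all instances for this case to apply, so the decode role would need the same change.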
