Rescheduling an Inference Job Instance
Supported Fault Types
Processor and software faults
Rescheduling Principles
AIBrix generates a pod group for each role instance defined in the job YAML file. When an instance fails, all pods in that instance's pod group are rescheduled. If podGroupSize is set to 1 for all instances, only one pod group is generated. In that case, when a fault occurs, only the failed pods of the affected instance are rescheduled.
Configuring Instance-Level Rescheduling
The following StormService YAML shows how to configure instance-level rescheduling. The rescheduling-related settings are the fault-scheduling, pod-rescheduling, and fault-retry-times labels under each role's template metadata:
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: vllm-1p1d
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: vllm-1p1d
  template:
    metadata:
      labels:
        app: vllm-1p1d
    spec:
      roles:
        - name: prefill
          replicas: 1
          stateful: true
          podGroupSize: 2
          template:
            metadata:
              labels:
                model.aibrix.ai/name: qwen3-8B
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
                fault-scheduling: "force"
                #pod-rescheduling: "on"    # Required only when podGroupSize = 1.
                fault-retry-times: "10"
            spec:
              schedulerName: volcano    # Specifies the scheduler.
              restartPolicy: Never
              nodeSelector:
                accelerator-type: module-910b-8
              containers:
                - name: prefill
                  ...
                  resources:
                    limits:
                      huawei.com/Ascend910: 8    # Number of NPUs required.
                    requests:
                      huawei.com/Ascend910: 8
                  securityContext:
                    ...
        - name: decode
          replicas: 1
          podGroupSize: 2
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: qwen3-8B
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
                fault-scheduling: "force"
                #pod-rescheduling: "on"    # Required only when podGroupSize = 1.
                fault-retry-times: "10"
            spec:
              nodeSelector:
                accelerator-type: module-910b-8
              schedulerName: volcano
              restartPolicy: Never
              containers:
                - name: decode
                  ...
                  resources:
                    limits:
                      huawei.com/Ascend910: 8
                    requests:
                      huawei.com/Ascend910: 8
                  securityContext:
                    ...
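The example's inline comment says the pod-rescheduling label is required only when podGroupSize is 1. The fragment below is a minimal sketch of that single-pod case for the prefill role. It assumes that the only changes are setting podGroupSize to 1 and uncommenting the pod-rescheduling label; the fragment belongs under spec.template.spec of the StormService, and all fields not shown are assumed to stay as in the full example.

```yaml
# Sketch only: prefill role with podGroupSize = 1.
# Assumption: all omitted fields are unchanged from the full example.
roles:
  - name: prefill
    replicas: 1
    stateful: true
    podGroupSize: 1
    template:
      metadata:
        labels:
          fault-scheduling: "force"
          pod-rescheduling: "on"    # Enabled because podGroupSize = 1.
          fault-retry-times: "10"
```

The decode role would presumably take the same label change if its podGroupSize is also set to 1.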