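Configuring Instance-Level Rescheduling for Inference Jobs
Supported Fault Types
Chip faults and software faults
How Rescheduling Works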
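AIBrix generates a PodGroup for each role instance defined in the job YAML. When an instance fails, all Pods in that instance's PodGroup are rescheduled. If podGroupSize is set to 1 for every role, only a single PodGroup is generated; in that case, only the faulty Pod of the affected instance is rescheduled when a fault occurs.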
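Configuring Instance-Level Rescheduling
The following StormService YAML is an example of configuring instance-level rescheduling. The settings to add include the fault-scheduling, pod-rescheduling, and fault-retry-times labels on each role's Pod template.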
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: vllm-1p1d
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: vllm-1p1d
  template:
    metadata:
      labels:
        app: vllm-1p1d
    spec:
      roles:
        - name: prefill
          replicas: 1
          stateful: true
          podGroupSize: 2
          template:
            metadata:
              labels:
                model.aibrix.ai/name: qwen3-8B
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
                fault-scheduling: "force"
                # pod-rescheduling: "on"  # required only when podGroupSize is 1 for every role; not needed when podGroupSize is greater than 1
                fault-retry-times: "10"
            spec:
              schedulerName: volcano  # scheduler to use
              restartPolicy: Never
              nodeSelector:
                accelerator-type: module-910b-8
              containers:
                - name: prefill
                  ...
                  resources:
                    limits:
                      huawei.com/Ascend910: 8  # number of NPUs required
                    requests:
                      huawei.com/Ascend910: 8
                  securityContext:
                    ...
        - name: decode
          replicas: 1
          podGroupSize: 2
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: qwen3-8B
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
                fault-scheduling: "force"
                # pod-rescheduling: "on"  # required only when podGroupSize is 1 for every role; not needed when podGroupSize is greater than 1
                fault-retry-times: "10"
            spec:
              nodeSelector:
                accelerator-type: module-910b-8
              schedulerName: volcano
              restartPolicy: Never
              containers:
                - name: decode
                  ...
                  resources:
                    limits:
                      huawei.com/Ascend910: 8
                    requests:
                      huawei.com/Ascend910: 8
                  securityContext:
                    ...
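The example above uses podGroupSize: 2, so the pod-rescheduling label stays commented out. As a minimal sketch of the other case described in its comments, the fragment below shows how the prefill role might look when podGroupSize is 1 for every role. This variant is an illustration and is not taken from the original example: only podGroupSize and the uncommented pod-rescheduling label differ, the decode role would presumably need the same change, and the rest of the Pod spec is assumed to stay as it is.

```yaml
# Illustrative variant (assumption): every role uses podGroupSize: 1,
# so the pod-rescheduling label is enabled on each role's Pod template.
roles:
  - name: prefill
    replicas: 1
    stateful: true
    podGroupSize: 1
    template:
      metadata:
        labels:
          model.aibrix.ai/name: qwen3-8B
          model.aibrix.ai/port: "8000"
          model.aibrix.ai/engine: vllm
          fault-scheduling: "force"
          pod-rescheduling: "on"   # needed only when podGroupSize is 1 for every role
          fault-retry-times: "10"
      spec:
        schedulerName: volcano
        restartPolicy: Never
        # nodeSelector, containers, and resources unchanged from the example above
```

With this setting, a fault is expected to trigger rescheduling of the faulty Pod only, as described under "How Rescheduling Works", rather than of every Pod in a PodGroup.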