Rescheduling an Inference Job Instance
If a node fault, processor fault, or other fault occurs during an inference job, the MindCluster cluster scheduling components isolate the faulty resource and automatically trigger rescheduling. For details about the fault detection principles, see Fault Detection.
Prerequisite
You have deployed the OME-based SGLang inference service.
Instance Rescheduling Principles
Deleting Faulty Instance Pods
Deployment as the OME sub-workload (a prefill/decode instance consists of one pod):
- Service plane fault: If the pod's container exits with a non-zero code, the container is automatically restarted.
- Hardware fault: After Ascend Device Plugin or NodeD reports a hardware fault to ClusterD, Volcano obtains the faulty node, deletes the pod on the node, and isolates the faulty node.
LeaderWorkerSet as the OME sub-workload (a prefill/decode instance consists of multiple pods):
- Service plane fault: If a container in any pod that belongs to an instance exits with a non-zero code, LWS Controller automatically deletes the PodGroup of the instance.
- Hardware fault: After Ascend Device Plugin or NodeD reports a hardware fault to ClusterD, Volcano obtains the faulty node, deletes the pod on the node, and isolates the faulty node. LWS Controller automatically deletes the PodGroup of the instance.
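The four cases above can be summarized as a small lookup. The following Python sketch is only a model of the behavior described in this section, not code from MindCluster, OME, Volcano, or LWS; the enum names, the function name recovery_actions, and the action strings are all hypothetical and exist purely for illustration.

```python
from enum import Enum


class SubWorkload(Enum):
    DEPLOYMENT = "Deployment"              # one pod per prefill/decode instance
    LEADER_WORKER_SET = "LeaderWorkerSet"  # multiple pods per instance


class Fault(Enum):
    SERVICE_PLANE = "service-plane"  # container exits with a non-zero code
    HARDWARE = "hardware"            # reported by Ascend Device Plugin or NodeD


def recovery_actions(workload: SubWorkload, fault: Fault) -> list[str]:
    """Return the cleanup steps described above, in order (hypothetical helper)."""
    actions = []
    if fault is Fault.HARDWARE:
        # Hardware faults are handled the same way for both sub-workloads.
        actions += [
            "Volcano deletes the pod on the faulty node",
            "Volcano isolates the faulty node",
        ]
    elif workload is SubWorkload.DEPLOYMENT:
        # Service plane fault in a single-pod instance: restart in place.
        actions.append("container is restarted automatically")
    if workload is SubWorkload.LEADER_WORKER_SET:
        # Multi-pod instances are always torn down as a whole.
        actions.append("LWS Controller deletes the PodGroup of the instance")
    return actions


if __name__ == "__main__":
    for workload in SubWorkload:
        for fault in Fault:
            print(workload.value, fault.value, recovery_actions(workload, fault))
```

The point the sketch makes explicit is that a LeaderWorkerSet instance is removed as a whole for either fault type, whereas a Deployment instance loses its pod only on a hardware fault.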
Re-creating and Scheduling Faulty Instance Pods
During fault recovery of OME jobs, only the faulty prefill/decode instance is rescheduled.
Configuring Instance-Level Rescheduling
The following uses ClusterServingRuntime as an example to describe how to configure instance-level rescheduling.
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: lws-runtime
  annotations:
    sp-block: "16"
  labels:
    fault-scheduling: "force"   # Enable rescheduling.
    pod-rescheduling: "on"      # Enable pod-level rescheduling.
    fault-retry-times: "3"      # Enable unconditional retry for service plane faults.
spec:
  ...
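Before applying such a manifest, it can be useful to confirm that the three rescheduling labels are present. The sketch below is a hypothetical pre-flight check, not part of MindCluster or OME: the function name and the constant are invented for illustration, and the expected values are simply the ones shown in the example above (other values may also be valid). It operates on a manifest that has already been parsed into a Python dict.

```python
# Expected values copied from the example manifest; assumed, not authoritative.
EXPECTED_LABELS = {
    "fault-scheduling": "force",
    "pod-rescheduling": "on",
}


def check_rescheduling_labels(manifest: dict) -> list[str]:
    """Return the label keys that are missing or differ from the example."""
    labels = (manifest.get("metadata") or {}).get("labels") or {}
    problems = []
    for key, expected in EXPECTED_LABELS.items():
        if labels.get(key) != expected:
            problems.append(key)
    # fault-retry-times is a quoted non-negative integer in the example.
    retry = labels.get("fault-retry-times")
    if retry is None or not str(retry).isdigit():
        problems.append("fault-retry-times")
    return problems


if __name__ == "__main__":
    example = {
        "apiVersion": "ome.io/v1beta1",
        "kind": "ClusterServingRuntime",
        "metadata": {
            "name": "lws-runtime",
            "labels": {
                "fault-scheduling": "force",
                "pod-rescheduling": "on",
                "fault-retry-times": "3",
            },
        },
    }
    print(check_rescheduling_labels(example))
```

An empty list means all three labels match the example; any returned key points at a label to review.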