Rescheduling an Inference Job Instance

If a node, processor, or other fault occurs in an inference job, the MindCluster cluster scheduling components can isolate the faulty resource and automatically trigger rescheduling. For details about the fault detection principles, see Fault Detection.

Prerequisite

You have deployed the OME-based SGLang inference service.

Instance Rescheduling Principles

Deleting Faulty Instance Pods

Deployment as the OME sub-workload (a prefill/decode instance consists of one pod):

  • Service plane fault: If the pod's container exits with a non-zero code, it is automatically restarted.
  • Hardware fault: After Ascend Device Plugin or NodeD reports a hardware fault to ClusterD, Volcano obtains the faulty node, deletes the pod on the node, and isolates the faulty node.

LeaderWorkerSet as the OME sub-workload (a prefill/decode instance consists of more than pods):

  • Service plane fault: After the pod's container to which any instance belongs exits with a non-zero code, LWS Controller automatically deletes the PodGroup of the instance.
  • Hardware fault: After Ascend Device Plugin or NodeD reports a hardware fault to ClusterD, Volcano obtains the faulty node, deletes the pod on the node, and isolates the faulty node. LWS Controller automatically deletes the PodGroup of the instance.

Re-creating and Scheduling Faulty Instance Pods

After the pod to which the Deployment or LeaderWorkerSet belongs is deleted by Volcano, the corresponding Controller creates the deleted pod again, and Volcano reschedules the restored pod.

During fault recovery of OME jobs, only the faulty prefill/decode instance is rescheduled.

Configuring Instance-Level Rescheduling

The following uses ClusterServingRuntime as an example to describe how to configure instance-level rescheduling.

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: lws-runtime
  annotations:
    sp-block: "16"
  labels:
    fault-scheduling: "force"          # Enable rescheduling.
   pod-rescheduling: "on"           # Enable pod-level rescheduling.
    fault-retry-times: "3"             # Enable unconditional retry for service plane faults.
spec:
...