Rescheduling an Inference Job

If a node fault, processor fault, or other fault occurs during an inference job, the MindCluster cluster scheduling components can isolate the faulty resource and automatically trigger rescheduling. For details about the fault detection principles, see Fault Detection.

Prerequisite

You have completed the operations described in Deploying MindIE Motor.

Supported Fault Types

MindIE Server: node, processor, or other faults
MindIE MS: node faults

Rescheduling Principles

Job-level rescheduling: supported by MindIE Server and MindIE MS. When a MindIE Server or MindIE MS instance is faulty, all pods of that instance are stopped, re-created, and rescheduled, and the latest global-ranktable.json is pushed to MS Controller to restart the inference job.
For example, in the prefill-decode disaggregation scenario where MindIE Server contains one prefill instance and one decode instance, if the prefill instance is faulty, only the pods of the prefill instance are stopped. Other instances that are running properly are not affected.
Pod-level rescheduling: supported only by MindIE MS. In the active/standby switchover scenario, MS Controller or MS Coordinator runs more than one pod. If a node is faulty, only the pods on that node are stopped. For example, when both an active and a standby MS Coordinator exist and the active MS Coordinator is faulty, only the pod of the active MS Coordinator is stopped, and the standby MS Coordinator is not affected.

If pod-level rescheduling fails, job-level rescheduling is used.
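
The relationship between the two levels can be illustrated with a small Python model. This is only a sketch of the behavior described above, not MindCluster code: the Pod class, the reschedule function, and the pod_level_succeeds flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    """Hypothetical stand-in for one pod of an instance."""
    name: str
    node: str


def reschedule(pods, faulty_node, pod_level_enabled, pod_level_succeeds=True):
    """Return (level_used, stopped_pods) for one instance.

    Pod-level rescheduling stops only the pods on the faulty node.
    Job-level rescheduling stops every pod of the instance, and is also
    the fallback when pod-level rescheduling fails.
    """
    if pod_level_enabled and pod_level_succeeds:
        return "pod", [p for p in pods if p.node == faulty_node]
    return "job", list(pods)


# Active and standby MS Coordinator pods on different nodes (names are made up).
coordinators = [
    Pod("coordinator-active", "node-a"),
    Pod("coordinator-standby", "node-b"),
]

level, stopped = reschedule(coordinators, "node-a", pod_level_enabled=True)
print(level, [p.name for p in stopped])  # pod ['coordinator-active']
```

In this model, passing pod_level_succeeds=False simulates a failed pod-level attempt, in which case both pods are returned under the "job" level.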

Configuring Job-Level Rescheduling

Job-level rescheduling is enabled by default. You only need to prepare a job YAML file. The following uses MindIE Server as an example to describe how to configure job-level rescheduling.

apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: mindie-server-0
  namespace: mindie
  labels:
    framework: pytorch        
    app: mindie-ms-server       # Role of MindIE Motor in the AscendJob, which cannot be changed.
    jobID: mindie-ms-test       # Unique ID of the MindIE Motor inference job in the cluster. Change the ID as required.
    fault-scheduling: force    # Enable rescheduling.
    ring-controller.atlas: ascend-910b
spec:
  schedulerName: volcano    # Scheduler selected when Ascend Operator enables gang scheduling.
  runPolicy:
    schedulingPolicy:     # This field takes effect only when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.
      minAvailable: 2    # Total number of running job replicas
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Master:

Configuring Pod-Level Rescheduling

Currently, pod-level rescheduling is supported only by MS Controller and MS Coordinator. You are advised to enable this function when active/standby switchover is enabled. The following uses MS Coordinator as an example to describe how to configure pod-level rescheduling.

apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: mindie-coordinator
  namespace: mindie
  labels:
    framework: pytorch        
    app: mindie-ms-coordinator        # Role of MindIE Motor in the AscendJob, which cannot be changed.
    jobID: mindie-ms-test             # Unique ID of the MindIE Motor inference job in the cluster. Change the ID as required.
    fault-scheduling: force           # Enable rescheduling.
    pod-rescheduling: "on"            # Enable pod-level rescheduling.
    ring-controller.atlas: ascend-910b
spec:
  schedulerName: volcano    # Scheduler selected when Ascend Operator enables gang scheduling.
  runPolicy:
    schedulingPolicy:     # This field takes effect only when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.
      minAvailable: 2       # Total number of running job replicas
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Master:
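
Before submitting a job, it can be useful to check which rescheduling labels a manifest carries. The following Python sketch classifies an already-parsed metadata.labels mapping using only the two label values shown in the examples on this page. The function name rescheduling_level and its return strings are hypothetical, and other label values that the scheduler may accept are not handled. Note that the value "on" is quoted in the manifest; YAML 1.1 parsers such as PyYAML read an unquoted on as the boolean true, which this check would not match.

```python
def rescheduling_level(labels):
    """Classify AscendJob labels by the rescheduling labels shown on this page.

    `labels` is the metadata.labels mapping of a parsed manifest.
    Only the values `force` and "on" from the examples are recognized.
    """
    if labels.get("fault-scheduling") != "force":
        return "not enabled by label"
    if labels.get("pod-rescheduling") == "on":
        return "pod-level"
    return "job-level"


# Labels copied from the MS Coordinator example.
coordinator_labels = {
    "framework": "pytorch",
    "app": "mindie-ms-coordinator",
    "jobID": "mindie-ms-test",
    "fault-scheduling": "force",
    "pod-rescheduling": "on",
    "ring-controller.atlas": "ascend-910b",
}

print(rescheduling_level(coordinator_labels))  # pod-level
```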
