Process Description

An OME-based SGLang inference job consists of a Router pod, which requires no NPU resources, and inference instance pods. Each inference instance pod is either a prefill instance pod or a decode instance pod. Depending on the configuration mode of the inference service, OME generates different workloads to create the inference instances, and the Router exposes them to external systems as a single inference service. MindCluster cluster scheduling components can schedule the Deployment and LeaderWorkerSet (LWS) workloads that OME creates for inference jobs. In LeaderWorkerSet scenarios, LWS gang scheduling must be enabled.
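The composition described above can be sketched as a small data model. This is only an illustration, not OME or MindCluster code: the names `Role`, `WorkloadKind`, `PodSpec`, and `needs_gang_scheduling` are hypothetical. It also assumes that inference instance pods request at least one NPU, which the description implies but does not state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Role(Enum):
    ROUTER = "router"
    PREFILL = "prefill"
    DECODE = "decode"


class WorkloadKind(Enum):
    DEPLOYMENT = "Deployment"
    LEADER_WORKER_SET = "LeaderWorkerSet"


@dataclass(frozen=True)
class PodSpec:
    """Hypothetical summary of one pod in an OME-based SGLang inference job."""
    role: Role
    workload: WorkloadKind
    npus: int  # illustrative NPU request count, not a real resource field

    def validate(self) -> None:
        # The Router pod does not require NPU resources.
        if self.role is Role.ROUTER and self.npus != 0:
            raise ValueError("Router pod must not request NPUs")
        # Assumption: prefill and decode instance pods request at least one NPU.
        if self.role is not Role.ROUTER and self.npus <= 0:
            raise ValueError(f"{self.role.value} pod must request NPUs")


def needs_gang_scheduling(pods: Iterable[PodSpec]) -> bool:
    """Return True if any pod belongs to a LeaderWorkerSet workload,
    in which case LWS gang scheduling has to be enabled."""
    return any(p.workload is WorkloadKind.LEADER_WORKER_SET for p in pods)
```

For example, a job made up of a Router pod backed by a Deployment plus prefill and decode pods backed by LeaderWorkerSet workloads would make `needs_gang_scheduling` return `True`. The choice of workload kind for each role here is illustrative only.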

For more details, see the OME documentation and the LWS documentation.

Procedure

Figure 1 shows the procedure for deploying OME-based SGLang inference jobs from the command line with MindCluster cluster scheduling components.

Figure 1 Procedure
