Preparing a Job YAML File
Prepare the inference image, select the YAML example that matches your scenario, and modify it as required.
Prerequisite
The inference image is ready. Obtain an SGLang inference image by following the SGLang documentation. The image also requires MemFabric Hybrid, which you obtain from the MemFabric Hybrid page.
YAML Selection
An OME-based SGLang inference job is started through three CRDs: Base Model, Serving Runtime, and Inference Service. For details about the resource usage and deployment of Base Model and Inference Service, see the OME documentation.
The cluster scheduling components provide YAML examples of the ClusterServingRuntime resources that OME jobs require. Select the example that matches the component, processor type, and job type you use, and modify it to meet your actual requirements before using it.
| Type | Hardware | YAML File Name | How to Obtain |
|---|---|---|---|
| Non-cross node instance (Deployment scenario) | Atlas 800I A2 inference server, Atlas 800I A3 SuperPoD Server | llama-3-2-1b-instruct-rt-pd-standalone.yaml | |
| Cross-node instance (LeaderWorkerSet scenario) | Atlas 800I A2 inference server, Atlas 800I A3 SuperPoD Server | llama-3-2-1b-instruct-rt-pd-distributed.yaml | |

Note: The provided YAML files are for test only. You can modify them as required.
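
The selection in the table can be expressed as a small lookup. This is only an illustrative sketch: the function name `select_runtime_yaml` and the `cross_node` flag are hypothetical and not part of OME or MindCluster. Only the two file names come from the table.

```python
# Hypothetical helper: maps the job type from the table to the example file name.
RUNTIME_YAML_BY_SCENARIO = {
    "Deployment": "llama-3-2-1b-instruct-rt-pd-standalone.yaml",
    "LeaderWorkerSet": "llama-3-2-1b-instruct-rt-pd-distributed.yaml",
}


def select_runtime_yaml(cross_node: bool) -> str:
    """Return the example file name for a cross-node or non-cross-node instance."""
    # Cross-node instances use the LeaderWorkerSet scenario,
    # non-cross-node instances use the Deployment scenario.
    scenario = "LeaderWorkerSet" if cross_node else "Deployment"
    return RUNTIME_YAML_BY_SCENARIO[scenario]
```

Both files apply to the Atlas 800I A2 inference server and the Atlas 800I A3 SuperPoD Server, so the hardware type does not change which file is selected.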
After you modify the Base Model, Serving Runtime, and Inference Service YAML files for the OME deployment mode, OME and its components start the sub-workload (Deployment or LeaderWorkerSet) and the corresponding pods, and manage the lifecycle of the inference service pods. After the inference service pods are created, MindCluster schedules them.
Job YAML Description
apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-llama-3-2-1b-instruct-distributed
spec:
  decoderConfig:
    annotations:
      sp-block: "16" # (Required only by Atlas 900 A3 SuperPoD) Total number of NPUs of pod requests corresponding to one prefill/decode instance.
      huawei.com/schedule_minAvailable: "2" # (Required only in the Deployment scenario) Number of replicas of the decode instance (equivalent to the engineConfig field of the prefill instance).
    leader:
      nodeSelector:
        accelerator-type: module-a3-16-super-pod # Set this parameter based on the actual node type.
      schedulerName: volcano # Set the scheduler to Volcano.
      runner:
        name: sglang-decoder
        image: "sglang:xxx"
        command:
          ...
        env:
          ...
          - name: ASCEND_VISIBLE_DEVICES
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['huawei.com/Ascend910']
        resources:
          limits:
            huawei.com/Ascend910: 16 # Set this parameter based on the number of NPUs required by each pod.
          requests:
            huawei.com/Ascend910: 16 # Set this parameter based on the number of NPUs required by each pod.
        volumeMounts:
          ...
          - name: driver
            mountPath: /usr/local/Ascend/driver
          ...
      volumes:
        ...
        - name: driver
          hostPath:
            path: /usr/local/Ascend/driver
        ...
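
The comments on sp-block and on the huawei.com/Ascend910 resources imply a simple relationship: sp-block is the total number of NPUs requested by the pods of one prefill/decode instance. The sketch below works that arithmetic out under the assumption that every pod of an instance requests the same number of NPUs. The function name `sp_block_annotation` and its parameters are hypothetical and not part of OME or MindCluster.

```python
def sp_block_annotation(npus_per_pod: int, pods_per_instance: int) -> str:
    """Return an sp-block value for one prefill/decode instance.

    Assumes all pods of the instance request the same number of NPUs.
    """
    if npus_per_pod <= 0 or pods_per_instance <= 0:
        raise ValueError("NPU and pod counts must be positive integers")
    # Total NPUs requested by all pods of one instance. Kubernetes
    # annotation values are strings, so the result is returned as one.
    return str(npus_per_pod * pods_per_instance)


# A single pod requesting 16 NPUs gives the value used in the sample.
print(sp_block_annotation(16, 1))  # prints 16
```

With these inputs the result equals the sample value of "16", which is what the sample would produce if one instance is a single 16-NPU pod. Check the actual number of pods per instance in your deployment before setting the annotation.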