Preparing a Job YAML File

Complete the image preparation, then select the YAML example that matches your scenario and modify it as required.

Prerequisite

You have completed image preparation: obtain an SGLang inference image by following the SGLang documentation, and obtain MemFabric Hybrid, which the image requires, from MemFabric Hybrid.

YAML Selection

An OME-based SGLang inference job is started through the Base Model, Serving Runtime, and Inference Service CRDs. For details about the resource usage and deployment of Base Model and Inference Service, see the OME documentation.

The cluster scheduling components provide YAML examples of the ClusterServingRuntime resources that OME jobs require. Select an example based on the component you use, the processor type, and the job type, and modify it as required before use.

Type	Hardware	YAML File Name	How to Obtain
Non-cross node instance (Deployment scenario)	Atlas 800I A2 inference server; Atlas 800I A3 SuperPoD Server	llama-3-2-1b-instruct-rt-pd-standalone.yaml	Click here.
Cross-node instance (LeaderWorkerSet scenario)	Atlas 800I A2 inference server; Atlas 800I A3 SuperPoD Server	llama-3-2-1b-instruct-rt-pd-distributed.yaml	Click here.
Note: The provided YAML files are for testing only. Modify them as required.

After you modify the Base Model, Serving Runtime, and Inference Service YAML files for the OME deployment mode, OME and its components start the sub-workload (Deployment or LeaderWorkerSet) and its pods, and manage the lifecycle of the inference service pods. Once the inference service pods are created, MindCluster schedules them.

Job YAML Description

apiVersion: ome.io/v1beta1
kind: ClusterServingRuntime
metadata:
  name: srt-llama-3-2-1b-instruct-distributed     
spec:
  decoderConfig:
    annotations:
      sp-block: "16"  # (Required only by Atlas 900 A3 SuperPoD) Total number of NPUs requested by the pods of one prefill/decode instance.
      huawei.com/schedule_minAvailable: "2"  # (Required only in the Deployment scenario) Number of replicas of the decode instance (equivalent to the engineConfig field of the prefill instance).
    leader:
      nodeSelector:
        accelerator-type: module-a3-16-super-pod   # Set this parameter based on the actual node type.
      schedulerName: volcano  # Set the scheduler to Volcano.
      runner:
        name: sglang-decoder
        image: "sglang:xxx"
        command:
        ...
        env:
        ...
        - name: ASCEND_VISIBLE_DEVICES
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['huawei.com/Ascend910']
        resources:
          limits:
            huawei.com/Ascend910: 16  # Set this parameter based on the number of NPUs required by each pod.
          requests:
            huawei.com/Ascend910: 16  # Set this parameter based on the number of NPUs required by each pod.
        volumeMounts:
        ...
        - name: driver
          mountPath: /usr/local/Ascend/driver
        ...
      volumes:
      ...
      - name: driver
        hostPath:
          path: /usr/local/Ascend/driver
    ...
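
The NPU-related values in the YAML above have to agree with each other: Kubernetes requires the request and limit of an extended resource such as huawei.com/Ascend910 to be equal, and sp-block is described as the total number of NPUs requested by the pods of one instance. The following Python sketch shows one way to sanity-check these values before applying the file. The helper names check_runner_npus and expected_sp_block are hypothetical and are not part of OME or MindCluster, and pods_per_instance=1 is an assumption chosen so that the result matches the "16" shown in the example.

```python
NPU_RESOURCE = "huawei.com/Ascend910"


def check_runner_npus(runner: dict) -> int:
    """Return the per-pod NPU count, requiring limits and requests to match."""
    resources = runner.get("resources", {})
    limits = resources.get("limits", {}).get(NPU_RESOURCE)
    requests = resources.get("requests", {}).get(NPU_RESOURCE)
    if limits is None or requests is None:
        raise ValueError(f"{NPU_RESOURCE} must be set in both limits and requests")
    if limits != requests:
        raise ValueError(f"limits ({limits}) and requests ({requests}) differ")
    return int(limits)


def expected_sp_block(npus_per_pod: int, pods_per_instance: int) -> str:
    """sp-block: total NPUs requested by all pods of one prefill/decode instance."""
    return str(npus_per_pod * pods_per_instance)


# The runner section as it would look after parsing the YAML into a dict.
# Values are copied from the example; pods_per_instance=1 is an assumption.
runner = {
    "name": "sglang-decoder",
    "resources": {
        "limits": {NPU_RESOURCE: 16},
        "requests": {NPU_RESOURCE: 16},
    },
}

npus = check_runner_npus(runner)
sp_block = expected_sp_block(npus, pods_per_instance=1)
print(f'sp-block: "{sp_block}"')
```

The annotation value is returned as a string because Kubernetes annotation values must be strings, which is why the YAML quotes "16".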
