Preparing a Job YAML File

Complete the image creation preparations, select a YAML example, and modify it as required.

Prerequisite

You have prepared for image creation.

YAML Selection

Various YAML examples are provided for cluster scheduling. Select the example that matches the component, processor type, and job type in use, and modify it to suit your actual requirements before using it.

Type            Hardware                                                       YAML File Name    How to Obtain
MS Controller   -                                                              controller.yaml   Click here.
MS Coordinator  -                                                              coordinator.yaml  Click here.
MindIE Server   Atlas 800I A2 inference server; Atlas 800I A3 SuperPoD Server  server.yaml       Click here.

Note:

If the Atlas 800I A3 SuperPoD Server is used, modify certain parameters after obtaining the YAML file, as described for the Atlas 800I A3 SuperPoD Server in Job YAML Description.

Job YAML Description

Compared with a common AscendJob, a MindIE Motor inference job requires two additional labels: app and jobID. MindIE Server requires NPUs. Each prefill or decode instance is delivered as its own MindIE Server AscendJob, so ensure that the number of MindIE Server AscendJobs matches the total number of prefill instances and decode instances.

For example, if a MindIE Motor inference job contains one MS Controller instance, one MS Coordinator instance, x prefill instances, and y decode instances, the number of AscendJobs to be deployed is 1 + 1 + x + y.
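As a worked instance of this count, the sketch below assumes x = 2 prefill instances and y = 1 decode instance, which gives 1 + 1 + 2 + 1 = 5 AscendJobs. It is an illustrative inventory only and is not a file that any component reads. The ascendJobs key, the coordinator name, and the mindie-server-1 and mindie-server-2 names are hypothetical. The sketch does not show which MindIE Server AscendJob acts as a prefill or decode instance.

```yaml
# Illustrative inventory only (hypothetical layout, not consumed by any tool).
# Assumes x = 2 prefill instances and y = 1 decode instance.
# Total AscendJobs = 1 + 1 + x + y = 1 + 1 + 2 + 1 = 5.
jobID: mindie-ms-test                  # shared by every AscendJob of this MindIE Motor job
ascendJobs:
  - name: mindie-ms-test-controller    # app: mindie-ms-controller
  - name: mindie-ms-test-coordinator   # app: mindie-ms-coordinator (name is hypothetical)
  - name: mindie-server-0              # app: mindie-ms-server
  - name: mindie-server-1              # app: mindie-ms-server (name is hypothetical)
  - name: mindie-server-2              # app: mindie-ms-server (name is hypothetical)
```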

  • MS Controller and MS Coordinator do not require NPUs. They are deployed as AscendJobs and support multiple replicas. The following is an example YAML file for MS Controller and MS Coordinator (the MS Controller values are shown):
    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: mindie-ms-test-controller
      namespace: mindie
      labels:
        framework: pytorch          
        app: mindie-ms-controller    # Role of MindIE Motor in the AscendJob, which cannot be changed.
        jobID: mindie-ms-test      # Unique ID of the MindIE Motor job in the cluster. Change the ID as required.
        ring-controller.atlas: ascend-910b
    spec:
      schedulerName: volcano    # Scheduler selected when Ascend Operator enables gang scheduling.
      runPolicy:
        schedulingPolicy:     # This field takes effect only when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.
          minAvailable: 1    # Total number of running job replicas
          queue: default
      successPolicy: AllWorkers
      replicaSpecs:
        Master:
          replicas: 1
          restartPolicy: Always
          template:
            metadata:
              ...

app and jobID are described as follows. For details about other parameters, see YAML Parameters.

app: role of MindIE Motor in the AscendJob. The value can be mindie-ms-controller, mindie-ms-coordinator, or mindie-ms-server.

jobID: unique ID of the MindIE Motor job in the cluster. You can configure the ID as required.
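To illustrate how the two labels change between roles, the following sketch adapts the metadata of the MS Controller example for MS Coordinator. It assumes that only the name and the app label differ. The name mindie-ms-test-coordinator is hypothetical, and the remaining labels are copied from the controller example rather than confirmed for the coordinator. The provided coordinator.yaml remains the authoritative template.

```yaml
# Sketch only: metadata of an MS Coordinator AscendJob, adapted from the MS Controller example.
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: mindie-ms-test-coordinator   # hypothetical name, chosen to mirror the controller example
  namespace: mindie
  labels:
    framework: pytorch
    app: mindie-ms-coordinator       # role value for MS Coordinator
    jobID: mindie-ms-test            # same jobID as the other AscendJobs of this MindIE Motor job
    ring-controller.atlas: ascend-910b
# spec is omitted here; take it from the provided coordinator.yaml.
```

Keeping jobID identical across the controller, coordinator, and server AscendJobs is what groups them into one MindIE Motor job, while app distinguishes their roles.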

  • Example YAML file of MindIE Server
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rings-config-mindie-server-0  # Must be "rings-config-" followed by the name of the AscendJob (mindie-server-0 here). The prefix "rings-config-" cannot be changed.
      namespace: mindie
      labels:
        jobID: mindie-ms-test
        ring-controller.atlas: ascend-910b
        mx-consumer-cim: "true"
    data:
      hccl.json: |
        {
            "status":"initializing"
        }
    ---
    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: mindie-server-0
      namespace: mindie
      labels:
        framework: pytorch        
        app: mindie-ms-server        # Role of MindIE Motor in the AscendJob, which cannot be changed.
        jobID: mindie-ms-test       # Unique ID of the MindIE Motor job in the cluster. Change the ID as required.
        ring-controller.atlas: ascend-910b
    spec:
      schedulerName: volcano    # Scheduler selected when Ascend Operator enables gang scheduling.
      runPolicy:
        schedulingPolicy:     # This field takes effect only when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.
          minAvailable: 2    # Total number of running job replicas
          queue: default
      successPolicy: AllWorkers
      replicaSpecs:
        Master:
  • If the Atlas 800I A3 SuperPoD Server is used, modify the YAML file of the MindIE Server job as follows.
    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: mindie-server-0
      namespace: mindie
      labels:
        framework: pytorch
        app: mindie-ms-server      # Role of MindIE Motor in the AscendJob, which cannot be changed.
        jobID: mindie-ms-test       # Unique ID of the MindIE Motor job in the cluster. Change the ID as required.
        ring-controller.atlas: ascend-910b
        fault-scheduling: force
      annotations:
        sp-block: "16"         # Add this annotation. For details, see YAML Parameters.
    spec:
      schedulerName: volcano    # Scheduler selected when Ascend Operator enables gang scheduling.
      runPolicy:
        schedulingPolicy:     # This field takes effect only when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.
          minAvailable: 2    # Total number of running job replicas
          queue: default
      successPolicy: AllWorkers
      replicaSpecs:
        Master:
          replicas: 1
          restartPolicy: Always
          template:
            metadata:
              labels:
                ring-controller.atlas: ascend-910b
                app: mindie-ms-server
                jobID: mindie-ms-test
            spec:
              nodeSelector:
                accelerator: huawei-Ascend910
                # accelerator-type: module-910b-8  # Delete or comment out this nodeSelector.
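When several MindIE Server instances are deployed, each AscendJob needs its own ConfigMap whose name follows the "rings-config-" rule from the MindIE Server example. The sketch below shows the metadata for a second instance. The instance name mindie-server-1 is an assumption made for illustration; the labels and the hccl.json placeholder are copied from the MindIE Server example.

```yaml
# Sketch only: naming for a second MindIE Server instance (instance name is hypothetical).
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindie-server-1   # "rings-config-" + name of the AscendJob below
  namespace: mindie
  labels:
    jobID: mindie-ms-test
    ring-controller.atlas: ascend-910b
    mx-consumer-cim: "true"
data:
  hccl.json: |
    {
        "status":"initializing"
    }
---
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: mindie-server-1                # hypothetical name; must match the ConfigMap suffix
  namespace: mindie
  labels:
    framework: pytorch
    app: mindie-ms-server              # unchanged for every MindIE Server AscendJob
    jobID: mindie-ms-test              # unchanged: same MindIE Motor job
    ring-controller.atlas: ascend-910b
# spec is omitted here; reuse the spec from the provided server.yaml.
```

Only the two name fields change between instances in this sketch; app and jobID stay the same so that all instances belong to the same MindIE Motor job.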