Preparing a Job YAML File
After completing the preparations for image creation, select a YAML example and modify it as required.
Prerequisite
You have prepared for image creation.
YAML Selection
Several YAML examples are provided for cluster scheduling. Select the example that matches the component, processor type, and job type in use, and modify it as required before using it.
| Type | Hardware | YAML File Name | How to Obtain |
|---|---|---|---|
| MS Controller | - | controller.yaml | |
| MS Coordinator | - | coordinator.yaml | |
| MindIE Server | Atlas 800I A2 inference server, Atlas 800I A3 SuperPoD Server | server.yaml | |

Note: If the Atlas 800I A3 SuperPoD Server is used, modify certain parameters after obtaining the YAML file, as described in Job YAML Description.
Job YAML Description
For example, if a MindIE Motor inference job contains one MS Controller instance, one MS Coordinator instance, x prefill instances, and y decode instances, the number of AscendJobs to be deployed is 1 + 1 + x + y.
- MS Controller and MS Coordinator do not require NPUs. They are deployed as AscendJobs and support multiple replicas. The following is an example YAML file of MS Controller and MS Coordinator:
    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: mindie-ms-test-controller
      namespace: mindie
      labels:
        framework: pytorch
        app: mindie-ms-controller           # Role of MindIE Motor in the AscendJob, which cannot be changed.
        jobID: mindie-ms-test               # Unique ID of the MindIE Motor job in the cluster. Change the ID as required.
        ring-controller.atlas: ascend-910b
    spec:
      schedulerName: volcano                # Scheduler selected when Ascend Operator enables gang scheduling.
      runPolicy:
        schedulingPolicy:                   # This field takes effect only when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.
          minAvailable: 1                   # Total number of running job replicas
          queue: default
      successPolicy: AllWorkers
      replicaSpecs:
        Master:
          replicas: 1
          restartPolicy: Always
          template:
            metadata:
    ...
app and jobID are described as follows. For details about other parameters, see YAML Parameters.
app: role of MindIE Motor in the AscendJob. The value can be mindie-ms-controller, mindie-ms-coordinator, or mindie-ms-server.
jobID: unique ID of the MindIE Motor job in the cluster. You can configure the ID as required.
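As an illustration of how these two labels relate, the fragment below sketches the metadata of an MS Coordinator AscendJob that belongs to the same MindIE Motor job as the controller example above. It is a sketch under assumptions, not the shipped coordinator.yaml: the name mindie-ms-test-coordinator is hypothetical (chosen by analogy with the controller example), and reusing the jobID value is inferred from the controller and server examples, which both use mindie-ms-test.

```yaml
# Sketch only: metadata for an MS Coordinator AscendJob (spec omitted).
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: mindie-ms-test-coordinator     # Hypothetical name, by analogy with mindie-ms-test-controller.
  namespace: mindie
  labels:
    framework: pytorch
    app: mindie-ms-coordinator         # Role value for MS Coordinator, one of the three listed values.
    jobID: mindie-ms-test              # Assumed to match the jobID used by the other AscendJobs of this MindIE Motor job.
    ring-controller.atlas: ascend-910b
```

Only the app value distinguishes the role here; the remaining fields should be taken from the coordinator.yaml example and YAML Parameters.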
- Example YAML file of MindIE Server
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rings-config-mindie-server-0    # Must be the same as the name attribute of the AscendJob. The prefix "rings-config-" cannot be changed.
      namespace: mindie
      labels:
        jobID: mindie-ms-test
        ring-controller.atlas: ascend-910b
        mx-consumer-cim: "true"
    data:
      hccl.json: |
        {
            "status":"initializing"
        }
    ---
    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: mindie-server-0
      namespace: mindie
      labels:
        framework: pytorch
        app: mindie-ms-server               # Role of MindIE Motor in the AscendJob, which cannot be changed.
        jobID: mindie-ms-test               # Unique ID of the MindIE Motor job in the cluster. Change the ID as required.
        ring-controller.atlas: ascend-910b
    spec:
      schedulerName: volcano                # Scheduler selected when Ascend Operator enables gang scheduling.
      runPolicy:
        schedulingPolicy:                   # This field takes effect only when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.
          minAvailable: 2                   # Total number of running job replicas
          queue: default
      successPolicy: AllWorkers
      replicaSpecs:
        Master:
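The naming rule in the example above (the ConfigMap name is the fixed prefix rings-config- followed by the AscendJob name) can be sketched for an additional MindIE Server instance as follows. The instance name mindie-server-1 is hypothetical and only extends the mindie-server-0 pattern; the fragment shows just the fields involved in the pairing and omits the AscendJob spec.

```yaml
# Sketch only: name pairing for a hypothetical second MindIE Server instance.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindie-server-1   # "rings-config-" + AscendJob name below.
  namespace: mindie
  labels:
    jobID: mindie-ms-test              # Same job ID as the other AscendJobs in the example.
    ring-controller.atlas: ascend-910b
    mx-consumer-cim: "true"
data:
  hccl.json: |
    {
        "status":"initializing"
    }
---
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: mindie-server-1                # Hypothetical instance name; must match the ConfigMap suffix.
  namespace: mindie
  labels:
    framework: pytorch
    app: mindie-ms-server
    jobID: mindie-ms-test
    ring-controller.atlas: ascend-910b
```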
- If the Atlas 800I A3 SuperPoD Server is used, modify the YAML file of the MindIE Server job as follows.
    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: mindie-server-0
      namespace: mindie
      labels:
        framework: pytorch
        app: mindie-ms-server               # Cannot be modified.
        jobID: mindie-ms-test               # Unique ID of the MindIE Motor job in the cluster. Change the ID as required.
        ring-controller.atlas: ascend-910b
        fault-scheduling: force
      annotations:
        sp-block: "16"                      # Add this annotation. For details, see YAML Parameters.
    spec:
      schedulerName: volcano                # Scheduler selected when Ascend Operator enables gang scheduling.
      runPolicy:
        schedulingPolicy:                   # This field takes effect only when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.
          minAvailable: 2                   # Total number of running job replicas
          queue: default
      successPolicy: AllWorkers
      replicaSpecs:
        Master:
          replicas: 1
          restartPolicy: Always
          template:
            metadata:
              labels:
                ring-controller.atlas: ascend-910b
                app: mindie-ms-server
                jobID: mindie-ms-test
            spec:
              nodeSelector:
                accelerator: huawei-Ascend910
                # accelerator-type: module-910b-8    # Delete or comment out this nodeSelector.