Preparation of Job YAML Files

The cluster scheduling components provide YAML examples. Select the example that matches your functionality, model type, job type, and fault handling mode, then modify it to meet your actual requirements before use.

Table 1 YAML examples

| Job Type | Hardware Model | Training Framework | Model | YAML File Name | How to Obtain | Description |
|---|---|---|---|---|---|---|
| AscendJob | Atlas 800T A2 training server; Atlas 900 A2 PoD cluster basic unit | PyTorch | Qwen3 | pytorch_multinodes_acjob_910b.yaml | pytorch_multinodes_acjob_910b.yaml | A two-server eight-processor job is presented in the example file by default. |
| AscendJob | Atlas 800T A2 training server; Atlas 900 A2 PoD cluster basic unit | MindSpore | Qwen3 | ms_multinodes_acjob_superpod.yaml | ms_multinodes_acjob_superpod.yaml | A two-server 16-processor job is presented in the example file by default. |
| AscendJob | Atlas 900 A3 SuperPoD | verl | Qwen3-30B | verl-resche.yaml | verl-resche.yaml | A two-server 16-processor job is presented in the example file by default. |

Currently, no example YAML file is provided for resumable training on the Atlas 900 A3 SuperPoD. To adapt an existing example, add the annotations field after the labels field, at the same indentation level. Example:
...
  labels: 
...
  annotations:
    sp-block: "32"   # Number of processors on a logical SuperPoD. For details about the sp-block field, see YAML Parameters.
...
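
For orientation, the fragment below is a minimal sketch of how the annotations field might sit next to labels once the edit is made. It assumes that both fields belong to the job's metadata block, as in standard Kubernetes objects. The job name and the label shown are hypothetical placeholders, not values from the example files; keep whatever your chosen example file already contains. Only the sp-block entry comes from the example above. Kubernetes annotation values must be strings, which is why the number is quoted.

```yaml
# Illustrative sketch only; not copied from the shipped example files.
metadata:
  name: example-job          # hypothetical job name
  labels:
    example-label: "value"   # hypothetical placeholder; keep the labels from your example file
  annotations:               # sibling of labels, at the same indentation level
    sp-block: "32"           # number of processors on a logical SuperPoD (value from the example above)
```

If the example file already has an annotations field in the same block, add the sp-block entry to it instead of creating a second annotations key, because duplicate keys in a YAML mapping are invalid.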