Creating a Job YAML File

Procedure

Select a YAML file based on the description in Instructions.

The example YAML files apply to the NFS scenario, in which NFS must be installed on the storage node. For details about how to install NFS, see Installing NFS.

  1. Download the YAML file from the cluster scheduling component's Gitee code repository. For details about the YAML files, see Table 1.
    Table 1 YAML files of different job types and hardware models

    | Job Type | Hardware Model | Training Framework | YAML File Path | Description |
    | --- | --- | --- | --- | --- |
    | Volcano Job | Atlas 800 training server | TensorFlow | samples/train/yaml/a800_tensorflow_vcjob.yaml | The example file presents a single-server eight-processor job by default. |
    | Volcano Job | Atlas 800 training server | PyTorch | samples/train/yaml/a800_pytorch_vcjob.yaml | The example file presents a single-server eight-processor job by default. |
    | Volcano Job | Atlas 800 training server | MindSpore | samples/train/yaml/a800_mindspore_vcjob.yaml | The example file presents a single-server eight-processor job by default. |
    | Volcano Job | Server (with Atlas 300T training cards) | TensorFlow | samples/train/yaml/a300t_tensorflow_vcjob.yaml | The example file presents a single-server single-processor job by default. |
    | Volcano Job | Server (with Atlas 300T training cards) | PyTorch | samples/train/yaml/a300t_pytorch_vcjob.yaml | The example file presents a single-server single-processor job by default. |
    | Volcano Job | Server (with Atlas 300T training cards) | MindSpore | samples/train/yaml/a300t_mindspore_vcjob.yaml | The example file presents a single-server single-processor job by default. |
    | Deployment | Atlas 800 training server | TensorFlow | samples/train/yaml/a800_tensorflow_deployment.yaml | The example file presents a single-server eight-processor job by default. |
    | Deployment | Atlas 800 training server | PyTorch | samples/train/yaml/a800_pytorch_deployment.yaml | The example file presents a single-server eight-processor job by default. |
    | Deployment | Atlas 800 training server | MindSpore | samples/train/yaml/a800_mindspore_deployment.yaml | The example file presents a single-server eight-processor job by default. |
    | Deployment | Server (with Atlas 300T training cards) | TensorFlow | samples/train/yaml/a300t_tensorflow_deployment.yaml | The example file presents a single-server single-processor job by default. |
    | Deployment | Server (with Atlas 300T training cards) | PyTorch | samples/train/yaml/a300t_pytorch_deployment.yaml | The example file presents a single-server single-processor job by default. |
    | Deployment | Server (with Atlas 300T training cards) | MindSpore | samples/train/yaml/a300t_mindspore_deployment.yaml | The example file presents a single-server single-processor job by default. |

  2. Upload the YAML file to any directory on the master node and modify the file content as required. Table 2 describes some parameters.
    Table 2 Parameters in the YAML file

    | Parameter | Value | Description |
    | --- | --- | --- |
    | minAvailable | Single server: 1; distributed training: N | N indicates the number of nodes. This parameter is not required for Deployment jobs. You are advised to set it to the same value as replicas. |
    | replicas | Single server: 1; distributed training: N | N indicates the number of nodes. |
    | image | - | Training image name. Set this parameter as required. |
    | host-arch | ARM environment: huawei-arm; x86 environment: huawei-x86 | Architecture of the node where the training job runs. Set this parameter as required. For a distributed training job, ensure that all nodes running the job have the same architecture. |
    | accelerator-type | Atlas 800 training server: module; Atlas 800 training server (half configuration of NPUs): half; server (with Atlas 300T training cards): card | Set this parameter based on the type of the node where the training job runs. (Optional) If the node is an Atlas 800 training server with a full configuration of NPUs, this label can be omitted. |
    | huawei.com/Ascend910 | Atlas 800 training server: 1 (single-processor); 2, 4, or 8 (multi-processor); 8 (distributed). Atlas 800 training server (half configuration of NPUs): 1 (single-processor); 2 or 4 (multi-processor); 4 (distributed). Server (with Atlas 300T training cards): 1 (single-processor); 2 (multi-processor); 2 (distributed) | Number of requested NPUs. Set this parameter as required. |

    The a800_tensorflow_vcjob.yaml file is used as an example. Two Atlas 800 training server nodes are used to execute a 2 x 8P distributed training job. The modifications are as follows:
    ...
    minAvailable: 2                # For a distributed training job on N nodes, set the value to N (2 in this two-node example). This parameter is not required for Deployment jobs.
    ...
    - name: "default-test"
      replicas: 2                  # In an N-node scenario, replicas is N and the number of NPUs in the requests field is 8.
      template:
        metadata:
    ...
              resources:
                requests:
                  huawei.com/Ascend910: 8          # Eight NPUs are requested. You can add lines below to request resources such as memory and CPU.
                limits:
                  huawei.com/Ascend910: 8          # The value must be consistent with that in requests.
    ...
    If CPU and memory resources need to be configured, configure them as follows and set the values as required:
    ...
              resources:
                requests:
                  huawei.com/Ascend910: 8
                  cpu: 100m                # 100m means 100 millicores; 100m CPU and 0.1 CPU are equivalent.
                  memory: 100Gi            # Gi denotes 2^30 bytes, so 100Gi means 100 x 2^30 bytes of memory.
                limits:
                  huawei.com/Ascend910: 8
                  cpu: 100m
                  memory: 100Gi
    ...
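    Putting the Table 2 parameters together, the following sketch shows where each one sits in a Volcano Job manifest. Field placement follows the Volcano batch/v1alpha1 API; the job name, image name, and the assumption that host-arch and accelerator-type are nodeSelector labels are illustrative, so align the details with the downloaded sample file rather than copying this verbatim.

    ```yaml
    apiVersion: batch.volcano.sh/v1alpha1   # Volcano Job API group/version
    kind: Job
    metadata:
      name: ascend-train-demo               # hypothetical job name
    spec:
      minAvailable: 1                       # Table 2: single server -> 1
      schedulerName: volcano
      tasks:
      - name: "default-test"
        replicas: 1                         # Table 2: single server -> 1
        template:
          spec:
            nodeSelector:
              host-arch: huawei-arm         # Table 2: architecture of the target node
              accelerator-type: card        # Table 2: node type label; omit for a full-configuration Atlas 800
            containers:
            - name: train
              image: training-image:latest  # Table 2: training image name (placeholder)
              resources:
                requests:
                  huawei.com/Ascend910: 1   # Table 2: number of requested NPUs
                limits:
                  huawei.com/Ascend910: 1   # must match requests
    ```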
  3. As shown in the following examples, the three parameters that follow train_start.sh in the training command are the training code directory, the log file path, and the startup script path relative to the code directory in the container. The parameters starting with -- are required by the training script. For details about how to modify the single-node and distributed training scripts and script parameters, see the model description in the model script source.
    • TensorFlow command parameters
      ...
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/ResNet50_for_TensorFlow_2.6_code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ResNet50_for_TensorFlow_2.6_code/ /job/output/logs tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF/  --distribution_strategy=one_device --use_tf_while_loop=true  --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export --model_dir=/job/output "# Not all parameters are listed here.
      ...
    • PyTorch command parameters
      ...
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/ResNet50_for_PyTorch_1.5_code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ResNet50_for_PyTorch_1.5_code/ /job/output/logs DistributedResnet50/main_apex_d76_npu.py --data=/job/data/imagenet --seed=49 --worker=128  --print-freq=1 --dist-url='tcp://127.0.0.1:50000' --dist-backend='hccl' --multiprocessing-distributed --benchmark=0 --device='npu';"# Not all parameters are listed here.
      ...
    • MindSpore command parameters
      ...
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/ResNet50_for_MindSpore_1.9_code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ResNet50_for_MindSpore_1.9_code/ /job/output/logs train.py --data_path=/job/data/imagenet/train --dataset=resnet50  --output_path=/job/output/ --run_distribute=True --device_num=8..."# Not all parameters are listed here.
      ...
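    Generalizing the three examples above, the command takes the following shape. The angle-bracket names are illustrative placeholders, not real paths; the real values come from your model code and the framework-specific examples.

    ```yaml
    command:
    - "/bin/bash"
    - "-c"
    # The -c string runs three steps in sequence:
    #   1. cd into the scripts directory of the model code
    #   2. make the launcher executable
    #   3. run: bash train_start.sh <code_dir> <log_path> <start_script> [--flags]
    #      <code_dir>      training code directory in the container
    #      <log_path>      training log path
    #      <start_script>  startup script path, relative to <code_dir>
    #      --flags         passed through unchanged to the training script
    - "cd /job/code/<code_dir>/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/<code_dir>/ /job/output/logs <start_script> --<flag>=<value>"
    ```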
  4. Change the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, modify the volume configuration based on the Kubernetes documentation.
    ...
              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code/                     # Path of the training script in the container.
              - name: data
                mountPath: /job/data                      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output                    # Path of the training output in the container.
    ...
            volumes:
    ...
            - name: code
              nfs:
                server: 127.0.0.1        # IP address of the NFS server.
                path: "xxxxxx"           # Training script path.
            - name: data
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Training dataset path.
            - name: output
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Path for saving the configuration model, which is related to the script.
    ...
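    When NFS is not used, each nfs volume above can be replaced with any standard Kubernetes volume type. A minimal sketch using a PersistentVolumeClaim and a hostPath (the claim name training-data and the node path are hypothetical examples):

    ```yaml
            volumes:
            - name: data
              persistentVolumeClaim:
                claimName: training-data   # hypothetical PVC bound to your dataset storage
            - name: output
              hostPath:                    # hostPath suits single-node tests only; the data stays on that node
                path: /data/train_output   # illustrative directory on the node
                type: Directory
    ```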