Creating a Job YAML File

Procedure

Select a YAML file based on the description in Instructions.

The example YAML files apply to the NFS scenario, in which NFS must be installed on the storage node. For details about how to install NFS, see Installing NFS.

  1. Download the YAML file from the cluster scheduling component's Gitee code repository. For details about the YAML files, see Table 1.
    Table 1 YAML files of different job types and hardware models

    | Job Type | Hardware Model | Training Framework | YAML File Path | Description |
    | --- | --- | --- | --- | --- |
    | Volcano Job | Atlas 800 training server | TensorFlow | samples/train/yaml/a800_tensorflow_vcjob.yaml | The example file presents a single-server eight-processor job by default. |
    | Volcano Job | Atlas 800 training server | PyTorch | samples/train/yaml/a800_pytorch_vcjob.yaml | The example file presents a single-server eight-processor job by default. |
    | Volcano Job | Atlas 800 training server | MindSpore | samples/train/yaml/a800_mindspore_vcjob.yaml | The example file presents a single-server eight-processor job by default. |
    | Volcano Job | Server (with Atlas 300T training cards) | TensorFlow | samples/train/yaml/a300t_tensorflow_vcjob.yaml | The example file presents a single-server single-processor job by default. |
    | Volcano Job | Server (with Atlas 300T training cards) | PyTorch | samples/train/yaml/a300t_pytorch_vcjob.yaml | The example file presents a single-server single-processor job by default. |
    | Volcano Job | Server (with Atlas 300T training cards) | MindSpore | samples/train/yaml/a300t_mindspore_vcjob.yaml | The example file presents a single-server single-processor job by default. |
    | Deployment | Atlas 800 training server | TensorFlow | samples/train/yaml/a800_tensorflow_deployment.yaml | The example file presents a single-server eight-processor job by default. |
    | Deployment | Atlas 800 training server | PyTorch | samples/train/yaml/a800_pytorch_deployment.yaml | The example file presents a single-server eight-processor job by default. |
    | Deployment | Atlas 800 training server | MindSpore | samples/train/yaml/a800_mindspore_deployment.yaml | The example file presents a single-server eight-processor job by default. |
    | Deployment | Server (with Atlas 300T training cards) | TensorFlow | samples/train/yaml/a300t_tensorflow_deployment.yaml | The example file presents a single-server single-processor job by default. |
    | Deployment | Server (with Atlas 300T training cards) | PyTorch | samples/train/yaml/a300t_pytorch_deployment.yaml | The example file presents a single-server single-processor job by default. |
    | Deployment | Server (with Atlas 300T training cards) | MindSpore | samples/train/yaml/a300t_mindspore_deployment.yaml | The example file presents a single-server single-processor job by default. |

  2. Upload the YAML file to any directory on the master node and modify the file content as required. Table 2 describes some parameters.
    Table 2 Parameters in the YAML file

    | Parameter | Value | Description |
    | --- | --- | --- |
    | minAvailable | Single server: 1; distributed training: N | N indicates the number of nodes. This parameter is not required for Deployment jobs. You are advised to set it to the same value as replicas. |
    | replicas | Single server: 1; distributed training: N | N indicates the number of nodes. |
    | image | - | Training image name. Set this parameter as required. |
    | host-arch | ARM environment: huawei-arm; x86 environment: huawei-x86 | Architecture of the node where the training job runs. Set this parameter as required. For a distributed training job, ensure that all nodes running the job have the same architecture. |
    | accelerator-type | Atlas 800 training server: module; Atlas 800 training server (half configuration of NPUs): half; server (with Atlas 300T training cards): card | Set this parameter based on the type of the node where the training job runs. (Optional) If the node is an Atlas 800 training server with a full configuration of NPUs, this label can be omitted. |
    | huawei.com/Ascend910 | Atlas 800 training server: 1 (single-processor); 2, 4, or 8 (multi-processor); 8 (distributed). Atlas 800 training server (half configuration of NPUs): 1 (single-processor); 2 or 4 (multi-processor); 4 (distributed). Server (with Atlas 300T training cards): 1 (single-processor); 2 (multi-processor); 2 (distributed) | Number of requested NPUs. Set this parameter as required. |

    The a800_tensorflow_vcjob.yaml file is used as an example. Two Atlas 800 training server nodes are used to execute a 2 x 8P distributed training job. The modifications are as follows:
    ...
    minAvailable: 2                # For a distributed training job on N nodes, set the value to N (2 in this two-node example). This parameter is not required for Deployment jobs.
    ...
    - name: "default-test"
      replicas: 2                  # In an N-node scenario, replicas is N and the number of NPUs in the requests field is 8.
      template:
        metadata:
    ...
              resources:
                requests:
                  huawei.com/Ascend910: 8          # Eight NPUs are requested. You can add lines below to request resources such as memory and CPU.
                limits:
                  huawei.com/Ascend910: 8          # The value must be consistent with that in requests.
    ...
    If CPU and memory resources need to be configured, configure them as follows and set the values as required:
    ...
              resources:
                requests:
                  huawei.com/Ascend910: 8
                  cpu: 100m                # 100m means 100 millicores; 100m CPU and 0.1 CPU are equivalent.
                  memory: 100Gi            # Gi denotes 2^30 bytes, so 100Gi means 100 x 2^30 bytes of memory.
                limits:
                  huawei.com/Ascend910: 8
                  cpu: 100m
                  memory: 100Gi
    ...
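    Putting the Table 2 parameters together, the following sketch shows where each one sits in a Volcano Job manifest. Field placement follows the Volcano batch/v1alpha1 API; the job name, image name, and the assumption that host-arch and accelerator-type are nodeSelector labels are illustrative, so align the details with the downloaded sample file rather than copying this verbatim.

    ```yaml
    apiVersion: batch.volcano.sh/v1alpha1   # Volcano Job API group/version
    kind: Job
    metadata:
      name: ascend-train-demo               # hypothetical job name
    spec:
      minAvailable: 1                       # Table 2: single server -> 1
      schedulerName: volcano
      tasks:
      - name: "default-test"
        replicas: 1                         # Table 2: single server -> 1
        template:
          spec:
            nodeSelector:
              host-arch: huawei-arm         # Table 2: architecture of the target node
              accelerator-type: card        # Table 2: node type label; omit for a full-configuration Atlas 800
            containers:
            - name: train
              image: training-image:latest  # Table 2: training image name (placeholder)
              resources:
                requests:
                  huawei.com/Ascend910: 1   # Table 2: number of requested NPUs
                limits:
                  huawei.com/Ascend910: 1   # must match requests
    ```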
  3. As shown in the following examples, the three parameters that follow train_start.sh in the training command are the training code directory, the log file path, and the startup script path relative to the code directory in the container. The parameters starting with -- are required by the training script. For details about how to modify the single-node and distributed training scripts and script parameters, see the model description in the model script source.
    • TensorFlow command parameters
      ...
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/ResNet50_for_TensorFlow_2.6_code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ResNet50_for_TensorFlow_2.6_code/ /job/output/logs tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF/  --distribution_strategy=one_device --use_tf_while_loop=true  --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export --model_dir=/job/output "# Not all parameters are listed here.
      ...
    • PyTorch command parameters
      ...
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/ResNet50_for_PyTorch_1.5_code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ResNet50_for_PyTorch_1.5_code/ /job/output/logs DistributedResnet50/main_apex_d76_npu.py --data=/job/data/imagenet --seed=49 --worker=128  --print-freq=1 --dist-url='tcp://127.0.0.1:50000' --dist-backend='hccl' --multiprocessing-distributed --benchmark=0 --device='npu';"# Not all parameters are listed here.
      ...
    • MindSpore command parameters
      ...
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/ResNet50_for_MindSpore_1.9_code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ResNet50_for_MindSpore_1.9_code/ /job/output/logs train.py --data_path=/job/data/imagenet/train --dataset=resnet50  --output_path=/job/output/ --run_distribute=True --device_num=8..."# Not all parameters are listed here.
      ...
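    Generalizing the three examples above, the command takes the following shape. The angle-bracket names are illustrative placeholders, not real paths; the real values come from your model code and the framework-specific examples.

    ```yaml
    command:
    - "/bin/bash"
    - "-c"
    # The -c string runs three steps in sequence:
    #   1. cd into the scripts directory of the model code
    #   2. make the launcher executable
    #   3. run: bash train_start.sh <code_dir> <log_path> <start_script> [--flags]
    #      <code_dir>      training code directory in the container
    #      <log_path>      training log path
    #      <start_script>  startup script path, relative to <code_dir>
    #      --flags         passed through unchanged to the training script
    - "cd /job/code/<code_dir>/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/<code_dir>/ /job/output/logs <start_script> --<flag>=<value>"
    ```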
  4. Change the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, modify the volume configuration based on the Kubernetes documentation.
    ...
              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code/                     # Path of the training script in the container.
              - name: data
                mountPath: /job/data                      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output                    # Path of the training output in the container.
    ...
            volumes:
    ...
            - name: code
              nfs:
                server: 127.0.0.1        # IP address of the NFS server.
                path: "xxxxxx"           # Training script path.
            - name: data
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Training dataset path.
            - name: output
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Path for saving the configuration model, which is related to the script.
    ...
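    When NFS is not used, each nfs volume above can be replaced with any standard Kubernetes volume type. A minimal sketch using a PersistentVolumeClaim and a hostPath (the claim name training-data and the node path are hypothetical examples):

    ```yaml
            volumes:
            - name: data
              persistentVolumeClaim:
                claimName: training-data   # hypothetical PVC bound to your dataset storage
            - name: output
              hostPath:                    # hostPath suits single-node tests only; the data stays on that node
                path: /data/train_output   # illustrative directory on the node
                type: Directory
    ```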