YAML Configuration

Procedure

Upload the YAML file to any directory on the management node and modify the file content as required.

Refer to this configuration when using the elastic training feature. The following uses a800_vcjob.yaml as an example to describe how to create a single-server training job that uses eight processors on an Atlas 800 training server node. Modify the file as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindx-dls-test     # The name after rings-config- must be the same as the job name.
...
  labels:
    ring-controller.atlas: ascend-910    # Processor type used by a job
...
---
apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The Volcano API must be used.
kind: Job                               # The value must be Job.
metadata:
  name: mindx-dls-test                  # Job name, which can be customized.
  labels:
    ring-controller.atlas: ascend-910    # Processor type used by a job
    fault-scheduling: "grace"        # Enable rescheduling upon faults.
    elastic-scheduling: "on"          # Enable elastic training. The double quotation marks (") are required.
  annotations:
    minReplicas: "1"                 # Minimum number of replicas
...
spec:
  minAvailable: 1                  # The value is 1.
...
  maxRetry: 0              # The value is 0.
...
  - name: "default-test"
    template:
      metadata:
...
      spec:
...
          env:
...
          - name: ASCEND_VISIBLE_DEVICES                       # This field is required by Ascend Docker Runtime.
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be consistent with resources.requests below.
...
          resources:
            requests:
              huawei.com/Ascend910: 8          # The number of required NPUs is 8.
            limits:
              huawei.com/Ascend910: 8          # The value must be consistent with that in requests.
...
        nodeSelector:
          host-arch: huawei-arm       # Optional. Set it as required.
...
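
The comments in the example above imply a few consistency rules: the ConfigMap name is rings-config- followed by the job name, label values such as "on" stay quoted strings, and the NPU count in limits equals the count in requests. The following is a minimal sketch of how these rules could be checked before the job is submitted. The function check_job is a hypothetical helper, not part of MindX DL or Volcano, and it assumes the YAML documents have already been loaded into plain Python dictionaries by a YAML parser of your choice.

```python
NPU_RESOURCE = "huawei.com/Ascend910"  # resource name taken from the example above


def check_job(config_map, job, container_resources):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper: all arguments are plain dicts loaded from the YAML file.
    """
    problems = []

    # Rule 1: the ConfigMap name is "rings-config-" followed by the job name.
    job_name = job["metadata"]["name"]
    expected = "rings-config-" + job_name
    actual = config_map["metadata"]["name"]
    if actual != expected:
        problems.append(f"ConfigMap name is {actual!r}, expected {expected!r}")

    # Rule 2: label values must be strings. Many YAML 1.1 parsers read an
    # unquoted on as a boolean, which is why the quotation marks matter.
    for key, value in job["metadata"].get("labels", {}).items():
        if not isinstance(value, str):
            problems.append(f"label {key!r} is {value!r}, not a string; quote it")

    # Rule 3: the NPU count in limits must match the count in requests.
    requests = container_resources.get("requests", {})
    limits = container_resources.get("limits", {})
    if requests.get(NPU_RESOURCE) != limits.get(NPU_RESOURCE):
        problems.append(
            f"{NPU_RESOURCE}: requests={requests.get(NPU_RESOURCE)!r}, "
            f"limits={limits.get(NPU_RESOURCE)!r}"
        )

    return problems
```

Passing the values from the example (ConfigMap rings-config-mindx-dls-test, job mindx-dls-test, eight NPUs in both requests and limits) should return an empty list.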

To use the elastic training feature, expand the shared memory by mounting a memory-backed volume at /dev/shm, and add the parameters indicated in the comments. In addition, use the maxRetry mechanism. The following is an example:

...
          volumeMounts:                             # Shared memory expansion for elastic training
          - name: shm
            mountPath: /dev/shm
        volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
...

To configure CPU and memory resources, manually add the cpu and memory parameters and their values by referring to the following example. Set the specific values as required.

...
          resources:  
            requests:
              huawei.com/Ascend910: 8
              cpu: 100m           
              memory: 100Gi       
            limits:
              huawei.com/Ascend910: 8
              cpu: 100m
              memory: 100Gi
...
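
The cpu and memory values use Kubernetes quantity notation: the m suffix means thousandths (100m is 0.1 CPU core), and Gi means 2^30 bytes. Kubernetes also expects each request to be no larger than the corresponding limit. The following sketch, with the hypothetical helpers parse_quantity and requests_exceeding_limits, converts such strings to numbers and reports any resource whose request exceeds its limit. It handles only a subset of suffixes and is meant as an illustration, not a replacement for the validation performed by the API server.

```python
from decimal import Decimal

# Multipliers for a subset of Kubernetes quantity suffixes (illustration only).
SUFFIXES = {
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "T": Decimal(10) ** 12,
}


def parse_quantity(value):
    """Convert a quantity such as '100m', '100Gi', or 8 to a Decimal."""
    text = str(value).strip()
    # Check two-character suffixes first so that 'Mi' is not read as 'M'.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * SUFFIXES[suffix]
    return Decimal(text)


def requests_exceeding_limits(resources):
    """Return the names of resources whose request is larger than the limit."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    return [
        name
        for name, requested in requests.items()
        if name in limits
        and parse_quantity(requested) > parse_quantity(limits[name])
    ]
```

For the example above, where requests and limits are identical, requests_exceeding_limits returns an empty list.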

Modify the mount paths of the training script and code.

The base image pulled from the Ascend image repository does not contain the training script, code, or related files. During training, these files are usually mounted into the container.

          volumeMounts:
          - name: ascend-910-config
            mountPath: /user/serverid/devindex/config
          - name: code
            mountPath: /job/code                     # Path of the training script in the container.
          - name: data
            mountPath: /job/data                      # Path of the training dataset in the container.
          - name: output
            mountPath: /job/output                    # Path of the training output in the container.

As shown in the following examples, the three parameters that follow bash train_start.sh in the YAML file are, in order: the directory of the training code in the container, the output directory (which holds the redirected log file and the TensorFlow model file), and the path of the startup script relative to the code directory. The subsequent parameters starting with -- are required by the training script. For details about how to modify the single-server and distributed training scripts and their parameters, see the model description in the model script source.
- TensorFlow command parameters
```
command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;"
...
```
- PyTorch command parameters
```
command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024 --resume=true;"
...
```
- Skip this step for models that use the MindSpore framework, including the ResNet-50 and Pangu_alpha models.
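
To make the parameter order concrete, the following sketch splits one of the command strings above into the three positional parameters of train_start.sh and the remaining options for the training script. The function split_train_command is a hypothetical helper written for this illustration. It assumes that the commands are separated by semicolons and that no semicolon appears inside a quoted argument.

```python
import shlex


def split_train_command(command):
    """Split the shell string passed to /bin/bash -c into its train_start.sh parts."""
    for segment in command.split(";"):
        tokens = shlex.split(segment)
        if tokens[:2] == ["bash", "train_start.sh"]:
            # The first three arguments are positional; the rest go to the training script.
            code_dir, output_dir, entry_script = tokens[2:5]
            return {
                "code_dir": code_dir,
                "output_dir": output_dir,
                "entry_script": entry_script,
                "script_args": tokens[5:],
            }
    raise ValueError("no 'bash train_start.sh' invocation found")
```

Applied to the TensorFlow example, the result has code_dir set to /job/code/, output_dir set to /job/output/, and entry_script set to tensorflow/resnet_ctl_imagenet_main.py, with the -- options collected in script_args.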

If NFS is used, specify the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, configure the volumes by referring to the Kubernetes documentation.

...
          volumeMounts:
          - name: ascend-910-config
            mountPath: /user/serverid/devindex/config
          - name: code
            mountPath: /job/code                     # Path of the training script in the container.
          - name: data
            mountPath: /job/data                      # Path of the training dataset in the container.
          - name: output
            mountPath: /job/output                    # Path of the training output in the container.
...
        volumes:
...
        - name: code
          nfs:
            server: 127.0.0.1        # IP address of the NFS server.
            path: "xxxxxx"           # Training script path.
        - name: data
          nfs:
            server: 127.0.0.1
            path: "xxxxxx"           # Training dataset path.
        - name: output
          nfs:
            server: 127.0.0.1
            path: "xxxxxx"           # Path for saving the training output, such as the generated model.
...
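
Each name under volumeMounts must refer to a volume of the same name under volumes, and the "xxxxxx" placeholders must be replaced with real paths. The following sketch shows one way to check both points on a fully expanded file. The helpers unmatched_volume_mounts and placeholder_nfs_paths are hypothetical and assume that the container and pod specifications have been loaded into plain Python dictionaries.

```python
def unmatched_volume_mounts(container, pod_spec):
    """Return the names of volumeMounts that have no volume with the same name."""
    defined = {volume["name"] for volume in pod_spec.get("volumes", [])}
    return [
        mount["name"]
        for mount in container.get("volumeMounts", [])
        if mount["name"] not in defined
    ]


def placeholder_nfs_paths(pod_spec, placeholder="xxxxxx"):
    """Return the names of NFS volumes whose path is still the placeholder."""
    return [
        volume["name"]
        for volume in pod_spec.get("volumes", [])
        if volume.get("nfs", {}).get("path") == placeholder
    ]
```

Note that the fragment above elides the volume for ascend-910-config, so running the first check on the fragment alone would report that name. Run it on the complete file instead.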
