YAML Configuration
Procedure
- Upload the YAML file to any directory on the management node and modify the file content as required. Refer to this configuration when using the elastic training feature. The following uses a800_vcjob.yaml as an example to describe how to create a single-server training job on an Atlas 800 training server node. The job uses eight processors. The modification example is as follows:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rings-config-mindx-dls-test     # The name after rings-config- must be the same as the job name.
    ...
      labels:
        ring-controller.atlas: ascend-910   # Processor type used by a job
    ...
    ---
    apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The Volcano API must be used.
    kind: Job                               # The type can only be job.
    metadata:
      name: mindx-dls-test                  # Job name, which can be customized.
      labels:
        ring-controller.atlas: ascend-910   # Processor type used by a job
        fault-scheduling: "grace"           # Enable rescheduling upon faults.
        elastic-scheduling: "on"            # Enable elastic training. Add the double quotation marks (").
      annotations:
        minReplicas: "1"                    # Minimum number of replicas
    ...
    spec:
      minAvailable: 1                       # The value is 1.
    ...
      maxRetry: 0                           # The value is 0.
    ...
      - name: "default-test"
        template:
          metadata:
    ...
          spec:
    ...
              env:
    ...
              - name: ASCEND_VISIBLE_DEVICES                              # This field is required by Ascend Docker Runtime.
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['huawei.com/Ascend910']   # The value must be consistent with resources.requests below.
    ...
              resources:
                requests:
                  huawei.com/Ascend910: 8   # The number of required NPUs is 8.
                limits:
                  huawei.com/Ascend910: 8   # The value must be consistent with that in requests.
    ...
            nodeSelector:
              host-arch: huawei-arm         # Optional value. Set it as required.
    ...
- To use the elastic training feature, expand the memory and add parameters based on the comments. In addition, use the maxRetry mechanism. The following is an example:
    ...
              volumeMounts:                 # Capacity expansion for elastic training
              - name: shm
                mountPath: /dev/shm
            volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 16Gi
    ...
- To configure CPU and memory resources, manually add the cpu and memory parameters and their values by referring to the following example. Set the specific values as required.
    ...
              resources:
                requests:
                  huawei.com/Ascend910: 8
                  cpu: 100m
                  memory: 100Gi
                limits:
                  huawei.com/Ascend910: 8
                  cpu: 100m
                  memory: 100Gi
    ...
- Modify the mount paths of the training script and code.
The base image pulled from the Ascend image repository does not contain the training script, code, or similar files. During training, these files are usually mounted into the container.
              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code      # Path of the training script in the container.
              - name: data
                mountPath: /job/data      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output    # Path of the training output in the container.
- As shown below, the three parameters following bash train_start.sh in the YAML file are, in order, the directory of the training code in the container, the output directory (which contains the generated log redirection file and the TensorFlow model file), and the path of the startup script relative to the code directory. The subsequent parameters starting with -- are required by the training script. For details about how to modify the single-server and distributed training scripts and script parameters, see the model description in the model script source.
- TensorFlow command parameters
    command:
    - "/bin/bash"
    - "-c"
    - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;"
    ...
- PyTorch command parameters
    command:
    - "/bin/bash"
    - "-c"
    - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024 --resume=true;"
    ...
- Skip this step for models that use the MindSpore framework, including the ResNet-50 and Pangu_alpha models.
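The positional layout of the train_start.sh invocation described in this step can be illustrated with a short Python sketch. It is not part of the product: the function name split_train_start_args is hypothetical, and the sketch assumes that the command string chains its statements with ";" and that no quoted argument contains a ";".

```python
import shlex


def split_train_start_args(command):
    """Split the 'bash train_start.sh ...' statement into its named parts.

    Hypothetical helper for illustration only; it is not shipped with the product.
    """
    # The container command chains several shell statements with ';'.
    # Assumption: no quoted argument itself contains a ';'.
    for statement in command.split(";"):
        tokens = shlex.split(statement)
        if len(tokens) >= 5 and tokens[0] == "bash" and tokens[1] == "train_start.sh":
            return {
                "code_dir": tokens[2],        # directory of the training code in the container
                "output_dir": tokens[3],      # output directory (logs and model files)
                "startup_script": tokens[4],  # startup script path relative to code_dir
                "script_args": tokens[5:],    # remaining parameters, passed to the training script
            }
    raise ValueError("no 'bash train_start.sh' invocation found")


# The TensorFlow command string from the example above.
TF_COMMAND = (
    "cd /job/code/scripts;chmod +x train_start.sh;"
    "bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py "
    "--data_dir=/job/data/imagenet_TF --distribution_strategy=one_device "
    "--use_tf_while_loop=true --epochs_between_evals=1 --skip_eval "
    "--enable_checkpoint_and_export;"
)

for key, value in split_train_start_args(TF_COMMAND).items():
    print(key, "=", value)
```

Running a check like this before submitting the job can catch a swapped code directory and output directory, which the shell itself would not report.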
- If NFS is used, specify the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, modify the volume configuration based on the Kubernetes guide.
    ...
              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code      # Path of the training script in the container.
              - name: data
                mountPath: /job/data      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output    # Path of the training output in the container.
    ...
            volumes:
    ...
            - name: code
              nfs:
                server: 127.0.0.1         # IP address of the NFS server.
                path: "xxxxxx"            # Training script path.
            - name: data
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"            # Training dataset path.
            - name: output
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"            # Set the path for saving the script-related model.
    ...
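Several comments in the YAML examples state consistency rules: the ConfigMap name must be rings-config- followed by the job name, the NPU count in limits must match requests, and each volumeMounts entry needs a volume with the same name. The following Python sketch shows one way to check these rules on manifests that have already been loaded into dictionaries. The function name check_job is hypothetical, and the sketch assumes the standard Volcano Job layout spec.tasks[].template.spec.containers[], part of which is elided in the examples above.

```python
def check_job(config_map, job, npu_key="huawei.com/Ascend910"):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper for illustration only; it is not shipped with the product.
    """
    problems = []

    # Rule from the ConfigMap comment: the name must be rings-config-<job name>.
    expected = "rings-config-" + job["metadata"]["name"]
    actual = config_map["metadata"]["name"]
    if actual != expected:
        problems.append(f"ConfigMap name is {actual!r}, expected {expected!r}")

    # Assumed layout: spec.tasks[].template.spec.containers[] (standard Volcano Job).
    for task in job["spec"].get("tasks", []):
        pod_spec = task["template"]["spec"]
        volume_names = {volume["name"] for volume in pod_spec.get("volumes", [])}
        for container in pod_spec.get("containers", []):
            resources = container.get("resources", {})
            requested = resources.get("requests", {}).get(npu_key)
            limit = resources.get("limits", {}).get(npu_key)
            # Rule from the resources comment: limits must match requests.
            if requested != limit:
                problems.append(f"{npu_key}: requests={requested} but limits={limit}")
            # Every mount must refer to a volume declared in the same pod spec.
            for mount in container.get("volumeMounts", []):
                if mount["name"] not in volume_names:
                    problems.append(f"volumeMount {mount['name']!r} has no matching volume")
    return problems


# Minimal stand-ins for the two manifests; only the fields that are checked are filled in.
example_config_map = {"metadata": {"name": "rings-config-mindx-dls-test"}}
example_job = {
    "metadata": {"name": "mindx-dls-test"},
    "spec": {"tasks": [{
        "name": "default-test",
        "template": {"spec": {
            "containers": [{
                "resources": {
                    "requests": {"huawei.com/Ascend910": 8},
                    "limits": {"huawei.com/Ascend910": 8},
                },
                "volumeMounts": [
                    {"name": "code", "mountPath": "/job/code"},
                    {"name": "data", "mountPath": "/job/data"},
                    {"name": "output", "mountPath": "/job/output"},
                ],
            }],
            "volumes": [{"name": "code"}, {"name": "data"}, {"name": "output"}],
        }},
    }]},
}

print(check_job(example_config_map, example_job))  # prints []
```

The sketch works on plain dictionaries so that it needs only the standard library; loading the YAML file into dictionaries is left to whichever YAML parser is available in the environment.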