Use on the CLI (Other Schedulers)
Using other schedulers on the CLI follows the same process as using Volcano. The only difference is the job YAML file. Prepare and use the corresponding YAML file by referring to Use on the CLI (Volcano).
Procedure
- Upload the YAML file to any directory on the management node and modify the file content as required.
No YAML example is provided for cluster scheduling with other schedulers. Instead, obtain the Volcano YAML example and modify it as follows.
The following uses tensorflow_standalone_acjob.yaml as an example to describe how to create a single-server training job on an Atlas 800T A2 training server. The job uses eight processors. Modify the file as follows:

    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: default-test-tensorflow
      labels:
        framework: tensorflow
        ring-controller.atlas: ascend-{xxx}b
    spec:
      schedulerName: volcano   # Delete this field when using other schedulers.
      runPolicy:               # Delete this field when using other schedulers.
        schedulingPolicy:
          minAvailable: 1
          queue: default
      successPolicy: AllWorkers
      replicaSpecs:
        Chief:
          replicas: 1
          restartPolicy: Never
          template:
            metadata:
              labels:
                ring-controller.atlas: ascend-{xxx}b
            spec:
              nodeSelector:
                host-arch: huawei-arm
                accelerator-type: module-{xxx}b-8
              containers:
              - name: ascend
    ...
                env:
    ...
                # ASCEND_VISIBLE_DEVICES is not supported when other schedulers are used.
                # Delete the following entry:
                - name: ASCEND_VISIBLE_DEVICES
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910']
    ...
                resources:
                  limits:
                    huawei.com/Ascend910: 8
                  requests:
                    huawei.com/Ascend910: 8
                volumeMounts:
    ...

- To configure CPU and memory resources, manually add the cpu and memory parameters and their values by referring to the following example. Set the specific values as required.
    ...
                resources:
                  requests:
                    huawei.com/Ascend910: 8
                    cpu: 100m
                    memory: 100Gi
                  limits:
                    huawei.com/Ascend910: 8
                    cpu: 100m
                    memory: 100Gi
    ...

- Modify the mount paths of the training script and code.
The base image pulled from the Ascend image repository does not contain files such as the training script and code. During training, these files are usually mounted into the container.
    volumeMounts:
    - name: ascend-910-config
      mountPath: /user/serverid/devindex/config
    - name: code
      mountPath: /job/code      # Path of the training script in the container.
    - name: data
      mountPath: /job/data      # Path of the training dataset in the container.
    - name: output
      mountPath: /job/output    # Path of the training output in the container.

- As shown below, the three parameters following bash train_start.sh in the YAML file are, in order, the directory of the training code in the container, the output directory (which contains the generated redirected log file and the TensorFlow model file), and the path of the startup script relative to the code directory. The subsequent parameters starting with -- are required by the training script. For details about how to modify the single-server and distributed training scripts and script parameters, see the model description in the model script source.
- TensorFlow command parameters
    command:
    - "/bin/bash"
    - "-c"
    - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;"
    ...
- PyTorch command parameters
    command:
    - "/bin/bash"
    - "-c"
    - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --world-size=1 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024;"
    ...
- MindSpore command parameters
    command:
    - "/bin/bash"
    - "-c"
    - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ train.py --config_path=/job/code/config/resnet50_imagenet2012_config.yaml --output_dir=/job/output --run_distribute=True --device_num=8 --data_path=/job/data/imagenet/train"
    ...
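The argument layout described above can be illustrated with a small shell sketch. The function describe_train_args is hypothetical and is not part of train_start.sh; it only shows how three positional parameters followed by script options would typically be separated, assuming the startup script path is resolved relative to the code directory as the text states.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the real train_start.sh): prints how the
# arguments that follow "bash train_start.sh" are laid out.
describe_train_args() {
    if [ "$#" -lt 3 ]; then
        echo "usage: describe_train_args <code_dir> <output_dir> <script_rel_path> [script options...]" >&2
        return 1
    fi
    local code_dir="$1"     # 1st: training code directory in the container
    local output_dir="$2"   # 2nd: output directory (log file, model files)
    local script_rel="$3"   # 3rd: startup script path relative to code_dir
    shift 3                 # the remaining arguments go to the training script

    echo "code_dir=${code_dir}"
    echo "output_dir=${output_dir}"
    # Strip one trailing slash from code_dir before joining the relative path.
    echo "startup_script=${code_dir%/}/${script_rel}"
    echo "script_args=$*"
}

# Example with values taken from the TensorFlow command above.
describe_train_args /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py \
    --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device
```

Running the sketch against your own command line before editing the YAML file can help confirm that the third parameter is a relative path and that every remaining parameter is an option of the training script.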
- If NFS is used, specify the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, modify the configuration based on the Kubernetes guide.
    ...
      volumeMounts:
      - name: ascend-910-config
        mountPath: /user/serverid/devindex/config
      - name: code
        mountPath: /job/code      # Path of the training script in the container.
      - name: data
        mountPath: /job/data      # Path of the training dataset in the container.
      - name: output
        mountPath: /job/output    # Path of the training output in the container.
    ...
    volumes:
    ...
    - name: code
      nfs:
        server: 127.0.0.1         # IP address of the NFS server.
        path: "xxxxxx"            # Training script path.
    - name: data
      nfs:
        server: 127.0.0.1
        path: "xxxxxx"            # Training dataset path.
    - name: output
      nfs:
        server: 127.0.0.1
        path: "xxxxxx"            # Set the path for saving the script-related model.
    ...
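Before submitting the job, it can be useful to confirm that the Volcano-specific items named in this procedure (the schedulerName field, the runPolicy field, and the ASCEND_VISIBLE_DEVICES environment variable) have been removed. The following is a minimal sketch of such a check using grep. The function name check_non_volcano_yaml and its messages are hypothetical, and the check only matches uncommented lines by text pattern; it does not parse the YAML file.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: reports Volcano-only items that are still
# present in a job YAML file. Commented-out lines are ignored.
check_non_volcano_yaml() {
    local file="$1"
    local status=0

    if grep -Eq '^[[:space:]]*schedulerName:' "$file"; then
        echo "schedulerName is still set; delete it for other schedulers"
        status=1
    fi
    if grep -Eq '^[[:space:]]*runPolicy:' "$file"; then
        echo "runPolicy is still set; delete it for other schedulers"
        status=1
    fi
    if grep -Eq '^[[:space:]]*-[[:space:]]*name:[[:space:]]*ASCEND_VISIBLE_DEVICES' "$file"; then
        echo "ASCEND_VISIBLE_DEVICES is still defined; delete the env entry"
        status=1
    fi
    return "$status"
}

# Usage (the file name comes from the example in this section):
#   check_non_volcano_yaml tensorflow_standalone_acjob.yaml && echo "ready"
```

The function returns a nonzero status when any of the three items is found, so it can gate the submission command in a script.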