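Using the command line with other schedulers follows the same workflow as using the command line with Volcano; only the job YAML differs. Once the corresponding YAML is ready, follow the "Using the Command Line (Volcano)" section.
Cluster scheduling does not provide a dedicated YAML example for other schedulers. Obtain the Volcano YAML example and modify it as follows.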
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-tensorflow
  labels:
    framework: tensorflow
    ring-controller.atlas: ascend-{xxx}b
spec:
  schedulerName: volcano        # Delete this field when using another scheduler
  runPolicy:                    # Delete this field when using another scheduler
    schedulingPolicy:
      minAvailable: 1
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b
        spec:
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: module-{xxx}b-8
          containers:
          - name: ascend
            ...
            env:
            ...
            # Other schedulers do not yet support the ASCEND_VISIBLE_DEVICES fields; delete the entry below (shown in bold in the original)
            - name: ASCEND_VISIBLE_DEVICES
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['huawei.com/Ascend910']
            ...
            resources:
              limits:
                huawei.com/Ascend910: 8
              requests:
                huawei.com/Ascend910: 8
            volumeMounts:
            ...
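Applying the deletions marked in the Volcano example gives roughly the following manifest. This is a sketch only: it assumes that nothing else in the job depends on the removed fields, and every part marked `# ...` is elided in the source and still has to be taken from the Volcano example.

```yaml
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-tensorflow
  labels:
    framework: tensorflow
    ring-controller.atlas: ascend-{xxx}b
spec:
  # schedulerName and runPolicy have been removed for use with another scheduler
  successPolicy: AllWorkers
  replicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b
        spec:
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: module-{xxx}b-8
          containers:
          - name: ascend
            # ... (elided in the source)
            env:
            # ... (other env entries elided in the source;
            #      the ASCEND_VISIBLE_DEVICES entry has been removed)
            resources:
              limits:
                huawei.com/Ascend910: 8
              requests:
                huawei.com/Ascend910: 8
            volumeMounts:
            # ... (elided in the source)
```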
...
resources:
  requests:
    huawei.com/Ascend910: 8
    cpu: 100m
    memory: 100Gi
  limits:
    huawei.com/Ascend910: 8
    cpu: 100m
    memory: 100Gi
...
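The base images pulled from the Ascend image repository do not contain training scripts, code, or similar files. For training, these files are usually made available inside the container by mounting them as volumes.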
volumeMounts:
- name: ascend-910-config
  mountPath: /user/serverid/devindex/config
- name: code
  mountPath: /job/code      # Path to the training scripts in the container
- name: data
  mountPath: /job/data      # Path to the training dataset in the container
- name: output
  mountPath: /job/output    # Path to the training output in the container
command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;"
...
command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --world-size=1 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024;"
...
command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ train.py --config_path=/job/code/config/resnet50_imagenet2012_config.yaml --output_dir=/job/output --run_distribute=True --device_num=8 --data_path=/job/data/imagenet/train"
...
...
  volumeMounts:
  - name: ascend-910-config
    mountPath: /user/serverid/devindex/config
  - name: code
    mountPath: /job/code      # Path to the training scripts in the container
  - name: data
    mountPath: /job/data      # Path to the training dataset in the container
  - name: output
    mountPath: /job/output    # Path to the training output in the container
...
volumes:
...
- name: code
  nfs:
    server: 127.0.0.1         # IP address of the NFS server
    path: "xxxxxx"            # Path to the training scripts
- name: data
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"            # Path to the training dataset
- name: output
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"            # Path where the script saves its configured outputs, such as the model
...