This configuration example applies to both the rescheduling mode and the graceful fault tolerance mode.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindx-dls-test   # The name after "rings-config-" must match the job name
...
  labels:
    ring-controller.atlas: ascend-910
...
---
apiVersion: batch.volcano.sh/v1alpha1   # Do not modify. The Volcano API must be used.
kind: Job                               # Only the Job type is currently supported
metadata:
  name: mindx-dls-test                  # Job name; can be customized
  labels:
    ring-controller.atlas: ascend-910
    fault-scheduling: "force"           # Enables the force deletion mode
    # Enables unconditional retry on service-plane faults; restartPolicy must also be set to Never
    fault-retry-times: "3"
    # Specifies whether the job uses switch affinity scheduling. If the label is null or omitted,
    # the feature is not used. large-model-schema indicates a large-model job or a filler job;
    # normal-schema indicates a normal job.
    tor-affinity: "normal-schema"
...
spec:
  minAvailable: 1                       # 1 for a single node
...
  maxRetry: 3                           # Number of rescheduling attempts
...
  - name: "default-test"
    replicas: 1                         # 1 for a single node
    template:
      metadata:
...
      spec:
        terminationGracePeriodSeconds: 360
...
          env:
...
          - name: ASCEND_VISIBLE_DEVICES   # Used by MindCluster Ascend Docker Runtime
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['huawei.com/Ascend910']   # Must match the resources requests below
...
          resources:
            requests:
              huawei.com/Ascend910: 8   # Requests 8 NPU chips. Add lines below to configure memory, cpu, and other resources
            limits:
              huawei.com/Ascend910: 8   # Currently must match the requests value above
...
        nodeSelector:
          host-arch: huawei-arm         # Optional value; set according to the actual environment
          accelerator-type: module
...
        restartPolicy: Never
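The comments in the example above state three consistency rules: the ConfigMap name is "rings-config-" followed by the job name, fault-retry-times requires restartPolicy to be Never, and the NPU requests and limits must be equal. A minimal sketch of a pre-submission check for these rules might look like the following. The function name check_job_consistency and its parameters are hypothetical and not part of MindCluster or Volcano; the values are passed in directly rather than read from a YAML file.

```python
def check_job_consistency(configmap_name, job_name, labels, restart_policy,
                          requests, limits, resource="huawei.com/Ascend910"):
    """Return a list of problems; an empty list means all checks passed.

    Hypothetical helper for illustration only, not a MindCluster API.
    """
    problems = []

    # Rule 1: the ConfigMap name is "rings-config-" followed by the job name.
    expected = "rings-config-" + job_name
    if configmap_name != expected:
        problems.append(f"ConfigMap name should be {expected!r}, got {configmap_name!r}")

    # Rule 2: fault-retry-times requires restartPolicy to be Never.
    if "fault-retry-times" in labels and restart_policy != "Never":
        problems.append("fault-retry-times is set, so restartPolicy must be Never")

    # Rule 3: requests and limits must currently agree for the NPU resource.
    if requests.get(resource) != limits.get(resource):
        problems.append(f"{resource} requests and limits differ")

    return problems


# Values taken from the example above.
problems = check_job_consistency(
    configmap_name="rings-config-mindx-dls-test",
    job_name="mindx-dls-test",
    labels={"fault-scheduling": "force", "fault-retry-times": "3"},
    restart_policy="Never",
    requests={"huawei.com/Ascend910": 8},
    limits={"huawei.com/Ascend910": 8},
)
print(problems)
```

With the values from the example, the returned list is empty. Changing any one of the three inputs so that it breaks a rule adds one message to the list.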
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-mindspore
  labels:
    framework: mindspore
    fault-scheduling: "grace"           # Enables the graceful deletion mode
    ring-controller.atlas: ascend-{xxx}b
    # Enables unconditional retry on service-plane faults; restartPolicy must also be set to Never
    fault-retry-times: "3"
    # Specifies whether the job uses switch affinity scheduling. If the label is null or omitted,
    # the feature is not used. large-model-schema indicates a large-model job or a filler job;
    # normal-schema indicates a normal job.
    tor-affinity: "normal-schema"
spec:
  schedulerName: volcano                # Takes effect when the MindCluster Ascend Operator startup parameter enableGangScheduling is true
  runPolicy:
    backoffLimit: 3                     # Number of job rescheduling attempts
    schedulingPolicy:
      minAvailable: 3                   # Takes effect when the MindCluster Ascend Operator startup parameter enableGangScheduling is true
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Scheduler:
      replicas: 1                       # Must be 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b
        spec:
          terminationGracePeriodSeconds: 360
          nodeSelector:
            host-arch: huawei-x86       # The Atlas 200T A2 Box16 heterogeneous subrack supports only the x86_64 architecture
            accelerator-type: module-{xxx}b-16   # Node type
          containers:
          - name: ascend                # Do not modify
...
            ports:
            - containerPort: 2222
              name: ascendjob-port
            volumeMounts:
...
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b
        spec:
          terminationGracePeriodSeconds: 360
          affinity:
...
          nodeSelector:
            host-arch: huawei-x86       # The Atlas 200T A2 Box16 heterogeneous subrack supports only the x86_64 architecture
            accelerator-type: module-{xxx}b-16   # Node type
          containers:
          - name: ascend                # Do not modify
...
            env:
            - name: ASCEND_VISIBLE_DEVICES
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['huawei.com/Ascend910']   # Must match the resources requests below
...
            ports:
            - containerPort: 2222
              name: ascendjob-port
            resources:
              limits:
                huawei.com/Ascend910: 4
              requests:
                huawei.com/Ascend910: 4
...
volumeMounts:          # Expanded shared memory for resumable training
- name: shm
  mountPath: /dev/shm
volumes:
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 16Gi
...
...
resources:
  requests:
    huawei.com/Ascend910: 8
    cpu: 100m        # 100 milliCPU. 100m CPU, 100 milliCPU, and 0.1 CPU are all the same
    memory: 100Gi    # 100 * 2^30 bytes of memory
  limits:
    huawei.com/Ascend910: 8
    cpu: 100m
    memory: 100Gi
...
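The units in the comments above can be checked with a short conversion. The following is a minimal sketch, assuming only the "m" suffix for CPU and the binary suffixes Ki, Mi, Gi, and Ti for memory; Kubernetes accepts more quantity forms than this. The function names cpu_to_cores and memory_to_bytes are hypothetical helpers for illustration.

```python
from decimal import Decimal

# Binary suffixes only; decimal suffixes such as "G" or "M" are not handled here.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_to_cores(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "100m" or "2" to a number of cores."""
    if quantity.endswith("m"):
        # "m" means milliCPU: one thousandth of a core.
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "100Gi" to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    # No recognized suffix: treat the value as a plain byte count.
    return int(quantity)


print(cpu_to_cores("100m"))
print(memory_to_bytes("100Gi") == 100 * 2**30)
```

Decimal is used instead of float so that 100m converts to exactly 0.1 without rounding noise.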
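Base images pulled from the Ascend image repository do not contain training scripts, code, or similar files. During training, these files are usually made available inside the container through mounts.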
volumeMounts:
- name: ascend-910-config
  mountPath: /user/serverid/devindex/config
- name: code
  mountPath: /job/code      # Training script path in the container
- name: data
  mountPath: /job/data      # Training dataset path in the container
- name: output
  mountPath: /job/output    # Training output path in the container
This step can be skipped when the graceful fault tolerance mode is used.
command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;"
...
command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024 --resume=true;"
...
...
volumeMounts:
- name: ascend-910-config
  mountPath: /user/serverid/devindex/config
- name: code
  mountPath: /job/code      # Training script path in the container
- name: data
  mountPath: /job/data      # Training dataset path in the container
- name: output
  mountPath: /job/output    # Training output path in the container
...
volumes:
...
- name: code
  nfs:
    server: 127.0.0.1       # NFS server IP address
    path: "xxxxxx"          # Path of the training scripts
- name: data
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"          # Path of the training dataset
- name: output
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"          # Path for saving the models produced by the script
...
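In Kubernetes, each volumeMounts entry refers by name to an entry under volumes, so a mismatch between the two lists is a common cause of a pod failing validation. The following is a minimal sketch of a name cross-check on plain Python data. The function unmatched_mounts is a hypothetical helper, and the lists only mirror the entries visible in the excerpt above; the elided parts ("...") are not filled in.

```python
def unmatched_mounts(volume_mounts, volumes):
    """Return the names of mounts that have no volume with the same name."""
    defined = {volume["name"] for volume in volumes}
    return [mount["name"] for mount in volume_mounts if mount["name"] not in defined]


# Entries visible in the excerpt; paths are kept as the "xxxxxx" placeholders.
volume_mounts = [
    {"name": "ascend-910-config", "mountPath": "/user/serverid/devindex/config"},
    {"name": "code", "mountPath": "/job/code"},
    {"name": "data", "mountPath": "/job/data"},
    {"name": "output", "mountPath": "/job/output"},
]
volumes = [
    {"name": "code", "nfs": {"server": "127.0.0.1", "path": "xxxxxx"}},
    {"name": "data", "nfs": {"server": "127.0.0.1", "path": "xxxxxx"}},
    {"name": "output", "nfs": {"server": "127.0.0.1", "path": "xxxxxx"}},
]

missing = unmatched_mounts(volume_mounts, volumes)
print(missing)
```

Here the check reports ascend-910-config only because its volume definition falls in the elided part of the excerpt; in a complete job file the result would be expected to be an empty list.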