This section describes how to configure the task YAML for the whole-card scheduling or static vNPU scheduling feature. If you configure resource information through environment variables, refer to the scenario of configuring resource information through environment variables; if you configure resource information through a file, refer to the scenario of configuring resource information through a file.
The following examples use the AscendJob API (Ascend Operator). They are grouped by feature name: whole-card scheduling (TensorFlow, PyTorch, and MindSpore examples) and static vNPU scheduling.

Note: To use the switch-affinity scheduling supported by the PyTorch and MindSpore frameworks, see the reference examples for configuring switch-affinity scheduling.
Whole-card scheduling (TensorFlow, single-node task):

```yaml
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-tensorflow
  labels:
    framework: tensorflow  # training framework
spec:
  schedulerName: volcano  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
  runPolicy:
    schedulingPolicy:  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
      minAvailable: 1  # total number of task replicas
      queue: default  # queue to which the task belongs
  successPolicy: AllWorkers  # condition for the task to be considered successful
  replicaSpecs:
    Chief:
      replicas: 1  # number of task replicas
      restartPolicy: Never
      template:
        spec:
          nodeSelector:
            host-arch: huawei-arm  # optional; set according to the actual environment
            accelerator-type: module  # node type
          containers:
          - name: ascend  # must be ascend; do not modify
            image: tensorflow-test:latest  # image name
            ...
            env:
            ...
            - name: ASCEND_VISIBLE_DEVICES  # this field is used by Ascend Docker Runtime
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['huawei.com/Ascend910']  # must be consistent with resources.requests below
            ...
            ports:  # collective communication port for distributed training
            - containerPort: 2222
              name: ascendjob-port
            resources:
              limits:
                huawei.com/Ascend910: 8  # number of NPUs requested
              requests:
                huawei.com/Ascend910: 8  # must match limits
            ...
```
After the modification, perform Step 2 to configure the other fields of the YAML.
Whole-card scheduling (PyTorch, single-node task):

```yaml
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-pytorch
  labels:
    framework: pytorch  # training framework
    tor-affinity: "normal-schema"  # whether the task uses switch-affinity scheduling; set it to null or omit the label to disable the feature. large-model-schema: large-model or padding task; normal-schema: ordinary task
spec:
  schedulerName: volcano  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
  runPolicy:
    schedulingPolicy:  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
      minAvailable: 1  # total number of task replicas
      queue: default  # queue to which the task belongs
  successPolicy: AllWorkers  # condition for the task to be considered successful
  replicaSpecs:
    Master:
      replicas: 1  # number of task replicas
      restartPolicy: Never
      template:
        spec:
          nodeSelector:
            host-arch: huawei-arm  # optional; set according to the actual environment
            accelerator-type: module  # node type
          containers:
          - name: ascend  # must be ascend; do not modify
            image: pytorch-test:latest  # image name
            ...
            env:
            ...
            - name: ASCEND_VISIBLE_DEVICES  # this field is used by Ascend Docker Runtime
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['huawei.com/Ascend910']  # must be consistent with resources.requests below
            ...
            ports:  # collective communication port for distributed training
            - containerPort: 2222
              name: ascendjob-port
            resources:
              limits:
                huawei.com/Ascend910: 1  # number of NPUs requested by the task
              requests:
                huawei.com/Ascend910: 1  # must match limits
            ...
```
After the modification, perform Step 2 to configure the other fields of the YAML.
The "replicas" field of the Chief (TensorFlow), Master (PyTorch), and Scheduler (MindSpore) roles must not exceed 1. For single-node tasks, the TensorFlow and PyTorch frameworks do not need a Worker. For single-card tasks, the MindSpore framework does not need a Scheduler.
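For illustration, a minimal replicaSpecs sketch (assumed values) of a single-card MindSpore task, which per the rule above needs no Scheduler:

```yaml
# Minimal sketch, assuming a single-node, single-card MindSpore task;
# only the Worker role is declared because no Scheduler is needed.
replicaSpecs:
  Worker:
    replicas: 1  # single replica for a single-card task
    restartPolicy: Never
    template:
      spec:
        containers:
        - name: ascend  # must be ascend; do not modify
          image: mindspore-test:latest  # image name as in the examples here
          resources:
            limits:
              huawei.com/Ascend910: 1  # one NPU
            requests:
              huawei.com/Ascend910: 1  # must match limits
```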
Whole-card scheduling (TensorFlow, distributed task):

```yaml
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-tensorflow  # task name
  labels:
    framework: tensorflow  # training framework name
    ring-controller.atlas: ascend-{xxx}b  # identifies the product type
spec:
  schedulerName: volcano  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
  runPolicy:
    schedulingPolicy:  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
      minAvailable: 2  # total number of task replicas
      queue: default  # queue to which the task belongs
  successPolicy: AllWorkers  # condition for the task to be considered successful
  replicaSpecs:
    Chief:
      replicas: 1  # number of task replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b  # identifies the product type
        spec:
          affinity:  # this block schedules the Pods of a distributed task onto different nodes
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: job-name
                    operator: In
                    values:
                    - default-test-tensorflow  # must match the task name above
                topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm  # optional; set according to the actual environment
            accelerator-type: module-{xxx}b-8  # node type
          containers:
          - name: ascend  # must be ascend; do not modify
            image: tensorflow-test:latest  # image name
            ...
            resources:
              limits:
                huawei.com/Ascend910: 8  # number of NPUs requested
              requests:
                huawei.com/Ascend910: 8  # must match limits
            volumeMounts:
            ...
          volumes:
          ...
    Worker:
      replicas: 1  # number of task replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b  # identifies the product type
        spec:
          affinity:  # this block schedules the Pods of a distributed task onto different nodes
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: job-name
                    operator: In
                    values:
                    - default-test-tensorflow  # must match the task name above
                topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm  # optional; set according to the actual environment
            accelerator-type: module-{xxx}b-8  # node type
          containers:
          - name: ascend  # must be ascend; do not modify
            ...
            env:
            ...
            - name: ASCEND_VISIBLE_DEVICES  # this field is used by Ascend Docker Runtime
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['huawei.com/Ascend910']  # must be consistent with resources.requests below
            ...
            ports:  # collective communication port for distributed training
            - containerPort: 2222
              name: ascendjob-port
            resources:
              limits:
                huawei.com/Ascend910: 8  # number of NPUs requested by the task
              requests:
                huawei.com/Ascend910: 8  # must match limits
            volumeMounts:
            ...
          volumes:
          ...
```
After the modification, perform Step 2 to configure the other fields of the YAML.
Whole-card scheduling (PyTorch, 16 NPUs with the sp-block annotation):

```yaml
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-pytorch
  labels:
    framework: pytorch  # framework type
    ring-controller.atlas: ascend-{xxx}b  # identifies the product type
  annotations:
    sp-block: "16"  # must match the number of NPUs requested
spec:
  schedulerName: volcano  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
  runPolicy:
    schedulingPolicy:  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
      minAvailable: 1  # total number of task replicas
      queue: default  # queue to which the task belongs
  successPolicy: AllWorkers  # condition for the task to be considered successful
  replicaSpecs:
    Master:
      replicas: 1  # number of task replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b
        spec:
          nodeSelector:
            host-arch: huawei-arm  # optional; set according to the actual environment
          containers:
          - name: ascend  # must be ascend; do not modify
            image: pytorch-test:latest  # training base image
            imagePullPolicy: IfNotPresent
            env:
            ...
            - name: ASCEND_VISIBLE_DEVICES  # this field is used by Ascend Docker Runtime
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['huawei.com/Ascend910']
            ...
            ports:  # collective communication port for distributed training
            - containerPort: 2222  # determined by user
              name: ascendjob-port  # do not modify
            resources:
              limits:
                huawei.com/Ascend910: 16  # number of NPUs requested by the task
              requests:
                huawei.com/Ascend910: 16  # must match limits
            ...
```
After the modification, perform Step 2 to configure the other fields of the YAML.
Static vNPU scheduling (TensorFlow):

```yaml
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-tensorflow
  labels:
    framework: tensorflow  # training framework
    ring-controller.atlas: ascend-910  # identifies the product type
spec:
  schedulerName: volcano  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
  runPolicy:
    schedulingPolicy:  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
      minAvailable: 1  # total number of task replicas
      queue: default  # queue to which the task belongs
  successPolicy: AllWorkers  # condition for the task to be considered successful
  replicaSpecs:
    Chief:
      replicas: 1  # number of task replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910  # identifies the product type
        spec:
          nodeSelector:
            host-arch: huawei-arm  # optional; set according to the actual environment
            accelerator-type: module-{xxx}b-8  # node type
          containers:
          - name: ascend  # must be ascend; do not modify
            image: tensorflow-test:latest  # image name
            ...
            env:
            ...
            # Static vNPU scheduling does not yet support the ASCEND_VISIBLE_DEVICES fields; delete the following fields
            - name: ASCEND_VISIBLE_DEVICES
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['huawei.com/Ascend910']
            ...
            ports:  # collective communication port for distributed training
            - containerPort: 2222
              name: ascendjob-port
            resources:
              limits:
                huawei.com/Ascend910-2c: 1  # for vNPU scheduling this quantity must be 1
              requests:
                huawei.com/Ascend910-2c: 1  # for vNPU scheduling this quantity must be 1
            volumeMounts:
            ...
```
After the modification, perform Step 2 to configure the other fields of the YAML.
Whole-card scheduling (MindSpore, distributed task):

```yaml
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-mindspore
  labels:
    framework: mindspore  # training framework name
    ring-controller.atlas: ascend-{xxx}b  # identifies the product type
spec:
  schedulerName: volcano  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
  runPolicy:
    schedulingPolicy:  # takes effect when the Ascend Operator startup parameter enableGangScheduling is true
      minAvailable: 2  # total number of task replicas
      queue: default  # queue to which the task belongs
  successPolicy: AllWorkers  # condition for the task to be considered successful
  replicaSpecs:
    Scheduler:
      replicas: 1  # number of task replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b  # identifies the product type
        spec:
          hostNetwork: true  # optional; set according to the actual environment. true: Pods can be created with the host IP; false: they cannot
          affinity:  # this block schedules the Pods of a distributed task onto different nodes
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: job-name
                    operator: In
                    values:
                    - default-test-mindspore  # must match the task name above
                topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm  # optional; set according to the actual environment
            accelerator-type: module-{xxx}b-8  # node type
          containers:
          - name: ascend  # must be ascend; do not modify
            image: mindspore-test:latest  # image name
            imagePullPolicy: IfNotPresent
            ...
            env:
            - name: HCCL_IF_IP  # optional; set according to the actual environment
              valueFrom:  # if hostNetwork is set to true, the HCCL_IF_IP environment variable must also be configured
                fieldRef:  # if hostNetwork is absent or set to false, do not configure HCCL_IF_IP
                  fieldPath: status.hostIP
            ...
            ports:  # collective communication port for distributed training
            - containerPort: 2222
              name: ascendjob-port
            resources:
              limits:
                huawei.com/Ascend910: 8  # number of NPUs requested
              requests:
                huawei.com/Ascend910: 8  # must match limits
            volumeMounts:
            ...
          volumes:
          ...
    Worker:
      replicas: 1  # number of task replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b  # identifies the product type
        spec:
          hostNetwork: true  # optional; set according to the actual environment. true: Pods can be created with the host IP; false: they cannot
          affinity:  # this block schedules the Pods of a distributed task onto different nodes
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: job-name
                    operator: In
                    values:
                    - default-test-mindspore  # must match the task name above
                topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm  # optional; set according to the actual environment
            accelerator-type: module-{xxx}b-8  # node type
          containers:
          - name: ascend  # must be ascend; do not modify
            ...
            env:
            - name: HCCL_IF_IP  # optional; set according to the actual environment
              valueFrom:  # if hostNetwork is set to true, the HCCL_IF_IP environment variable must also be configured
                fieldRef:  # if hostNetwork is absent or set to false, do not configure HCCL_IF_IP
                  fieldPath: status.hostIP
            ...
            - name: ASCEND_VISIBLE_DEVICES  # this field is used by Ascend Docker Runtime
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['huawei.com/Ascend910']  # must be consistent with resources.requests below
            ...
            ports:  # collective communication port for distributed training
            - containerPort: 2222
              name: ascendjob-port
            resources:
              limits:
                huawei.com/Ascend910: 8  # number of NPUs requested
              requests:
                huawei.com/Ascend910: 8  # must match limits
            volumeMounts:
            ...
          volumes:
          ...
```
Configuring the YAML for whole-card scheduling and static vNPU scheduling differs only in Step 1; after Step 1, the operations for the two features are identical.
```yaml
...
resources:
  requests:
    huawei.com/Ascend910: 8
    cpu: 100m
    memory: 100Gi
  limits:
    huawei.com/Ascend910: 8
    cpu: 100m
    memory: 100Gi
...
```
The base images pulled from the Ascend image repository do not contain training scripts, code, or other such files; during training, these files are usually mapped into the container by mounting.
```yaml
volumeMounts:
- name: ascend-server-config
  mountPath: /user/serverid/devindex/config
- name: code
  mountPath: /job/code  # path of the training scripts in the container
- name: data
  mountPath: /job/data  # path of the training dataset in the container
- name: output
  mountPath: /job/output  # path of the training output in the container
```
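Each name under volumeMounts must be backed by a volumes entry of the same name. A minimal matching sketch, assuming NFS storage as in the full volumes example later in this section (server IP and paths are placeholders):

```yaml
# Sketch of the volumes that back the code/data/output mounts above (NFS assumed)
volumes:
- name: code
  nfs:
    server: 127.0.0.1  # NFS server IP address (placeholder)
    path: "xxxxxx"  # path of the training scripts
- name: data
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"  # path of the training dataset
- name: output
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"  # path where the model output is saved
```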
TensorFlow:

```yaml
command:
- /bin/bash
- -c
args: [ "cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/resnet50/imagenet_TF/ --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export" ]
...
```
PyTorch:

```yaml
command:
- /bin/bash
- -c
args: ["cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code /job/output main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --epochs=90 --batch-size=512"]
...
```
MindSpore:

```yaml
command:
- /bin/bash
- -c
args: ["cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code/ /job/code/output train.py --data_path=/job/data/resnet50/imagenet/train --config=/job/code/config/resnet50_imagenet2012_config.yaml"]
...
```
```yaml
...
volumeMounts:
- name: ascend-server-config
  mountPath: /user/serverid/devindex/config
- name: code
  mountPath: /job/code  # path of the training scripts in the container
- name: data
  mountPath: /job/data  # path of the training dataset in the container
- name: output
  mountPath: /job/output  # path of the training output in the container
...
# Optional: to have the component generate the RankTable file for the training task,
# add the following fields to set the save path of hccl.json in the container; this path cannot be modified
- name: ranktable
  mountPath: /user/serverid/devindex/config
...
volumes:
...
- name: code
  nfs:
    server: 127.0.0.1  # NFS server IP address
    path: "xxxxxx"  # path of the training scripts
- name: data
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"  # path of the training dataset
- name: output
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"  # path where the model output is saved
...
# Optional: to have the component generate the RankTable file for the PyTorch framework,
# add the following fields to set the save path of the hccl.json file
- name: ranktable  # do not modify the default value; Ascend Operator uses it to check whether file-mounted hccl.json is enabled
  hostPath:  # use a hostPath or NFS mount
    path: /user/mindx-dl/ranktable/default.default-test-pytorch  # shared-storage or local-storage path; /user/mindx-dl/ranktable/ is the prefix and must match the RankTable root directory mounted by Ascend Operator; default.default-test-pytorch is the suffix and should be changed to namespace.job-name
```
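As an illustration of the namespace.job-name convention above, a hypothetical job named my-job in namespace train would use:

```yaml
# Hypothetical names: namespace "train", job name "my-job"
- name: ranktable
  hostPath:
    path: /user/mindx-dl/ranktable/train.my-job  # fixed prefix plus namespace.job-name suffix
```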
The following examples use the Volcano Job API. They are again grouped by feature name: whole-card scheduling and static vNPU scheduling.

Note: To use the switch-affinity scheduling supported by the PyTorch and MindSpore frameworks, see the reference examples for configuring switch-affinity scheduling.
Whole-card scheduling (TensorFlow, single-node task):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindx-dls-test  # the name after rings-config- must match the task name
  ...
  labels:
    ring-controller.atlas: ascend-910  # identifies the product type of the NPUs used by the task
...
---
apiVersion: batch.volcano.sh/v1alpha1  # do not modify; the Volcano API must be used
kind: Job  # currently only the Job type is supported
metadata:
  name: mindx-dls-test  # task name; customizable
  ...
spec:
  minAvailable: 1  # 1 for a single node
  ...
  - name: "default-test"
    replicas: 1  # 1 for a single node
    template:
      metadata:
        ...
      spec:
        ...
        containers:
        - image: tensorflow-test:latest  # image name
          ...
          env:
          ...
          - name: ASCEND_VISIBLE_DEVICES  # this field is used by Ascend Docker Runtime
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['huawei.com/Ascend910']  # must be consistent with resources.requests below
          ...
          resources:
            requests:
              huawei.com/Ascend910: 8  # 8 NPUs are required
            limits:
              huawei.com/Ascend910: 8  # currently must match requests above
          ...
        nodeSelector:
          host-arch: huawei-arm  # optional; set according to the actual environment
          accelerator-type: module  # schedule to an Atlas 800 training server
        ...
```
After the modification, perform Step 2 to configure the other fields of the YAML.
Whole-card scheduling (TensorFlow, distributed task):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindx-dls-test  # the name after rings-config- must match the task name
  ...
  labels:
    ring-controller.atlas: ascend-910  # identifies the product type of the NPUs used by the task
...
---
apiVersion: batch.volcano.sh/v1alpha1  # do not modify; the Volcano API must be used
kind: Job  # currently only the Job type is supported
metadata:
  name: mindx-dls-test  # task name; customizable
  ...
spec:
  minAvailable: 2  # 2 for a two-node distributed task, N for N nodes; not required for Deployment-type tasks
  ...
  - name: "default-test"
    replicas: 2  # N for an N-node distributed scenario
    template:
      metadata:
        ...
      spec:
        affinity:  # this block schedules the Pods of a distributed task onto different nodes
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: volcano.sh/job-name  # fixed field for vcjob; when the task type is Deployment, the key is deploy-name
                  operator: In  # fixed field
                  values:
                  - mindx-dls-test  # must match the task name above
              topologyKey: kubernetes.io/hostname
        containers:
        - image: tensorflow-test:latest  # image name
          ...
          env:
          ...
          - name: ASCEND_VISIBLE_DEVICES  # this field is used by Ascend Docker Runtime
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['huawei.com/Ascend910']  # must be consistent with resources.requests below
          resources:
            requests:
              huawei.com/Ascend910: 8  # 8 NPUs are required; lines for memory, cpu, and other resources can be added below
            limits:
              huawei.com/Ascend910: 8  # currently must match requests above
          ...
        nodeSelector:
          host-arch: huawei-arm  # optional; set according to the actual environment
          accelerator-type: module  # schedule to an Atlas 800 training server
        ...
```
After the modification, perform Step 2 to configure the other fields of the YAML.
Whole-card scheduling (TensorFlow, distributed task on Atlas 800T A2):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindx-dls-test  # the name after rings-config- must match the task name
  ...
  labels:
    ring-controller.atlas: ascend-{xxx}b  # product type
...
---
apiVersion: batch.volcano.sh/v1alpha1  # do not modify; the Volcano API must be used
kind: Job  # currently only the Job type is supported
metadata:
  name: mindx-dls-test  # task name
  ...
  labels:
    ring-controller.atlas: ascend-{xxx}b  # must match the label in the ConfigMap; do not modify
  ...
spec:
  minAvailable: 2  # recommended to equal the number of nodes below
  schedulerName: volcano  # use Volcano for scheduling
  ...
  tasks:
  - name: "default-test"
    replicas: 2  # number of nodes
    template:
      metadata:
        labels:
          app: tf
          ring-controller.atlas: ascend-{xxx}b  # must match the label in the ConfigMap; do not modify
      spec:
        affinity:  # this block schedules the Pods of a distributed task onto different nodes
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: volcano.sh/job-name  # fixed field for vcjob; when the task type is Deployment, the key is deploy-name
                  operator: In  # fixed field
                  values:
                  - mindx-dls-test  # must match the task name above
              topologyKey: kubernetes.io/hostname
        containers:
        - image: tensorflow-test:latest  # training framework image; modify as required
          ...
          env:
          ...
          - name: XDL_IP  # this block is fixed
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: framework
            value: "Tensorflow"  # modify according to the actual framework
          - name: ASCEND_VISIBLE_DEVICES  # this field is used by Ascend Docker Runtime
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['huawei.com/Ascend910']  # must be consistent with resources.requests below
          ...
          resources:
            requests:
              huawei.com/Ascend910: 8  # each Atlas 800T A2 training server has at most 8 NPUs
            limits:
              huawei.com/Ascend910: 8  # each Atlas 800T A2 training server has at most 8 NPUs
          ...
        nodeSelector:
          host-arch: huawei-arm  # optional; set according to the actual environment
          accelerator-type: module-{xxx}b-8  # schedule to an Atlas 800T A2 training server node
        ...
```
After the modification, perform Step 2 to configure the other fields of the YAML.
Whole-card scheduling (PyTorch, with the switch-affinity label):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindx-dls-test  # the name after rings-config- must match the task name
  namespace: vcjob
  labels:
    ring-controller.atlas: ascend-{xxx}b  # product type
data:
  hccl.json: |
    {
        "status":"initializing"
    }
---
apiVersion: batch.volcano.sh/v1alpha1  # do not modify; the Volcano API must be used
kind: Job  # currently only the Job type is supported
metadata:
  ...
  labels:
    ring-controller.atlas: ascend-{xxx}b  # must match the label in the ConfigMap; do not modify
    fault-scheduling: "force"
    tor-affinity: "normal-schema"  # whether the task uses switch-affinity scheduling; set it to null or omit the label to disable the feature. large-model-schema: large-model or padding task; normal-schema: ordinary task
spec:
  minAvailable: 1  # recommended to equal the number of nodes below
  schedulerName: volcano  # use Volcano for scheduling
  ...
  tasks:
  - name: "default-test"
    replicas: 1  # number of nodes
    template:
      metadata:
        labels:
          app: pytorch
          ring-controller.atlas: ascend-{xxx}b  # must match the label in the ConfigMap; do not modify
      spec:
        affinity:  # this block schedules the Pods of a distributed task onto different nodes
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: volcano.sh/job-name
                  operator: In
                  values:
                  - mindx-dls-test
              topologyKey: kubernetes.io/hostname
        hostNetwork: true
        containers:
        - image: torch:b030  # training framework image; modify as required
          env:
          - name: XDL_IP  # this block is fixed
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: framework
            value: "PyTorch"
          ...
          - name: ASCEND_VISIBLE_DEVICES
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['huawei.com/Ascend910']  # must be consistent with resources.requests below
          ...
          resources:
            requests:
              huawei.com/Ascend910: 8  # each Atlas 800T A2 training server has at most 8 NPUs
            limits:
              huawei.com/Ascend910: 8  # each Atlas 800T A2 training server has at most 8 NPUs
          ...
        nodeSelector:
          host-arch: huawei-x86  # optional; set according to the actual environment
          accelerator-type: module-{xxx}b-8  # schedule to an Atlas 800T A2 training server node
        ...
```

After the modification, perform Step 2 to configure the other fields of the YAML.
Static vNPU scheduling (TensorFlow):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindx-dls-test  # the name after rings-config- must match the task name
  ...
  labels:
    ring-controller.atlas: ascend-910  # product type
...
---
apiVersion: batch.volcano.sh/v1alpha1  # do not modify; the Volcano API must be used
kind: Job  # currently only the Job type is supported
metadata:
  name: mindx-dls-test  # task name; customizable
  ...
spec:
  minAvailable: 1  # must be 1 for vNPU scheduling
  ...
  - name: "default-test"
    replicas: 1  # must be 1 for vNPU scheduling
    template:
      metadata:
        ...
      spec:
        ...
        containers:
        - image: tensorflow-test:latest  # training image
          ...
          env:
          ...
          # Static vNPU scheduling does not yet support the ASCEND_VISIBLE_DEVICES fields; delete the following fields
          - name: ASCEND_VISIBLE_DEVICES
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['huawei.com/Ascend910']
          ...
          resources:
            requests:
              huawei.com/Ascend910-2c: 1  # for vNPU scheduling this quantity must be 1
            limits:
              huawei.com/Ascend910-2c: 1  # for vNPU scheduling this quantity must be 1
          ...
        nodeSelector:
          host-arch: huawei-arm  # optional; set according to the actual environment
          accelerator-type: module  # schedule to an Atlas 800 training server
        ...
```
After the modification, perform Step 2 to configure the other fields of the YAML.
Configuring the YAML for whole-card scheduling and static vNPU scheduling differs only in Step 1; after Step 1, the operations for the two features are identical.
```yaml
...
resources:
  requests:
    huawei.com/Ascend910: 8
    cpu: 100m
    memory: 100Gi
  limits:
    huawei.com/Ascend910: 8
    cpu: 100m
    memory: 100Gi
...
```
The base images pulled from the Ascend image repository do not contain training scripts, code, or other such files; during training, these files are usually mapped into the container by mounting.
```yaml
volumeMounts:
- name: ascend-910-config
  mountPath: /user/serverid/devindex/config
- name: code
  mountPath: /job/code  # path of the training scripts in the container
- name: data
  mountPath: /job/data  # path of the training dataset in the container
- name: output
  mountPath: /job/output  # path of the training output in the container
```
command: - "/bin/bash" - "-c" - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;" ...
command: - "/bin/bash" - "-c" - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --world-size=1 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024;" ...
command: - "/bin/bash" - "-c" - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ train.py --config_path=/job/code/config/resnet50_imagenet2012_config.yaml --output_dir=/job/output --run_distribute=True --device_num=8 --data_path=/job/data/imagenet/train" ...
```yaml
...
volumeMounts:
- name: ascend-910-config
  mountPath: /user/serverid/devindex/config
- name: code
  mountPath: /job/code  # path of the training scripts in the container
- name: data
  mountPath: /job/data  # path of the training dataset in the container
- name: output
  mountPath: /job/output  # path of the training output in the container
...
volumes:
...
- name: code
  nfs:
    server: 127.0.0.1  # NFS server IP address
    path: "xxxxxx"  # path of the training scripts
- name: data
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"  # path of the training dataset
- name: output
  nfs:
    server: 127.0.0.1
    path: "xxxxxx"  # path where the model output is saved
...
```
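Note that the ascend-910-config mount has no matching entry in the volumes excerpt above; in the complete task YAMLs it is typically backed by the rings-config ConfigMap defined at the top of the file. A minimal sketch under that assumption:

```yaml
# Sketch (assumption): the hccl.json ConfigMap backs the ascend-910-config mount
volumes:
- name: ascend-910-config
  configMap:
    name: rings-config-mindx-dls-test  # must match the ConfigMap name in the task YAML
```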