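Training jobs support the cifar10 and imagenet2012 datasets. The cifar10 dataset is built in. To use imagenet2012, download the dataset (subject to the dataset provider's terms of use) and mount it on the physical machine.
To obtain the job YAML files, see the YAML download address and select the file that matches your environment.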
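YAML file | Description |
---|---|
infer-volcano.yaml | Inference environment that uses Volcano as the scheduler |
train_cifar_vcjob.yaml | Training environment that uses the cifar10 dataset |
train_imagenet_vcjob.yaml | Training environment that uses the imagenet2012 dataset |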
To submit a training job that uses the imagenet2012 dataset, first upload the dataset to the current environment, and then mount it.
    - name: data
      hostPath:
        path: "/data/imagenet"  # Configure the path of the training set.
The following uses the training job train_cifar_vcjob.yaml or train_imagenet_vcjob.yaml as an example.
    containers:
    - image: ascendhub.huawei.com/public-ascendhub/mindspore-modelzoo:22.0.0  # Training framework image, which can be modified.
      imagePullPolicy: IfNotPresent
      name: mindspore
      env:
      - name: mindx-dls-test  # The value must be consistent with the value of JobName.
For x86 nodes:

    nodeSelector:
      host-arch: huawei-x86

For ARM nodes:

    nodeSelector:
      host-arch: huawei-arm
    spec:
      minAvailable: 2  # The value of minAvailable is 1 in a single-node scenario and N in an N-node distributed scenario.
      schedulerName: volcano  # Use the Volcano scheduler to schedule jobs.
      policies:
        - event: PodEvicted
          action: RestartJob
      plugins:
        ssh: []
        env: []
        svc: []
      maxRetry: 3
      queue: default
      tasks:
      - name: "default-test"
        replicas: 2
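To run the job on a different number of nodes, the node count has to be changed in the job spec. The following is a minimal sketch, assuming that `minAvailable` and the task's `replicas` should both equal the node count, as they do in the two-node example above. The helper name `set_node_count` is hypothetical and is not part of Volcano or kubectl.

```shell
# Hypothetical helper: print a copy of a job YAML file with minAvailable and
# replicas both set to the given node count.
# Assumes each key is on its own line and is followed by an integer value.
set_node_count() {
  n=$1
  file=$2
  sed -e "s/^\( *minAvailable:\) *[0-9][0-9]*/\1 $n/" \
      -e "s/^\( *replicas:\) *[0-9][0-9]*/\1 $n/" "$file"
}
```

For example, `set_node_count 4 train_cifar_vcjob.yaml` would print a four-node variant to standard output and leave the original file untouched. Trailing comments on the edited lines are preserved.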
    resources:
      requests:
        huawei.com/Ascend910: 8  # Number of required NPUs. The maximum value is 8. You can add lines below to configure resources such as memory and CPU.
      limits:
        huawei.com/Ascend910: 8
    ...
    resources:
      requests:
        huawei.com/Ascend910: 8
        cpu: 100m  # 100 milliCPU. For example, 100m CPU, 100 milliCPU, and 0.1 CPU are all the same.
        memory: 100Gi  # 100 * 2^30 bytes of memory.
      limits:
        huawei.com/Ascend910: 8
        cpu: 100m
        memory: 100Gi
    ...
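The `cpu` and `memory` values use Kubernetes quantity notation. As a rough illustration of the arithmetic described in the comments, the sketch below converts a CPU quantity to millicores and a memory quantity to bytes. The function names are hypothetical, and only the `m`, `Ki`, `Mi`, and `Gi` suffixes are handled.

```shell
# Convert a CPU quantity such as "100m" or "0.1" to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;   # already in millicores: drop the suffix
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# Convert a memory quantity with a binary suffix (Ki, Mi, Gi) to bytes.
# A value without a recognized suffix is printed unchanged.
mem_to_bytes() {
  num=${1%??}   # numeric part, with the two-character suffix removed
  case "$1" in
    *Ki) echo $((num * 1024)) ;;
    *Mi) echo $((num * 1024 * 1024)) ;;
    *Gi) echo $((num * 1024 * 1024 * 1024)) ;;
    *)   printf '%s\n' "$1" ;;
  esac
}

cpu_to_millicores 100m   # prints 100
cpu_to_millicores 0.1    # prints 100
mem_to_bytes 100Gi       # prints 107374182400
```

Both CPU spellings give the same result, and `100Gi` is 100 * 2^30 bytes.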
kubectl apply -f train_cifar_vcjob.yaml
kubectl apply -f train_imagenet_vcjob.yaml
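Once the job has been applied, one pod is created per replica. The pod name used in the log command in this section, `mindx-dls-test-default-test-0`, appears to combine the job name, the task name, and a zero-based replica index. Assuming that pattern holds, the following sketch lists the pod names to expect. The helper name `pod_names` is hypothetical.

```shell
# Print the expected pod names for one task of a job.
# Assumption: pods are named <job>-<task>-<index>, with index starting at 0.
pod_names() {
  job=$1
  task=$2
  replicas=$3
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s-%s-%s\n' "$job" "$task" "$i"
    i=$((i + 1))
  done
}

pod_names mindx-dls-test default-test 2
# prints:
# mindx-dls-test-default-test-0
# mindx-dls-test-default-test-1
```

Each printed name can then be passed to `kubectl logs` together with the job's namespace.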
Run the following command on the management node.
kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
    ...
    Train epoch time: 14614.743 ms, per step time: 2435.791 ms
    epoch: 83 step: 6, loss is 6.024412
    Train epoch time: 8068.264 ms, per step time: 1344.711 ms
    epoch: 84 step: 6, loss is 6.022005
    Train epoch time: 6966.450 ms, per step time: 1161.075 ms
    epoch: 85 step: 6, loss is 6.052724
    Train epoch time: 8519.337 ms, per step time: 1419.889 ms
    epoch: 86 step: 6, loss is 5.9838204
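To follow the loss over time, the `epoch: ... step: ..., loss is ...` lines can be filtered out of the log. The following is a minimal sketch, assuming the log keeps exactly the line format shown in the sample output. The function name `extract_loss` is hypothetical.

```shell
# Read a training log on stdin and print "epoch step loss" for every
# line of the form "epoch: N step: M, loss is X". Other lines are ignored.
extract_loss() {
  sed -n 's/^epoch: \([0-9]*\) step: \([0-9]*\), loss is \([0-9.]*\).*$/\1 \2 \3/p'
}

printf '%s\n' \
  'Train epoch time: 14614.743 ms, per step time: 2435.791 ms' \
  'epoch: 83 step: 6, loss is 6.024412' | extract_loss
# prints: 83 6 6.024412
```

Against a running job, the same function can take its input from the log command, for example `kubectl logs -n vcjob mindx-dls-test-default-test-0 | extract_loss`.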
kubectl delete -f train_cifar_vcjob.yaml
kubectl delete -f train_imagenet_vcjob.yaml
    configmap "rings-config-mindx-dls-test" deleted
    job.batch.volcano.sh "mindx-dls-test" deleted
The following uses the inference job infer-volcano.yaml as an example.
For x86 nodes:

    nodeSelector:
      host-arch: huawei-x86

For ARM nodes:

    nodeSelector:
      host-arch: huawei-arm
    resources:
      requests:
        huawei.com/Ascend310P: 1  # Number of required NPUs. The maximum value is 8. You can add lines below to configure resources such as memory and CPU.
      limits:
        huawei.com/Ascend310P: 1
    ...
    resources:
      requests:
        huawei.com/Ascend310P: 1
        cpu: 100m  # 100 milliCPU. For example, 100m CPU, 100 milliCPU, and 0.1 CPU are all the same.
        memory: 100Gi  # 100 * 2^30 bytes of memory.
      limits:
        huawei.com/Ascend310P: 1
        cpu: 100m
        memory: 100Gi
    ...
kubectl apply -f infer_vcjob.yaml
kubectl logs -n {namespace} {pod-name}
For example:
kubectl logs -n vcjob mindx-dls-test-default-test-0
    acl init success
    set device 0 success
    create context success
    create stream success
    get run mode success
    dvpp init resource success
    load model ../model/resnet50_aipp.om success
    create model description success
    create model output success
    model input width 224, input height 224
    init sigle op success
    start to process picture:../data/dog1_1024_683.jpg
    ...
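A quick way to confirm that the inference job started correctly is to check that the stage messages from the sample output are present in the log. The following is a minimal sketch. The function name `check_stages` is hypothetical, and the set of messages worth checking is an assumption based only on the output shown here.

```shell
# Read a log on stdin and verify that every message passed as an argument
# appears in it. Prints the first missing message and returns 1 otherwise.
check_stages() {
  log=$(cat)
  for stage in "$@"; do
    if ! printf '%s\n' "$log" | grep -qF -- "$stage"; then
      echo "missing: $stage"
      return 1
    fi
  done
  echo "all expected stages found"
}

printf '%s\n' 'acl init success' 'set device 0 success' 'create context success' |
  check_stages 'acl init success' 'create context success'
# prints: all expected stages found
```

With a live job, the log can be piped in directly, for example `kubectl logs -n vcjob mindx-dls-test-default-test-0 | check_stages 'acl init success' 'create model output success'`.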
kubectl delete -f infer_vcjob.yaml
    configmap "rings-config-mindx-dls-test" deleted
    job.batch.volcano.sh "mindx-dls-test" deleted