
Ascend Job

Ascend Job (acjob for short) is a job type customized by MindCluster. It currently supports launching training or inference jobs in two ways: passing resource information through environment variables or through configuration files.

Supported AI frameworks

  • MindSpore
  • TensorFlow
  • PyTorch

Example

A sample pytorch_multinodes_acjob_910b.yaml is shown below.

apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-pytorch
  labels:
    framework: pytorch
    ring-controller.atlas: ascend-910b
    tor-affinity: "null" # whether the job uses switch (ToR) affinity scheduling; "null" or omitting the label disables it. large-model-schema marks a large-model job, normal-schema a normal job
spec:
  schedulerName: volcano   # takes effect when enableGangScheduling is true
  runPolicy:
    schedulingPolicy:      # takes effect when enableGangScheduling is true
      minAvailable: 2
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910b
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-pytorch
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: card-910b-2 # depends on your device model: use module-910b-8 for the 910b x8 module, module-910b-16 for the 910b x16 module
          containers:
          - name: ascend # do not modify
            image: pytorch-test:latest         # training framework image; can be modified
            imagePullPolicy: IfNotPresent
            env:
              - name: XDL_IP                                       # IP address of the physical node, which is used to identify the node where the pod is running
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
              # The ASCEND_VISIBLE_DEVICES env variable is used by ascend-docker-runtime in the whole-card scheduling scenario with the Volcano scheduler.
              # Delete it for static vNPU scheduling, dynamic vNPU scheduling, Volcano without Ascend-volcano-plugin, or scenarios without Volcano.
              - name: ASCEND_VISIBLE_DEVICES
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be the same as resources.requests
            command:                           # training command, which can be modified
              - /bin/bash
              - -c
            args: [ "cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code /job/output main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=4096" ]
            ports:                          # defaults to containerPort: 2222, name: ascendjob-port if not set
              - containerPort: 2222         # determined by user
                name: ascendjob-port        # do not modify
            resources:
              limits:
                huawei.com/Ascend910: 2
              requests:
                huawei.com/Ascend910: 2
            volumeMounts:
            - name: code
              mountPath: /job/code
            - name: data
              mountPath: /job/data
            - name: output
              mountPath: /job/output
            - name: ascend-driver
              mountPath: /usr/local/Ascend/driver
            - name: ascend-add-ons
              mountPath: /usr/local/Ascend/add-ons
            - name: dshm
              mountPath: /dev/shm
            - name: localtime
              mountPath: /etc/localtime
          volumes:
          - name: code
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/"
          - name: data
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/public/dataset/"
          - name: output
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/output/"
          - name: ascend-driver
            hostPath:
              path: /usr/local/Ascend/driver
          - name: ascend-add-ons
            hostPath:
              path: /usr/local/Ascend/add-ons
          - name: dshm
            emptyDir:
              medium: Memory
          - name: localtime
            hostPath:
              path: /etc/localtime
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910b
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-pytorch
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: card-910b-2 # depends on your device model: use module-910b-8 for the 910b x8 module, module-910b-16 for the 910b x16 module
          containers:
          - name: ascend # do not modify
            image: pytorch-test:latest                # training framework image; can be modified
            imagePullPolicy: IfNotPresent
            env:
              - name: XDL_IP                                       # IP address of the physical node, which is used to identify the node where the pod is running
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
              # The ASCEND_VISIBLE_DEVICES env variable is used by ascend-docker-runtime in the whole-card scheduling scenario with the Volcano scheduler.
              # Delete it for static vNPU scheduling, dynamic vNPU scheduling, Volcano without Ascend-volcano-plugin, or scenarios without Volcano.
              - name: ASCEND_VISIBLE_DEVICES
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be the same as resources.requests
            command:                                  # training command, which can be modified
              - /bin/bash
              - -c
            args: ["cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code /job/output main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=4096"]
            ports:                          # defaults to containerPort: 2222, name: ascendjob-port if not set
              - containerPort: 2222         # determined by user
                name: ascendjob-port        # do not modify
            resources:
              limits:
                huawei.com/Ascend910: 2
              requests:
                huawei.com/Ascend910: 2
            volumeMounts:
            - name: code
              mountPath: /job/code
            - name: data
              mountPath: /job/data
            - name: output
              mountPath: /job/output
            - name: ascend-driver
              mountPath: /usr/local/Ascend/driver
            - name: ascend-add-ons
              mountPath: /usr/local/Ascend/add-ons
            - name: dshm
              mountPath: /dev/shm
            - name: localtime
              mountPath: /etc/localtime
          volumes:
          - name: code
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/"
          - name: data
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/public/dataset/"
          - name: output
            nfs:
              server: 127.0.0.1
              path: "/data/atlas_dls/output/"
          - name: ascend-driver
            hostPath:
              path: /usr/local/Ascend/driver
          - name: ascend-add-ons
            hostPath:
              path: /usr/local/Ascend/add-ons
          - name: dshm
            emptyDir:
              medium: Memory
          - name: localtime
            hostPath:
              path: /etc/localtime
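
The sample above can be submitted and inspected with standard kubectl commands. A minimal sketch, assuming the manifest is saved as pytorch_multinodes_acjob_910b.yaml and the job runs in the default namespace (the job-name label on the Pods is an assumption inferred from the anti-affinity rule in the sample):

# Submit the AscendJob (requires the AscendJob CRD installed by MindCluster)
kubectl apply -f pytorch_multinodes_acjob_910b.yaml

# Query the aggregate job state (see Table 2 below)
kubectl get acjob default-test-pytorch

# List the Master/Worker Pods, matching the label used by the sample's anti-affinity rule
kubectl get pods -l job-name=default-test-pytorch -o wide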

Key fields

Table 1: acjob field descriptions

| Field path | Type | Format | Description |
| --- | --- | --- | --- |
| apiVersion | string | - | Defines the versioned schema of this representation of the object. The server converts recognized schemas to the latest internal value and rejects unrecognized versions. See Types for more information. |
| kind | string | - | The REST resource this object represents. The value is inferred from the endpoint, cannot be updated, and is in CamelCase. See Resources for more information. |
| metadata | object | - | Kubernetes metadata (such as namespace and labels). See Metadata for more information. |
| spec | object | - | Specification of the desired state of the AscendJob. Required field: replicaSpecs. |
| spec.replicaSpecs | object | - | A map from ReplicaType to ReplicaSpec that specifies the MS cluster configuration. Example: { "Scheduler": ReplicaSpec, "Worker": ReplicaSpec }. |
| spec.replicaSpecs.[ReplicaType] | object | - | Description of a replica. |
| spec.replicaSpecs.[ReplicaType].replicas | integer | int32 | Number of desired replicas for the given template. Defaults to 1. |
| spec.replicaSpecs.[ReplicaType].restartPolicy | string | - | Restart policy: Always, OnFailure, Never, or ExitCode. Defaults to Never. |
| spec.replicaSpecs.[ReplicaType].template | object | - | A Kubernetes Pod template. See the Kubernetes Pod template documentation for more information. |
| spec.runPolicy | object | - | Encapsulates runtime policies of the distributed training job, such as resource cleanup and active deadline. |
| spec.runPolicy.backoffLimit | integer | int32 | Optional number of retries before marking the job as failed. |
| spec.runPolicy.activeDeadlineSeconds | integer | int64 | Maximum time in seconds the job may remain active; must be a positive integer. Currently has no effect and will be removed in a later version. |
| spec.runPolicy.cleanPodPolicy | string | - | Policy for cleaning up Pods after the job completes. Defaults to Running. Currently has no effect and will be removed in a later version. |
| spec.runPolicy.ttlSecondsAfterFinished | integer | int32 | TTL (time to live) after the job finishes. Defaults to infinite; actual deletion may be delayed. Currently has no effect and will be removed in a later version. |
| spec.runPolicy.schedulingPolicy | object | - | Scheduling policy, such as gang scheduling. |
| spec.runPolicy.schedulingPolicy.minAvailable | integer | int32 | Minimum number of available resources. |
| spec.runPolicy.schedulingPolicy.minResources | object | - | Minimum set of resources, keyed by resource name (integer or string values supported). |
| spec.runPolicy.schedulingPolicy.priorityClass | string | - | Name of the priority class. |
| spec.runPolicy.schedulingPolicy.queue | string | - | Name of the scheduling queue. |
| spec.schedulerName | string | - | Scheduler to use when gang scheduling is enabled. Currently only Volcano is supported. |
| spec.successPolicy | string | - | Criterion for marking the AscendJob successful. Currently has no effect: the job is judged successful only when all Pods succeed. Will be removed in a later version. |
| status | object | - | Most recently observed status of the AscendJob (read-only). Required fields: conditions, replicaStatuses. |
| status.completionTime | string | date-time | Completion time of the job (RFC3339 format, UTC). |
| status.conditions | array | - | Array of current job conditions. |
| status.conditions[type] | string | - | Type of the job condition (for example, "Complete"). |
| status.conditions[status] | string | - | Status of the condition: True, False, or Unknown. |
| status.conditions[lastTransitionTime] | string | date-time | Time of the condition's last status transition. |
| status.conditions[lastUpdateTime] | string | date-time | Time the condition was last updated. |
| status.conditions[message] | string | - | Detailed description of the condition. |
| status.conditions[reason] | string | - | Reason for the condition transition. |
| status.lastReconcileTime | string | date-time | Time the job was last reconciled (RFC3339 format, UTC). |
| status.replicaStatuses | object | - | Map from replica type to replica status. |
| status.replicaStatuses.[ReplicaType].active | integer | int32 | Number of running Pods. |
| status.replicaStatuses.[ReplicaType].failed | integer | int32 | Number of failed Pods. |
| status.replicaStatuses.[ReplicaType].succeeded | integer | int32 | Number of succeeded Pods. |
| status.replicaStatuses.[ReplicaType].labelSelector | object | - | Pod label selector defining how Pods are filtered. |
| status.replicaStatuses.[ReplicaType].labelSelector.matchExpressions | array | - | Label matching rules (operators such as In, NotIn, Exists, and DoesNotExist). |
| status.replicaStatuses.[ReplicaType].labelSelector.matchLabels | object | - | Key-value pairs for label matching (each pair is equivalent to a matchExpressions condition). |
| status.startTime | string | date-time | Start time of the job (RFC3339 format, UTC). |
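
Most of the fields above are optional; replicaSpecs is the only required field under spec. Stripped down to the pieces the sample above relies on, a minimal manifest could look like the following sketch. The name, command, and Worker-only layout are illustrative assumptions; the PyTorch sample above additionally defines a Master replica.

apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: minimal-acjob                  # illustrative name
  labels:
    framework: pytorch
    ring-controller.atlas: ascend-910b
spec:
  schedulerName: volcano               # only Volcano is supported with gang scheduling
  runPolicy:
    schedulingPolicy:
      minAvailable: 1
      queue: default
  replicaSpecs:                        # required: map of ReplicaType -> ReplicaSpec
    Worker:
      replicas: 1                      # defaults to 1 if omitted
      restartPolicy: Never             # Always / OnFailure / Never / ExitCode
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910b
        spec:
          containers:
          - name: ascend               # do not modify
            image: pytorch-test:latest
            command: ["/bin/bash", "-c", "echo hello"]   # illustrative command
            resources:
              limits:
                huawei.com/Ascend910: 1
              requests:
                huawei.com/Ascend910: 1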

Job state descriptions

After a training job is launched, you can check the running state of the acjob with the kubectl get acjob command. The current running states are listed below.
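
Beyond the aggregate state, the status fields documented in Table 1 can be read directly with kubectl's jsonpath output. A brief sketch, assuming the sample job name from above:

# Per-replica Pod counters (active/failed/succeeded), see status.replicaStatuses
kubectl get acjob default-test-pytorch -o jsonpath='{.status.replicaStatuses}'

# Types of the current job conditions, see status.conditions
kubectl get acjob default-test-pytorch -o jsonpath='{.status.conditions[*].type}'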

Table 2: acjob running state descriptions

| State | Description |
| --- | --- |
| Created | The job has been created, but one or more of its subresources (Pods/Services) are not yet ready. |
| Running | All subresources (Pods/Services) of the job have been scheduled and started. |
| Restarting | One or more subresources (Pods/Services) of the job failed and are being restarted according to the restart policy. |
| Succeeded | All subresources (Pods/Services) of the job terminated successfully. |
| Failed | One or more subresources (Pods/Services) of the job failed. |