Affinity Scheduling Interconnection

To decouple the scheduling layer from the task resource type, the scheduling plugin Ascend-for-volcano supports pod-level scheduling policies. You can configure scheduling parameters in metadata.labels or metadata.annotations of a pod, without depending on a PodGroup. The supported resource types include acjob, vcjob, Job, Deployment, and StatefulSet.

Function Description

You can add specific labels or annotations to the pod template of a Kubernetes resource to control the core scheduling behavior of Volcano, including but not limited to the following:

  • Ascend AI processor-based affinity scheduling
  • Switch affinity scheduling
  • Affinity scheduling of logical SuperPoDs
  • Rescheduling upon faults

Prerequisite

Ensure that the Kubernetes cluster has been correctly deployed, Volcano has been configured, and Ascend-for-volcano has been enabled.

Example of Scheduling Policy Configuration

Take StatefulSet as an example. All labels and annotations related to scheduling must be configured under StatefulSet.spec.template.metadata to ensure that the scheduler can correctly read the labels and annotations from the pod instance.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mindx-dls-test               # The value must be the same as the name of the ConfigMap.
  labels:
    app: mindspore
    ring-controller.atlas: ascend-910
spec:
  replicas: 16                        # The value of replicas is 1 in a single-node scenario and N in an N-node scenario. The number of NPUs in the requests field is 8 in an N-node scenario.
  podManagementPolicy: Parallel       # Both OrderedReady and Parallel are supported. OrderedReady supports only intra-node affinity scheduling, and huawei.com/schedule_minAvailable can only be set to 1. Parallel supports both intra-node and inter-node affinity scheduling.
  serviceName: service-headliness
  selector:
    matchLabels:
      app: mindspore
  template:
    metadata:
      labels:
        app: mindspore
        ring-controller.atlas: ascend-910
        fault-scheduling: force       # Rescheduling upon faults
        pod-rescheduling: "on"        # Pod-level rescheduling
        fault-retry-times: "85"       # Number of rescheduling attempts when a service-plane fault occurs
        tor-affinity: large-model-schema  # Switch affinity scheduling
        deploy-name: mindx-dls-test   # This label must be added to generate RankTable. The value must be the same as the task name.
      annotations:
        sp-block: "128"           # Affinity scheduling of logical SuperPoDs
        huawei.com/recover_policy_path: pod   # Pod-level rescheduling
        huawei.com/schedule_minAvailable: "16" # Minimum number of replicas for job scheduling. It is recommended that the value be the same as the number of job replicas.
    spec:
      schedulerName: volcano         # Use the Volcano scheduler to schedule jobs.
      nodeSelector:
        host-arch: huawei-arm        # Configure the label based on the actual job.
      containers:
        - image: ubuntu:18.04      # Training framework image, which can be modified.
          name: mindspore
          resources:
            requests:
              huawei.com/Ascend910: 16   # Number of required NPUs. The maximum value is 16. You can add lines below to configure resources such as memory and CPU.
            limits:
              huawei.com/Ascend910: 16   # The value must be the same as that in requests.
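For the OrderedReady mode mentioned in the podManagementPolicy comment, only two fields differ from the example. The fragment below is a minimal sketch of those fields under their full paths. It is not a complete manifest, and every omitted field is assumed to stay as in the StatefulSet example.

```yaml
# Fragment only: the remaining fields are assumed to match the StatefulSet example.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mindx-dls-test
spec:
  podManagementPolicy: OrderedReady             # Supports only intra-node affinity scheduling.
  template:
    metadata:
      annotations:
        huawei.com/schedule_minAvailable: "1"   # Can only be set to 1 in OrderedReady mode.
```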
  • If a PodGroup is created, the scheduling configuration in the PodGroup spec overrides the labels and annotations of the pods it generates.
  • For resources that can generate PodGroups, you can also configure the corresponding scheduling policy in the PodGroup to implement affinity scheduling.
  • For details about the common labels and annotations, see PodGroup or Pod.
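The same pattern should carry over to the other supported resource types. As a hedged sketch, a Deployment might place the scheduling labels and annotations under Deployment.spec.template.metadata as shown below. The name mindx-dls-deploy, the replica count, and the NPU count per pod are assumptions chosen for illustration, and a ConfigMap with a matching name is assumed to exist.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mindx-dls-deploy                 # Hypothetical name used for illustration.
  labels:
    app: mindspore
spec:
  replicas: 2                            # Assumed two-node job.
  selector:
    matchLabels:
      app: mindspore
  template:
    metadata:
      labels:                            # Scheduling labels go in the pod template, not in the top-level metadata.
        app: mindspore
        ring-controller.atlas: ascend-910
        fault-scheduling: force          # Rescheduling upon faults
        deploy-name: mindx-dls-deploy    # Same as the task name; needed to generate RankTable.
      annotations:
        huawei.com/schedule_minAvailable: "2"   # Assumed equal to replicas, as recommended for the StatefulSet example.
    spec:
      schedulerName: volcano             # Use the Volcano scheduler.
      nodeSelector:
        host-arch: huawei-arm            # Adjust to the actual job.
      containers:
        - name: mindspore
          image: ubuntu:18.04            # Placeholder image, as in the StatefulSet example.
          resources:
            requests:
              huawei.com/Ascend910: 8    # Assumed 8 NPUs per pod.
            limits:
              huawei.com/Ascend910: 8    # Must be the same as requests.
```

Only a subset of the labels from the StatefulSet example is shown here; the others (for example tor-affinity or sp-block) would be added in the same two places if the job needs them.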