Switch Affinity Scheduling 2.0

Currently, only PyTorch supports switch affinity scheduling 2.0.

Instructions

  • Unless otherwise stated, a switch in this section refers to a leaf switch. Nodes under a switch can be used by multiple cross-switch jobs.
  • A cross-switch job is a job whose pods can be deployed on nodes under multiple switches.
  • Run kubectl describe cm -n volcano-system tor-share-cm to query the status of the switches in a cluster.

    The key fields in the ConfigMap and the related switch states are described as follows:

    • IsSharedTor: 0 for an idle switch, 1 for a shared switch, and 2 for an exclusive switch.
    • IsHealthy: 0 for a healthy shared switch and 1 for an unhealthy shared switch.
    • Exclusive switch: Only one cross-switch job runs on the switch, and new cross-switch jobs cannot be scheduled to nodes under it.
    • Shared switch: A switch that is used by multiple cross-switch jobs.
      • Healthy shared switch: The number of shared switches used by the jobs on the switch is less than or equal to the maximum number of shared switches in a cluster.
      • Unhealthy shared switch: The number of shared switches used by the jobs on the switch is greater than the maximum number of shared switches in a cluster.

        Foundation model jobs cannot be scheduled to nodes under unhealthy shared switches, while padding jobs and common jobs can.

    • Idle switch: Nodes under the switch run no jobs or only padding jobs.
  • The number of shared switches used by a job cannot exceed the maximum number of shared switches in a cluster.
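The status codes above can be decoded mechanically. The following is a minimal Python sketch, assuming the two values have already been read from the ConfigMap as integers. The names SwitchState and describe_switch are hypothetical and are not part of Volcano or the ConfigMap itself.

```python
from enum import IntEnum


class SwitchState(IntEnum):
    """IsSharedTor codes as listed above."""
    IDLE = 0
    SHARED = 1
    EXCLUSIVE = 2


def describe_switch(is_shared_tor: int, is_healthy: int) -> str:
    """Turn the IsSharedTor and IsHealthy codes into a readable label."""
    state = SwitchState(is_shared_tor)  # raises ValueError for an unknown code
    if state is SwitchState.SHARED:
        # IsHealthy is described for shared switches only: 0 = healthy, 1 = unhealthy.
        return "healthy shared" if is_healthy == 0 else "unhealthy shared"
    return state.name.lower()
```

For example, describe_switch(1, 1) labels a switch as an unhealthy shared switch, which foundation model jobs cannot use.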

Switch Affinity for Common Jobs

  • If cluster resources can satisfy the scheduling logic for foundation model jobs, that logic is used to schedule the common job.
  • If cluster resources cannot satisfy that logic, the job occupies all idle switches in the cluster, and the attribute of those switches is changed to exclusive. The remaining N unscheduled pods then use shared switches: they are scheduled preferentially to nodes under unhealthy shared switches, and then to nodes under shared switches that run only common jobs. If some pods still cannot be scheduled, the job status changes to Pending.
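The fallback order in the second case can be illustrated with a short Python sketch. It covers only that fallback path, not the foundation model logic. The dictionary keys (name, state, healthy, job_types, free), the function name plan_common_job, and the "Scheduled" status string are all hypothetical, chosen for illustration only.

```python
def plan_common_job(pods, switches):
    """Return (plan, status), where plan maps a switch name to a pod count.

    Each switch is a dict with hypothetical keys:
    name, state ("idle" / "shared" / "exclusive"), healthy, job_types, free.
    """
    # Step 1: idle switches are taken first (they would become exclusive).
    idle = [s for s in switches if s["state"] == "idle"]
    # Step 2: unhealthy shared switches are preferred for the leftover pods.
    unhealthy = [s for s in switches
                 if s["state"] == "shared" and not s["healthy"]]
    # Step 3: then shared switches that currently run only common jobs.
    common_only = [s for s in switches
                   if s["state"] == "shared" and s["healthy"]
                   and s["job_types"] == {"common"}]

    plan, remaining = {}, pods
    for switch in idle + unhealthy + common_only:
        if remaining == 0:
            break
        take = min(switch["free"], remaining)
        if take:
            plan[switch["name"]] = take
            remaining -= take

    # Any unplaced pod leaves the whole job Pending.
    return plan, ("Scheduled" if remaining == 0 else "Pending")
```

In this sketch, healthy shared switches that already host foundation model jobs are never offered to the common job, which mirrors the preference order described above.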

Switch Affinity for Foundation Model Jobs

Table 1 Node-based affinity policies

Affinity Scheduling Policy: Exclusive switch scheduling policy

Description: Idle switches are taken in descending order of available nodes, and all nodes on each switch are occupied, until N pods remain that cannot fully occupy the nodes on a single switch.

The attribute of each idle switch whose nodes are fully occupied is changed to exclusive. The N unscheduled pods use shared switches and comply with the shared switch scheduling policy.

Affinity Scheduling Policy: Shared switch scheduling policy

Description: The exclusive switch scheduling policy is applied first. If N pods remain unscheduled after the nodes under idle switches are occupied, the shared switch scheduling policy is used as follows:
  • When the number of shared switches that can be used by jobs in a cluster is 1:
    • The shared switch whose number of nodes is closest to N is selected for scheduling.
    • If no shared switch meets the requirement, the idle switch whose number of nodes is closest to N is selected for scheduling, and the attribute of the switch is changed to shared.
  • When the number of shared switches that can be used by jobs in a cluster is 2:
    • The single shared switch whose number of available nodes is closest to N, or the pair of shared switches whose total number of available nodes is closest to N, is selected for scheduling.
    • If one switch and a combination of two switches offer the same number of nodes, the combination of two switches is preferred.
    • If no shared switch meets the requirement, the idle or exclusive switch whose number of nodes is closest to N is selected for scheduling, and the attribute of the switch is changed to shared.
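The two policies in Table 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the scheduler's implementation: "closest to N" is assumed to mean the smallest number of available nodes that still fits all N pods, idle switches too large to be fully occupied are assumed to be skipped, and the function names and input dictionaries (switch name mapped to available nodes) are hypothetical. The fallback to idle or exclusive switches is not covered.

```python
from itertools import combinations


def fill_idle_switches(pods, idle_free):
    """Exclusive policy: fully occupy idle switches, largest first.

    Returns (names of switches that become exclusive, remaining N pods).
    """
    exclusive, remaining = [], pods
    # Descending by available nodes; ties broken by name for determinism.
    for name, free in sorted(idle_free.items(), key=lambda kv: (-kv[1], kv[0])):
        if 0 < free <= remaining:  # only take a switch that can be filled
            exclusive.append(name)
            remaining -= free
    return exclusive, remaining


def pick_shared_switches(n, shared_free, max_shared):
    """Shared policy: pick up to max_shared switches that best fit n pods.

    Returns a tuple of switch names, or None if no combination fits.
    """
    usable = sorted(name for name, free in shared_free.items() if free > 0)
    best = None
    for size in range(1, max_shared + 1):
        for combo in combinations(usable, size):
            surplus = sum(shared_free[name] for name in combo) - n
            if surplus < 0:
                continue  # this combination cannot hold all n pods
            # Least surplus wins; on a tie, the larger combination is preferred.
            key = (surplus, -size, combo)
            if best is None or key < best:
                best = key
    return best[2] if best else None
```

With max_shared set to 1 only single switches are considered. With max_shared set to 2, a pair whose total matches that of a single switch is preferred, following the tie rule above.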

Switch Affinity for Padding Jobs

Cross-switch scheduling is not allowed for padding jobs: all pods of a job must be deployed under a single switch. Nodes are selected preferentially under the exclusive switch whose number of nodes is closest to the number of job pods, then under a shared switch, and finally under an idle switch.
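The preference order for padding jobs can be expressed as a sort key. The sketch below is hypothetical: it assumes each switch is described by a (name, state, available nodes) tuple, and that "closest" means the smallest number of available nodes that still holds every pod of the job.

```python
def pick_switch_for_padding_job(pods, switches):
    """Pick one switch for a padding job, or return None if none fits.

    switches is a list of (name, state, free_nodes) tuples, where state
    is "exclusive", "shared", or "idle".
    """
    # Lower rank is preferred: exclusive, then shared, then idle.
    rank = {"exclusive": 0, "shared": 1, "idle": 2}
    # A padding job cannot span switches, so the switch must hold all pods.
    fitting = [s for s in switches if s[2] >= pods]
    if not fitting:
        return None
    # Within a rank, prefer the tightest fit; break ties by name.
    best = min(fitting, key=lambda s: (rank[s[1]], s[2] - pods, s[0]))
    return best[0]
```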

Rescheduling Upon Faults

When a node running a job, or an Ascend AI processor on that node, becomes faulty, the job is rescheduled. During rescheduling, pods on healthy nodes are scheduled back to their original nodes to continue training, and pods on faulty nodes are rescheduled to other nodes. Replacement nodes are selected preferentially from the exclusive switches that the job used before rescheduling, then from other nodes on shared switches, and finally from nodes that are not in use.
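The node preference for pods that must move can be sketched as a three-tier ordering. The node dictionaries and the function name below are hypothetical, and treating every node outside the first two tiers as "not in use" is an assumption made for illustration.

```python
def order_replacement_nodes(candidates, job_exclusive_switches):
    """Order candidate nodes for pods that were on faulty nodes.

    candidates is a list of dicts with hypothetical keys
    name, switch, and switch_state; job_exclusive_switches is the set of
    exclusive switches the job used before rescheduling.
    """
    def tier(node):
        if node["switch"] in job_exclusive_switches:
            return 0  # exclusive switch already used by this job
        if node["switch_state"] == "shared":
            return 1  # other nodes on shared switches
        return 2      # everything else, assumed to be unused nodes

    # sorted() is stable, so nodes keep their input order within a tier.
    return sorted(candidates, key=tier)
```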