Switch Affinity Scheduling 2.0
Currently, only PyTorch supports switch affinity scheduling 2.0.
Instructions
- The switch mentioned in this section is a leaf switch by default. Nodes under a switch can be scheduled by multiple cross-switch jobs.
- A cross-switch job is a job whose pods can be deployed on nodes under multiple switches.
- You can run kubectl describe cm -n volcano-system tor-share-cm to query the status of a switch in a cluster.
The key fields in ConfigMap are described as follows:
- IsSharedTor: 0 for an idle switch, 1 for a shared switch, and 2 for an exclusive switch.
- IsHealthy: 0 for a healthy shared switch and 1 for an unhealthy shared switch.
- Exclusive switch: Only one cross-switch job exists on the switch, and new cross-switch jobs cannot be scheduled to nodes under the switch.
- Shared switch: A switch that is used by multiple cross-switch jobs.
- Healthy shared switch: The number of shared switches used by switch jobs is equal to or less than the maximum number of shared switches in a cluster.
- Unhealthy shared switch: The number of shared switches used by switch jobs is greater than the maximum number of shared switches in a cluster. Foundation model jobs cannot be scheduled to nodes under unhealthy shared switches, while padding jobs and common jobs can.
- Idle switch: Nodes under the switch have no jobs or only padding jobs.
- The number of shared switches used by a job cannot exceed the maximum number of shared switches in a cluster.
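The field values above can be decoded with a small helper. The following is a minimal sketch, assuming the IsSharedTor and IsHealthy values have already been extracted from tor-share-cm as integers; the names SharedTorState, describe_switch, and blocks_foundation_model_jobs are hypothetical and not part of the scheduler.

```python
from enum import IntEnum


class SharedTorState(IntEnum):
    """IsSharedTor values as described for tor-share-cm."""
    IDLE = 0
    SHARED = 1
    EXCLUSIVE = 2


def describe_switch(is_shared_tor: int, is_healthy: int) -> str:
    """Return a readable status for one switch entry (hypothetical helper)."""
    state = SharedTorState(is_shared_tor)  # raises ValueError on unknown values
    if state is SharedTorState.SHARED:
        # IsHealthy is only defined for shared switches: 0 healthy, 1 unhealthy.
        return "healthy shared" if is_healthy == 0 else "unhealthy shared"
    return state.name.lower()


def blocks_foundation_model_jobs(is_shared_tor: int, is_healthy: int) -> bool:
    """True for an unhealthy shared switch, which rejects foundation model jobs."""
    return is_shared_tor == SharedTorState.SHARED and is_healthy == 1


if __name__ == "__main__":
    print(describe_switch(1, 1))
    print(blocks_foundation_model_jobs(1, 1))
```

How the values are read out of the ConfigMap is left open here, because the exact layout of its data is not shown in this section.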
Switch Affinity for Common Jobs
- If cluster resources satisfy the scheduling logic for foundation model jobs, the common job is scheduled with that logic.
- If cluster resources do not satisfy the scheduling logic for foundation model jobs, the job occupies all idle switches in the cluster, and the attribute of those switches is changed to exclusive. The remaining N unscheduled pods then use shared switches: they are scheduled preferentially to nodes under unhealthy shared switches, and then to nodes under shared switches that have only common jobs. If some pods still cannot be scheduled, the job status is changed to Pending.
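The preference order for the remaining pods of a common job can be sketched as a sort. This is only an illustration: SharedSwitch and its only_common_jobs flag are hypothetical, and excluding healthy shared switches that also run other job types is an assumption drawn from the two targets listed above.

```python
from typing import List, NamedTuple


class SharedSwitch(NamedTuple):
    name: str
    is_healthy: int         # 0 = healthy, 1 = unhealthy (IsHealthy field)
    only_common_jobs: bool  # hypothetical flag: switch runs only common jobs


def order_shared_switches_for_common_job(
    switches: List[SharedSwitch],
) -> List[SharedSwitch]:
    """Unhealthy shared switches first, then shared switches with only common jobs."""
    # Assumption: other shared switches are not candidates for the remaining pods.
    eligible = [s for s in switches if s.is_healthy == 1 or s.only_common_jobs]
    # sorted() is stable, so input order is kept within each preference tier.
    return sorted(eligible, key=lambda s: 0 if s.is_healthy == 1 else 1)


if __name__ == "__main__":
    demo = [
        SharedSwitch("a", 0, True),
        SharedSwitch("b", 1, False),
        SharedSwitch("c", 0, False),
    ]
    print([s.name for s in order_shared_switches_for_common_job(demo)])
```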
Switch Affinity for Foundation Model Jobs
| Affinity Scheduling Policy | Description |
|---|---|
| Exclusive switch scheduling policy | Occupy all nodes on the idle switch based on the number of available nodes in descending order until remaining N pods are not scheduled or the nodes on a single switch cannot be fully occupied. The attribute of the idle switch with nodes fully occupied is changed to exclusive. The unscheduled N pods use the shared switch and comply with the shared switch scheduling policy. |
| Shared switch scheduling policy | Use the exclusive switch policy. If N pods are not scheduled after all nodes under the idle switch are occupied, use the shared switch scheduling policy. |
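The exclusive switch scheduling policy reads like a greedy pass over idle switches. The sketch below assumes one pod per node and stops at the first switch that the remaining pods cannot fill completely, which is one reading of "until" in the description; occupy_idle_switches and its arguments are hypothetical names.

```python
from typing import Dict, Tuple


def occupy_idle_switches(
    idle_switches: Dict[str, int], pods: int
) -> Tuple[Dict[str, int], int]:
    """Greedily take whole idle switches, largest first.

    idle_switches maps a switch name to its number of available nodes.
    Returns (switches that become exclusive, pods still unscheduled).
    """
    occupied: Dict[str, int] = {}
    remaining = pods
    # Visit switches in descending order of available nodes.
    for name, free in sorted(idle_switches.items(), key=lambda kv: kv[1], reverse=True):
        if free == 0:
            continue  # nothing to occupy on this switch
        if remaining < free:
            break  # this switch cannot be fully occupied; stop here
        occupied[name] = free
        remaining -= free
    return occupied, remaining


if __name__ == "__main__":
    exclusive, left = occupy_idle_switches({"a": 4, "b": 2, "c": 3}, pods=8)
    print(exclusive, left)
```

In this example the pods left over after the pass would fall through to the shared switch scheduling policy.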
Switch Affinity for Padding Jobs
Cross-switch scheduling is not allowed. Pods can be deployed only on a single switch. Preferentially select the nodes under the exclusive switch whose number of nodes is closest to the number of job pods, then the nodes under the shared switch, and finally the nodes under the idle switch.
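The selection rule for padding jobs can be sketched as a single-switch pick. The sketch assumes one pod per node, treats "number of nodes" as the number of free nodes, and applies the closest-fit tie-break to shared and idle switches as well; Switch and pick_switch_for_padding_job are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Switch:
    name: str
    kind: str        # "exclusive", "shared", or "idle"
    free_nodes: int  # assumed meaning of "number of nodes" in the text


# Lower value means higher preference: exclusive, then shared, then idle.
KIND_PRIORITY = {"exclusive": 0, "shared": 1, "idle": 2}


def pick_switch_for_padding_job(switches: List[Switch], pods: int) -> Optional[Switch]:
    """Pick one switch that can hold every pod, since cross-switch is not allowed."""
    candidates = [s for s in switches if s.free_nodes >= pods]
    if not candidates:
        return None  # no single switch can host the whole job
    # Prefer by switch kind first, then by the tightest fit to the pod count.
    return min(candidates, key=lambda s: (KIND_PRIORITY[s.kind], s.free_nodes - pods))


if __name__ == "__main__":
    pool = [
        Switch("e1", "exclusive", 8),
        Switch("e2", "exclusive", 3),
        Switch("s1", "shared", 2),
        Switch("i1", "idle", 2),
    ]
    print(pick_switch_for_padding_job(pool, pods=2))
```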
Rescheduling Upon Faults
When the node where a job is running or an Ascend AI processor is faulty, the job is rescheduled. Before rescheduling, pods on normal nodes are scheduled to their original nodes for training, and pods on faulty nodes are rescheduled to other nodes. Replacement nodes are selected preferentially from the exclusive switch used by the job before rescheduling, then from other nodes on the shared switch, and finally from nodes that are not used.
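The preference order for replacement nodes can be sketched as a three-tier ranking. This is an illustration under assumptions: "nodes that are not used" is read as nodes under idle switches, nodes under exclusive switches of other jobs are excluded, and Node and rank_replacement_nodes are hypothetical names.

```python
from typing import List, NamedTuple, Set


class Node(NamedTuple):
    name: str
    switch: str
    switch_kind: str  # "exclusive", "shared", or "idle"
    faulty: bool = False


def rank_replacement_nodes(nodes: List[Node], job_exclusive_switches: Set[str]) -> List[Node]:
    """Order healthy candidate nodes for pods that must leave faulty nodes."""

    def tier(node: Node) -> int:
        if node.switch_kind == "exclusive":
            return 0  # only the job's own exclusive switches reach this point
        if node.switch_kind == "shared":
            return 1
        return 2  # assumed: unused nodes under idle switches

    candidates = [
        n for n in nodes
        if not n.faulty
        # Assumption: exclusive switches held by other jobs are not candidates.
        and (n.switch_kind != "exclusive" or n.switch in job_exclusive_switches)
    ]
    # sorted() is stable, so input order is kept within each tier.
    return sorted(candidates, key=tier)


if __name__ == "__main__":
    pool = [
        Node("n1", "sw-idle", "idle"),
        Node("n2", "sw-shared", "shared"),
        Node("n3", "sw-a", "exclusive"),
        Node("n4", "sw-a", "exclusive", faulty=True),
        Node("n5", "sw-b", "exclusive"),
    ]
    print([n.name for n in rank_replacement_nodes(pool, {"sw-a"})])
```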