Single-Layer Switch Affinity Scheduling
Usage Notes
- Single-layer switch affinity scheduling supports only distributed inference jobs.
- The total number of job replicas cannot exceed the maximum number of nodes on a single switch.
- A job can be deployed on only one switch; its replicas cannot span multiple switches.
- If more than one switch meets the job requirements, nodes under the switch with fewer remaining nodes are selected preferentially.
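The selection rule in the notes above can be read as a best-fit choice. The following is a minimal sketch of that reading, not the scheduler's actual implementation. The function name `pick_switch`, the `free_nodes` mapping, and the interpretation of "remaining nodes" as the nodes currently free under a switch are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def pick_switch(
    free_nodes: Dict[str, List[str]], replicas: int
) -> Optional[Tuple[str, List[str]]]:
    """Pick one switch for the whole job (hypothetical helper).

    free_nodes maps a switch name to the nodes still free under it.
    Returns (switch, chosen nodes), or None if no single switch fits.
    """
    # A job must fit under one switch, so keep only switches that have
    # at least as many free nodes as the job has replicas.
    candidates = [
        (len(nodes), name)
        for name, nodes in free_nodes.items()
        if len(nodes) >= replicas
    ]
    if not candidates:
        return None
    # Prefer the switch with the fewest remaining nodes. Ties are broken
    # by switch name, which is an arbitrary choice made for this sketch.
    _, best = min(candidates)
    return best, sorted(free_nodes[best])[:replicas]


free = {"sw-a": ["a1", "a2", "a3", "a4"], "sw-b": ["b1", "b2"], "sw-c": ["c1"]}
choice = pick_switch(free, 2)
```

Under these assumptions, a two-replica job is placed on `sw-b` rather than `sw-a`: both have enough free nodes, but `sw-b` has fewer remaining, which leaves the larger switch available for larger jobs.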
Rescheduling Upon Faults
If the node on which a job runs, or an Ascend AI processor on that node, becomes faulty, the job is rescheduled. During rescheduling, pods that were running normally are scheduled back to their original nodes, and pods on faulty nodes are rescheduled to other nodes.
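The rescheduling behavior can be sketched as follows. This is an illustration only: the function `reschedule`, the `placement` mapping, and the `spare` list are hypothetical names, and the sketch assumes that replacement nodes come from the same switch, since a job may be deployed on only one switch.

```python
from typing import Dict, List, Set


def reschedule(
    placement: Dict[str, str], faulty: Set[str], spare: List[str]
) -> Dict[str, str]:
    """Return a new pod -> node placement after a fault (hypothetical helper).

    placement: current pod -> node mapping.
    faulty:    nodes reported as faulty.
    spare:     candidate replacement nodes (assumed to be under the same switch).
    """
    used = set(placement.values())
    # Only healthy nodes that are not already hosting a pod can be used.
    available = [n for n in spare if n not in faulty and n not in used]
    new_placement: Dict[str, str] = {}
    for pod, node in placement.items():
        if node not in faulty:
            # Pods that ran normally go back to their original node.
            new_placement[pod] = node
        elif available:
            # Pods on faulty nodes move to another node.
            new_placement[pod] = available.pop(0)
        else:
            raise RuntimeError(f"no healthy node available for {pod}")
    return new_placement


before = {"pod-0": "n1", "pod-1": "n2", "pod-2": "n3"}
after = reschedule(before, faulty={"n2"}, spare=["n4", "n5"])
```

In this example only `pod-1` changes node; `pod-0` and `pod-2` keep their original placement.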