Affinity Scheduling Description

Affinity scheduling maximizes the computing power of Ascend AI processors by reducing resource fragments and network congestion.

Resource fragments
After a job is deployed, the remaining Ascend AI processors should stay concentrated in units with better network connectivity (such as a node, a SuperPoD, or the nodes under a single switch). This prevents a later job from failing to schedule because the free processors are scattered, even though the total number of free Ascend AI processors is sufficient.
Network congestion
Ascend AI processors can be interconnected in multiple modes. The interconnection mode depends on the networking of each product, and different modes provide different network bandwidth. Select a scheduling policy that matches the interconnection mode of the Ascend AI processors to reduce network congestion.
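The resource fragment problem can be illustrated with a minimal sketch. It assumes a job that must fit inside a single unit (here, one node); the function name and the free-processor counts are hypothetical and are not part of the product.

```python
def can_place_on_single_node(free_per_node, requested):
    """Return True if at least one node alone has enough free processors."""
    return any(free >= requested for free in free_per_node)


# Hypothetical cluster: free Ascend AI processors left on three nodes.
scattered = [3, 3, 2]
# Same total, but concentrated on one node.
concentrated = [8, 0, 0]

total_scattered = sum(scattered)
total_concentrated = sum(concentrated)

# Both layouts have 8 free processors in total, but only the concentrated
# layout can host a job that needs 8 processors on one node.
fits_scattered = can_place_on_single_node(scattered, 8)
fits_concentrated = can_place_on_single_node(concentrated, 8)
```

In this sketch the scattered layout cannot host the job even though the total is sufficient, which is the failure that reducing resource fragments is meant to avoid.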

Ascend AI Processor-based Affinity Scheduling

Hardware products connect Ascend AI processors in three modes. A job is preferentially scheduled to Ascend AI processors on the same inference card or training card, then to Ascend AI processors interconnected through HCCS, and finally to Ascend AI processors interconnected through PCIe.

Huawei Cache Coherence System (HCCS) is the hardware form of Huawei Collective Communication Library (HCCL) that facilitates high-performance communication between servers in deep learning training scenarios.
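The scheduling priority among the three connection modes can be sketched as a simple ranking. This is an illustration only: the `Link` enum, the `pick_candidate` function, and the candidate names are hypothetical and do not reflect the scheduler's actual interfaces.

```python
from enum import IntEnum


class Link(IntEnum):
    """Connection modes ordered by scheduling priority (lower is preferred)."""
    SAME_CARD = 0  # processors on the same inference or training card
    HCCS = 1       # processors interconnected through HCCS
    PCIE = 2       # processors interconnected through PCIe


def pick_candidate(candidates):
    """Return the name of the candidate with the most preferred link.

    candidates is a list of (name, Link) pairs; ties keep the first entry.
    """
    return min(candidates, key=lambda c: c[1])[0]


choice = pick_candidate([
    ("group-a", Link.PCIE),
    ("group-b", Link.HCCS),
    ("group-c", Link.SAME_CARD),
])
```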

Figure 1 Ascend AI processor interconnection modes

Different hardware products may use one or more of the three interconnection modes. The following table describes scheduling policies in detail.

**Table 1** Ascend AI processor-based affinity scheduling
Product	Ascend AI Processor Interconnection Mode	Method to Reduce Network Congestion	Method to Reduce Resource Fragments
Atlas training product	Four Ascend AI processors are interconnected through HCCS, and Ascend AI processors in HCCS rings are interconnected through PCIe.	Allocate a job that requires four or fewer Ascend AI processors to one HCCS ring.	If the network statuses of two resources are the same, use the one with fewer resource fragments generated after scheduling.
Atlas 200T A2 Box16 heterogeneous subrack, Atlas 200I A2 Box16 heterogeneous subrack	Eight Ascend AI processors are interconnected through HCCS, and Ascend AI processors in HCCS rings are interconnected through PCIe.	Allocate a job that requires eight or fewer Ascend AI processors to one HCCS ring. Evenly allocate a job that requires more than eight Ascend AI processors across two rings.	If the network statuses of two resources are the same, use the one with fewer resource fragments generated after scheduling.
Atlas 900 A3 SuperPoD, A200T A3 Box8 SuperPoD Server, Atlas 800I A3 SuperPoD Server, Atlas 800T A3 SuperPoD Server	Two Ascend AI processors form eight HiAM modules through SIO, and each HiAM module is interconnected through HCCS.	If the number of Ascend AI processors is an even number, they must be scheduled to one HiAM module.	-
Atlas 800 inference server (model 3000) (with Atlas 300I inference cards)	Each inference card has four interconnected Ascend AI processors, but inference cards are not interconnected.	If fewer than four Ascend AI processors are requested and scheduling is performed by inference card, the job must be scheduled to one inference card.	If the network statuses of two resources are the same, use the one with fewer resource fragments generated after scheduling.
Atlas 800 inference server (model 3000) (with Atlas 300I Duo inference cards)	Two Ascend AI processors in each inference card are interconnected through HCCS, and inference cards are interconnected through PCIe.	In distributed inference scheduling, a job must be scheduled to the entire Atlas 300I Duo inference card. If a job requires an odd number of Ascend AI processors, it is preferentially scheduled to an Atlas 300I Duo inference card that has exactly one Ascend AI processor remaining.	If the network statuses of two resources are the same, use the one with fewer resource fragments generated after scheduling.
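The first row of Table 1 (a job with four or fewer Ascend AI processors goes to one HCCS ring, and ties are broken by fewer resource fragments) can be sketched as follows. The sketch assumes that "fewer resource fragments" can be approximated by a best-fit rule, that is, choosing the ring with the fewest processors left over after placement. That interpretation, the `choose_ring` function, and its parameters are assumptions made for illustration, not the scheduler's actual algorithm.

```python
def choose_ring(free_per_ring, requested, ring_size=4):
    """Pick the index of one HCCS ring for a job, or None if none fits.

    free_per_ring: free Ascend AI processors in each ring (hypothetical input).
    requested:     processors the job needs.
    ring_size:     processors per HCCS ring (4 for the Atlas training product).
    """
    if requested > ring_size:
        # This sketch only covers jobs that fit in a single ring.
        return None
    # Keep rings that can hold the whole job, paired with their leftover count.
    fitting = [
        (free - requested, index)
        for index, free in enumerate(free_per_ring)
        if free >= requested
    ]
    if not fitting:
        return None
    # Best fit: the ring with the fewest leftover processors wins.
    return min(fitting)[1]


# Ring 0 is empty (4 free), ring 1 is half used (2 free).
ring_for_two = choose_ring([4, 2], 2)    # fills ring 1, keeps ring 0 whole
ring_for_three = choose_ring([4, 2], 3)  # only ring 0 can hold 3 processors
```

Placing the two-processor job on the half-used ring leaves a full ring available for a later four-processor job.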

Node-based Affinity Scheduling

Nodes are connected through a RoCE network, or through an interconnect device plus a RoCE network. The interconnect device network is used preferentially during job scheduling. The RoCE network uses a spine-leaf architecture, and traffic is preferentially kept at the leaf layer. If spine switches must be used, traffic should be distributed evenly across the spine switches.

Products using RoCE connections: Atlas 800T A2 training server, Atlas 800I A2 inference server, A200I A2 Box heterogeneous component, Atlas 200T A2 Box16 heterogeneous subrack, Atlas 200I A2 Box16 heterogeneous subrack, Atlas 800 training server (model 9000), and Atlas 800 training server (model 9010)
Products that use single-layer RoCE connections: Atlas 800I A2 inference server and A200I A2 Box heterogeneous component
Products using UnifiedBus + RoCE connections: Atlas 900 A3 SuperPoD

Figure 2 Inter-node network

**Table 2** Inter-node affinity scheduling
Interconnection Mode	Ascend AI Processor Interconnection Mode	Scheduling Type	Method to Reduce Network Congestion	Method to Reduce Networking Costs	Method to Reduce Resource Fragments
RoCE dual-layer interconnection	Global two-layer interconnection through spine-leaf	Switch affinity scheduling 1.0	Preferentially use node resources on a leaf switch. When cross-leaf resources are used, ensure that the traffic to each spine switch is evenly distributed. Among multiple jobs on a leaf switch, only one job can use spine traffic, and the other jobs must be small-scale jobs confined to the leaf switch.	-	If the network statuses of two resources are the same, use the one with fewer resource fragments generated after scheduling.
RoCE dual-layer interconnection	Global two-layer interconnection through spine-leaf	Switch affinity scheduling 2.0	Preferentially use node resources on a leaf switch. When cross-leaf resources are used, ensure that the traffic to each spine switch is evenly distributed. Multiple jobs on a specific number of leaf switches are allowed to use spine traffic. Among multiple jobs on a leaf switch, only one job can use spine traffic, and the other jobs must be small-scale jobs confined to the leaf switch.	-
Single-layer RoCE connection	Single-layer connection through leaf	Single-layer switch affinity scheduling	-	Single-layer networking can meet the requirements of parameter plane interconnection, greatly reducing networking costs.
RoCE + interconnect device	Global interconnection through spine-leaf, forming multiple SuperPoDs through the interconnect device network.	Affinity scheduling of logical SuperPoDs	A network affinity unit with a high network communication requirement can be obtained based on the job splitting policy. Ensure that each network affinity unit is distributed in the interconnect device network.	-
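The leaf-first preference in Table 2 can be sketched as a two-step placement: try to fit the whole job under one leaf switch, and only spread across leaf switches when no single leaf has enough free nodes. This is a simplified illustration under stated assumptions. The `place_nodes` function, the leaf names, and the best-fit and fewest-leaves heuristics are hypothetical, and the sketch does not model spine traffic balancing or the per-leaf limit on jobs that use spine traffic.

```python
def place_nodes(free_nodes_per_leaf, requested):
    """Return {leaf_name: node_count} for a job, or None if it cannot fit.

    free_nodes_per_leaf: free nodes under each leaf switch (hypothetical input).
    requested:           number of nodes the job needs.
    """
    # Step 1: prefer a single leaf switch, choosing the tightest fit so that
    # larger free blocks stay available for later jobs.
    fitting = [
        (free - requested, leaf)
        for leaf, free in free_nodes_per_leaf.items()
        if free >= requested
    ]
    if fitting:
        return {min(fitting)[1]: requested}

    # Step 2: cross-leaf placement. Take from the leaves with the most free
    # nodes first so that the job spans as few leaf switches as possible.
    placement = {}
    remaining = requested
    ordered = sorted(free_nodes_per_leaf.items(), key=lambda kv: (-kv[1], kv[0]))
    for leaf, free in ordered:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[leaf] = take
            remaining -= take
    return placement if remaining == 0 else None


free = {"leaf-a": 4, "leaf-b": 2}
small_job = place_nodes(free, 2)  # fits under one leaf switch
large_job = place_nodes(free, 5)  # must cross leaf switches
```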
