Affinity Scheduling Description

Affinity scheduling maximizes the computing power of Ascend AI processors by reducing resource fragments and network congestion.

  • Resource fragments

    After a job is scheduled, the remaining Ascend AI processors should stay concentrated in units with better network connectivity (such as nodes, SuperPoDs, or nodes under a single switch). This prevents later jobs from failing to schedule because the free processors are scattered, even when the total number of free Ascend AI processors is sufficient.

  • Network congestion

    Ascend AI processors can be interconnected in multiple modes. The interconnection mode depends on the networking of each product, so the available network bandwidth differs. Select a scheduling policy that matches the interconnection mode of the Ascend AI processors to reduce network congestion.
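The resource fragment problem can be illustrated with a minimal sketch. The node names, the free-processor counts, and the `can_place` helper are hypothetical and are not part of any Ascend scheduler API. The sketch only shows that the same total number of free processors may or may not satisfy a job, depending on how they are distributed.

```python
from typing import Dict


def can_place(free_per_node: Dict[str, int], requested: int) -> bool:
    """Return True if at least one node can host the whole job by itself."""
    return any(free >= requested for free in free_per_node.values())


# Hypothetical clusters: both have 8 free processors in total.
scattered = {"node-a": 2, "node-b": 3, "node-c": 3}  # free processors are fragmented
compact = {"node-a": 0, "node-b": 0, "node-c": 8}    # free processors sit on one node

print(sum(scattered.values()), can_place(scattered, 8))  # 8 False
print(sum(compact.values()), can_place(compact, 8))      # 8 True
```

In the scattered case an 8-processor single-node job cannot be scheduled although 8 processors are free, which is the failure that affinity scheduling tries to avoid.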

Ascend AI Processor-based Affinity Scheduling

Hardware products use three processor interconnection modes. A job is preferentially scheduled to Ascend AI processors on the same inference card or training card, then to Ascend AI processors interconnected through HCCS, and finally to Ascend AI processors interconnected through PCIe.

Huawei Cache Coherence System (HCCS) is the hardware form of Huawei Collective Communication Library (HCCL) that facilitates high-performance communication between servers in deep learning training scenarios.

Figure 1 Ascend AI processor interconnection modes
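The priority order described above (same card first, then HCCS, then PCIe) can be sketched as a ranking over candidate processor sets. The `Processor` fields, the `Link` enum, and the `card`/`hccs_group` labels are assumptions made for illustration only. Real topology discovery is product-specific and is not shown.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Sequence


class Link(IntEnum):
    """Lower value means higher scheduling priority."""
    SAME_CARD = 0
    HCCS = 1
    PCIE = 2


@dataclass(frozen=True)
class Processor:
    name: str
    card: str        # hypothetical card identifier
    hccs_group: str  # hypothetical HCCS group identifier


def slowest_link(procs: Sequence[Processor]) -> Link:
    """Return the lowest-priority link that the processor set has to cross."""
    if len({p.card for p in procs}) == 1:
        return Link.SAME_CARD
    if len({p.hccs_group for p in procs}) == 1:
        return Link.HCCS
    return Link.PCIE


def pick(candidates: Sequence[Sequence[Processor]]) -> Sequence[Processor]:
    """Choose the candidate set whose slowest link has the highest priority."""
    return min(candidates, key=slowest_link)


a = Processor("npu0", card="card0", hccs_group="ring0")
b = Processor("npu1", card="card0", hccs_group="ring0")
c = Processor("npu2", card="card1", hccs_group="ring0")
d = Processor("npu4", card="card2", hccs_group="ring1")

best = pick([(a, d), (a, c), (a, b)])
print([p.name for p in best])  # ['npu0', 'npu1']
```

Ranking by the slowest link in the set is a design choice of this sketch: a job is limited by its weakest connection, so a set that stays on one card outranks a set that needs HCCS or PCIe.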

Different hardware products may use one or more of the three interconnection modes. The following table describes scheduling policies in detail.

Table 1 Ascend AI processor-based affinity scheduling

Atlas training product

  • Ascend AI processor interconnection mode: Four Ascend AI processors are interconnected through HCCS, and Ascend AI processors in different HCCS rings are interconnected through PCIe.
  • Method to reduce network congestion: Allocate a job that requires four or fewer Ascend AI processors to one HCCS ring.
  • Method to reduce resource fragments: If the network statuses of two resources are the same, use the one that leaves fewer resource fragments after scheduling.

Atlas 200T A2 Box16 heterogeneous subrack, Atlas 200I A2 Box16 heterogeneous subrack

  • Ascend AI processor interconnection mode: Eight Ascend AI processors are interconnected through HCCS, and Ascend AI processors in different HCCS rings are interconnected through PCIe.
  • Method to reduce network congestion:
      • Allocate a job that requires eight or fewer Ascend AI processors to one HCCS ring.
      • Evenly allocate a job that requires more than eight Ascend AI processors across two rings.
  • Method to reduce resource fragments: If the network statuses of two resources are the same, use the one that leaves fewer resource fragments after scheduling.

Atlas 900 A3 SuperPoD, A200T A3 Box8 SuperPoD Server, Atlas 800I A3 SuperPoD Server, Atlas 800T A3 SuperPoD Server

  • Ascend AI processor interconnection mode: Ascend AI processors are paired through SIO to form eight HiAM modules, and the HiAM modules are interconnected through HCCS.
  • Method to reduce network congestion: If the number of Ascend AI processors is an even number, they must be scheduled to one HiAM module.
  • Method to reduce resource fragments: -

Atlas 800 inference server (model 3000) (with Atlas 300I inference cards)

  • Ascend AI processor interconnection mode: Each inference card has four interconnected Ascend AI processors, but inference cards are not interconnected.
  • Method to reduce network congestion: If fewer than four Ascend AI processors are allocated and scheduling is performed by inference card, the job must be scheduled to one inference card.
  • Method to reduce resource fragments: If the network statuses of two resources are the same, use the one that leaves fewer resource fragments after scheduling.

Atlas 800 inference server (model 3000) (with Atlas 300I Duo inference cards)

  • Ascend AI processor interconnection mode: Two Ascend AI processors in each inference card are interconnected through HCCS, and inference cards are interconnected through PCIe.
  • Method to reduce network congestion: In distributed inference scheduling, a job must be scheduled to an entire Atlas 300I Duo inference card.
  • Method to reduce resource fragments:
      • If a job requires an odd number of Ascend AI processors, the job is preferentially scheduled to an Atlas 300I Duo inference card that has one Ascend AI processor remaining.
      • If the network statuses of two resources are the same, use the one that leaves fewer resource fragments after scheduling.
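The HCCS ring rules in Table 1 can be sketched as a small allocation function. This is an illustration under stated assumptions, not the scheduler's implementation: the `allocate` function is hypothetical, "fewer resource fragments" is approximated by a best-fit rule (choose the ring that is left with the fewest idle processors), and jobs larger than one ring are only handled when they split evenly across exactly two rings.

```python
from typing import List, Optional


def allocate(free_per_ring: List[int], requested: int, ring_size: int) -> Optional[List[int]]:
    """Return how many processors to take from each ring, or None if the job does not fit."""
    if requested <= ring_size:
        # Small job: keep it inside one HCCS ring.
        fits = [i for i, free in enumerate(free_per_ring) if free >= requested]
        if not fits:
            return None
        # Assumed tie-break: best fit, i.e. the ring left with the fewest idle processors.
        best = min(fits, key=lambda i: free_per_ring[i] - requested)
        return [requested if i == best else 0 for i in range(len(free_per_ring))]
    # Larger job (Box16-style rule): split evenly across exactly two rings.
    if len(free_per_ring) != 2 or requested % 2 != 0:
        return None
    half = requested // 2
    if min(free_per_ring) < half:
        return None
    return [half, half]


print(allocate([4, 1], 1, ring_size=4))   # [0, 1]
print(allocate([4, 2], 4, ring_size=4))   # [4, 0]
print(allocate([8, 8], 12, ring_size=8))  # [6, 6]
```

In the first call both rings can host the 1-processor job, and the sketch picks the ring with a single free processor so that the fully free ring stays available for a later 4-processor job.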

Node-based Affinity Scheduling

Nodes are connected through a RoCE network, or through an interconnect device network plus a RoCE network. The interconnect device network is preferred during job scheduling. The RoCE network uses a spine-leaf architecture, and network traffic is preferentially kept at the leaf layer. If spine switches must be used, ensure that traffic is evenly distributed across the spine switches.
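The leaf-first preference described above can be sketched as follows. The `pick_nodes` function, the leaf names, and the node names are hypothetical. The sketch keeps a job under one leaf switch when possible and otherwise spans as few leaf switches as it can. Balancing traffic across spine switches is outside the scope of this sketch.

```python
from typing import Dict, List, Optional


def pick_nodes(free_by_leaf: Dict[str, List[str]], requested: int) -> Optional[List[str]]:
    """Pick nodes for a job, preferring a single leaf switch."""
    # 1. Prefer one leaf switch that can host the whole job (best fit).
    fits = [leaf for leaf, nodes in free_by_leaf.items() if len(nodes) >= requested]
    if fits:
        leaf = min(fits, key=lambda name: len(free_by_leaf[name]))
        return free_by_leaf[leaf][:requested]
    # 2. Otherwise cross leaf switches, starting with the leaf that has the most free nodes.
    chosen: List[str] = []
    for leaf in sorted(free_by_leaf, key=lambda name: len(free_by_leaf[name]), reverse=True):
        chosen.extend(free_by_leaf[leaf][: requested - len(chosen)])
        if len(chosen) == requested:
            return chosen
    return None  # not enough free nodes in total


free = {"leaf-1": ["n1", "n2", "n3"], "leaf-2": ["n4", "n5"]}
print(pick_nodes(free, 2))  # ['n4', 'n5']
print(pick_nodes(free, 4))  # ['n1', 'n2', 'n3', 'n4']
print(pick_nodes(free, 6))  # None
```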

  • Products using RoCE connections: Atlas 800T A2 training server, Atlas 800I A2 inference server, A200I A2 Box heterogeneous component, Atlas 200T A2 Box16 heterogeneous subrack, Atlas 200I A2 Box16 heterogeneous subrack, Atlas 800 training server (model 9000), and Atlas 800 training server (model 9010)
  • Products that use single-layer RoCE connections: Atlas 800I A2 inference server and A200I A2 Box heterogeneous component
  • Products using UnifiedBus + RoCE connections: Atlas 900 A3 SuperPoD

Figure 2 Inter-node network

Table 2 Inter-node affinity scheduling

Switch affinity scheduling 1.0

  • Interconnection mode: RoCE dual-layer interconnection
  • Ascend AI processor interconnection mode: Global two-layer interconnection through spine-leaf
  • Method to reduce network congestion:
      • Preferentially use node resources on a single leaf switch.
      • When cross-leaf resources are used, ensure that the traffic to each spine switch is evenly distributed.
      • Among multiple jobs on a leaf switch, only one job can use spine traffic, and the other jobs are small-scale jobs on the leaf switch.
  • Method to reduce networking costs: -
  • Method to reduce resource fragments: If the network statuses of two resources are the same, use the one that leaves fewer resource fragments after scheduling.

Switch affinity scheduling 2.0

  • Interconnection mode: RoCE dual-layer interconnection
  • Ascend AI processor interconnection mode: Global two-layer interconnection through spine-leaf
  • Method to reduce network congestion:
      • Preferentially use node resources on a single leaf switch.
      • When cross-leaf resources are used, ensure that the traffic to each spine switch is evenly distributed.
      • Multiple jobs on a specific number of leaf switches are allowed to use spine traffic.
      • Among multiple jobs on a leaf switch, only one job can use spine traffic, and the other jobs are small-scale jobs on the leaf switch.
  • Method to reduce networking costs: -

Single-layer switch affinity scheduling

  • Interconnection mode: Single-layer RoCE connection
  • Ascend AI processor interconnection mode: Single-layer connection through leaf
  • Method to reduce network congestion: -
  • Method to reduce networking costs: Single-layer networking can meet the requirements of parameter plane interconnection, greatly reducing networking costs.

Affinity scheduling of logical SuperPoDs

  • Interconnection mode: RoCE + interconnect device
  • Ascend AI processor interconnection mode: Global interconnection through spine-leaf, forming multiple SuperPoDs through the interconnect device network
  • Method to reduce network congestion: Network affinity units with high communication requirements are derived from the job splitting policy. Ensure that each network affinity unit is placed within the interconnect device network.
  • Method to reduce networking costs: -
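The logical SuperPoD rule can be sketched as a two-step procedure: split the job into network affinity units, then place each unit inside a single SuperPoD. This is a sketch under several assumptions: units have a fixed size chosen by the caller, "distributed in the interconnect device network" is read as "placed entirely within one SuperPoD", and the function names and SuperPoD labels are hypothetical.

```python
from typing import Dict, List, Optional


def split_into_units(num_ranks: int, unit_size: int) -> List[List[int]]:
    """Split job ranks into fixed-size network affinity units."""
    if num_ranks % unit_size != 0:
        raise ValueError("num_ranks must be a multiple of unit_size")
    return [list(range(start, start + unit_size)) for start in range(0, num_ranks, unit_size)]


def place_units(units: List[List[int]], free_by_superpod: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Map each unit index to one SuperPoD, never splitting a unit across SuperPoDs."""
    remaining = dict(free_by_superpod)
    placement: Dict[int, str] = {}
    for index, unit in enumerate(units):
        fits = [name for name, free in remaining.items() if free >= len(unit)]
        if not fits:
            return None  # some unit would have to cross SuperPoDs
        # Assumed tie-break: best fit, to leave larger SuperPoDs free for later units.
        target = min(fits, key=lambda name: remaining[name])
        remaining[target] -= len(unit)
        placement[index] = target
    return placement


units = split_into_units(16, 8)
print(place_units(units, {"superpod-a": 8, "superpod-b": 12}))  # {0: 'superpod-a', 1: 'superpod-b'}
print(place_units(units, {"superpod-a": 6, "superpod-b": 12}))  # None
```

The second call returns None even though 18 processors are free in total: only one SuperPoD can hold a whole 8-processor unit, and after the first unit is placed there it has 4 processors left.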