Node Scoring

Description

Scores all preselected nodes according to the affinity policy so that the scheduler can select the most suitable node.

For example, a pod requires one Ascend AI processor, and there are two nodes A and B that meet the job requirements. One HCCL ring of node A has one Ascend AI processor, and the two rings of node B have two and three Ascend AI processors, respectively. According to the affinity policy, node A obtains a higher score.

Implementation Details

For details about the code implementation, see ScoreBestNPUNodes in the open-source code. In the code, getNodeBestScore determines the node priority based on affinity principles. When selecting a node, the scheduler first checks whether switch affinity scheduling or logical SuperPoD affinity scheduling is configured. If neither is configured, the common node selection principle is used.

Common Node Selection Principle

A two-dimensional array, for example affScoreList[i][j], indicates how poorly a node fits a job pod. As the following example shows, a lower value means a better fit. Here, i is the number of processors required by one job pod minus 1, and j is the number of available processors on the node minus 1.

For example, if one pod requires six processors (i = 5), nodes with one to five available processors (j = 0 to 4) do not meet the scheduling requirements, so affScoreList[i][j] is set to 8. A node with six available processors (j = 5) exactly meets the scheduling requirements and leaves no resource fragments, so affScoreList[i][j] is set to 0. For nodes with seven or eight available processors (j = 6 or 7), affScoreList[i][j] is set to 1 or 2, respectively, to minimize resource fragments. The following can therefore be deduced:

affScoreList[5] = []int{8,8,8,8,8,0,1,2}

Similarly, for a pod that requires four processors:

affScoreList[3] = []int{8,8,8,0,1,2,3,4}

The overall logic stays the same in cases where a product has a different total number of processors, an HCCS ring exists, or the two-dimensional array is fine-tuned.

Optimization Principles for Switch Affinity Scheduling

The cluster scheduling components obtain the node-to-switch mapping for the entire cluster from the basic-tor configuration file, and obtain the node resources of all idle switches on the spine network from the processor usage information reported by Ascend Device Plugin. An idle switch on the spine network is a switch that has no job, or has only a padding job that does not use the spine network.

Idle switch resources can be divided into two two-dimensional arrays: one based on the nodes connected to each leaf switch, and one based on the relative positions of nodes under each leaf switch. (Nodes at different locations on different leaf switches may form a logical switch with network affinity.) Both two-dimensional arrays are sorted in descending order of the number of remaining nodes. The two division methods are as follows:

  • Method 1: division based on nodes under the leaf switch, for example, [node1,node2,node3,node4].
  • Method 2: division based on relative positions of nodes under the leaf switch, for example, [node1,node5,node9,node13,node17,node21].
Figure 1 Two-dimensional array division
Table 1 Node selection principles

Job type: Padding job
Description: This job can be delivered to only one switch.
Node selection principle: A switch that meets the job deployment requirements is selected starting from the end of a two-dimensional array. If no switch meets the requirements after the two-dimensional arrays are traversed, the job waits.

Job type: Foundation model job
Description: This job can run across switches and must meet the switch affinity requirements.
Node selection principle: Complete switch resources are selected starting from the beginning of the two-dimensional array. If the resources are sufficient, scheduling succeeds. If the resources are insufficient, the following applies:

  • For switch affinity 1.0, the remaining switch resources are divided into two-dimensional arrays based on the relative positions of nodes under the leaf switch. Elements in the arrays are selected one by one until the resources meet the job requirements or all elements in an array are selected.
  • For switch affinity 2.0, elements in the arrays are selected one by one until the resources meet the job requirements or all elements in an array are selected. If resources are still insufficient, nodes under non-idle switches on the spine network are selected. One job can use a maximum of two non-idle switches on the spine network.

Job type: Common job
Description: This job meets the switch affinity requirements where possible. When resources are insufficient, it can be scheduled randomly.
Node selection principle: The first part of the scheduling logic is the same as that of a foundation model job. The only difference is that the remaining nodes can be used randomly when the resources of the logical switch are insufficient.

Affinity Scheduling of Logical SuperPoDs

  1. The remaining SuperPoDs are distributed into three queues based on the logical SuperPoD size. Queue 1 holds SuperPoDs whose number of available nodes is greater than or equal to the logical SuperPoD size plus the number of reserved nodes. Queue 2 holds SuperPoDs whose number of available nodes is greater than or equal to the logical SuperPoD size but less than the logical SuperPoD size plus the number of reserved nodes. Queue 3 holds SuperPoDs whose number of available nodes is less than the logical SuperPoD size.
  2. Queue 1 is used first and is split into a three-dimensional array. Assume that the logical SuperPoD size is 16, the number of required logical SuperPoDs is 2, and the number of reserved nodes is 2. SuperPoDs are first arranged into two-dimensional arrays based on their available nodes. Within each two-dimensional array, SuperPoDs with the same number of available nodes are placed together, so the entire structure forms a three-dimensional array. Figure 2 shows the resulting SuperPoD selection sequence. Specifically, a SuperPoD with 18 available nodes is preferred, followed by those with 26, 19, 27, and 33 available nodes. If no SuperPoD meets the requirement, a SuperPoD with 34, 35, 36, 37, ... 46, 47, or 48 available nodes is selected.
    Figure 2 SuperPoD selection sequence
  3. If the resources are still insufficient, the resources of queue 2 are used. Queue 2 is sorted in descending order of the number of remaining nodes, and SuperPoDs are selected starting from the first entry.
  4. If the resources are still insufficient and the SuperPoD affinity scheduling policy is Soft (non-forced affinity), the resources of queue 3 are used. Queue 3 is sorted in descending order of the number of remaining nodes, and SuperPoDs are selected starting from the first entry.