Affinity Scheduling Policies

Table 1 describes the affinity policies, which are based on the features and resource utilization rules of the Atlas 200T A2 Box16 heterogeneous subrack and the Atlas 200I A2 Box16 heterogeneous subrack.

Table 1 Affinity policies

Priority

Policy Name

Policy Description

1

HCCS interconnection-based allocation

If 1 to 8 Ascend AI Processors are to be allocated, all of them must be scheduled to the same HCCS ring for interconnection. If 10, 12, or 14 Ascend AI Processors are to be allocated, they must be evenly allocated to the two rings, and their physical addresses in the two rings must be the same.
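The counting part of this rule can be sketched as follows. This is only an illustration, not the scheduler's actual code: the function name `split_across_rings` and the constant `RING_SIZE = 8` are assumptions (the ring size is inferred from the 1-to-8 range and the two rings mentioned above). The physical-address constraint is not modeled, and counts that the rule does not mention are reported as not covered.

```python
from typing import Optional, Tuple

# Assumption: each of the two HCCS rings holds 8 processors.
RING_SIZE = 8


def split_across_rings(requested: int) -> Optional[Tuple[int, int]]:
    """Return how many processors to take from each ring.

    Returns None when the rule as stated does not cover the request.
    """
    if 1 <= requested <= RING_SIZE:
        # 1 to 8 processors: all in the same HCCS ring.
        return (requested, 0)
    if requested in (10, 12, 14):
        # 10, 12, or 14 processors: split evenly across the two rings.
        half = requested // 2
        return (half, half)
    # Any other count is not described by the rule above.
    return None


if __name__ == "__main__":
    for count in (4, 8, 12, 9):
        print(count, split_across_rings(count))
```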

2

Full priority scheduling

Nodes on which Ascend AI Processors have already been allocated are preferred, to reduce resource fragmentation.

The following cases assume that 1, 2, 4, or 8 processors are to be allocated.

  • If one Ascend AI Processor needs to be allocated, first choose a node with exactly one Ascend AI Processor available for HCCS interconnection. If there is none, choose a node with two, three, ..., up to eight available processors, in that order. If several nodes have the same number of available Ascend AI Processors, prefer the node with the smaller number of Ascend AI Processors.
  • If two Ascend AI Processors need to be allocated, first choose a node with exactly two Ascend AI Processors available for HCCS interconnection. If there is none, choose a node with three, four, ..., up to eight available processors, in that order. If several nodes have the same number of available Ascend AI Processors, prefer the node with the smaller number of Ascend AI Processors.
  • If four Ascend AI Processors need to be allocated, first choose a node with exactly four Ascend AI Processors available for HCCS interconnection. If there is none, choose a node with five, six, ..., up to eight available processors, in that order. If several nodes have the same number of available Ascend AI Processors, prefer the node with the smaller number of Ascend AI Processors.
  • If eight Ascend AI Processors need to be allocated, choose only a node with eight Ascend AI Processors available for HCCS interconnection. If several nodes have the same number of available Ascend AI Processors, prefer the node with the smaller number of Ascend AI Processors.
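The selection order in the list above can be sketched as a sort key. This is a minimal sketch, not Volcano's implementation: the `Node` type, its fields, and `pick_node` are hypothetical names. It assumes that `available` counts the processors free for HCCS interconnection on a node and that the tie-breaker "smaller number of Ascend AI Processors" refers to the node's total processor count, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    available: int  # processors free for HCCS interconnection (assumption)
    total: int      # total processors on the node, used as tie-breaker (assumption)


def pick_node(nodes: List[Node], requested: int) -> Optional[Node]:
    """Pick the node with the fewest available processors that still fits."""
    candidates = [n for n in nodes if n.available >= requested]
    if not candidates:
        return None
    # Exact fit first, then requested+1, requested+2, ...; ties go to the
    # node with fewer processors, then to the name for determinism.
    return min(candidates, key=lambda n: (n.available, n.total, n.name))


if __name__ == "__main__":
    cluster = [
        Node("node-a", available=8, total=16),
        Node("node-b", available=3, total=16),
        Node("node-c", available=2, total=8),
    ]
    print(pick_node(cluster, 2).name)  # node-c: exact fit
    print(pick_node(cluster, 4).name)  # node-a: only node with at least 4 free
```

Choosing the tightest fit keeps nodes with many free processors intact for larger requests, which is the fragment-reduction goal the policy describes.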
NOTE:

When a distributed job is delivered, the job may not fully occupy a node as the full priority scheduling policy requires. The details are as follows:

  • Symptom: For example, in a cluster with two Atlas 200T A2 Box16 heterogeneous subracks or Atlas 200I A2 Box16 heterogeneous subracks, if five-processor, four-processor, and three-processor jobs are delivered at the same time, the four-processor and three-processor jobs are scheduled to the same node, and the five-processor job is scheduled to another node.
  • Cause analysis: After Volcano schedules a job, Ascend Device Plugin reports the topology of the scheduled Ascend AI Processors to mindx-dl-deviceinfo-${node_name} only after a delay. As a result, Volcano fails to verify the number of Ascend AI Processors on the node and schedules the job to another node.