Affinity Scheduling Policies

Table 1 describes the characteristics and resource utilization rules of the Ascend AI Processors in Atlas training products.

Table 1 Affinity policies of Atlas training products

Priority

Policy Name

Details

1

HCCS affinity scheduling

Ascend AI Processors are selected from within a single HCCS to improve communication performance.

  • If one Ascend AI Processor needs to be allocated, it is selected from a single HCCS. The best choice is a node whose HCCS has one available Ascend AI Processor, followed by three, then two, and lastly four.
  • If two Ascend AI Processors need to be allocated, they are selected from a single HCCS. The best choice is a node whose HCCS has two available Ascend AI Processors, followed by four, and lastly three.
  • If four Ascend AI Processors need to be allocated, they are selected from a single HCCS. This requires a node whose HCCS has four available Ascend AI Processors.
  • If eight Ascend AI Processors need to be allocated, all eight Ascend AI Processors of the node are selected.
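The preference order in the list above can be read as a small ranking table. The following is a minimal sketch for illustration only, not the scheduler's actual implementation; the names `PREFERENCE`, `hccs_rank`, and `pick_hccs` are hypothetical. The eight-processor case is left out because it always takes the whole node.

```python
# Preference order of available-processor counts within one HCCS,
# keyed by the number of processors requested (copied from the list above).
PREFERENCE = {
    1: (1, 3, 2, 4),
    2: (2, 4, 3),
    4: (4,),
}


def hccs_rank(requested, available):
    """Return the rank (0 = best) of an HCCS that has `available` free
    processors for a request of `requested` processors, or None if the
    HCCS cannot host the request under the rules above."""
    order = PREFERENCE.get(requested)
    if order is None or available not in order:
        return None
    return order.index(available)


def pick_hccs(requested, free_by_hccs):
    """Pick the best HCCS from a mapping {hccs_name: free_processor_count}.

    Ties are broken by name here; that tie-break is an assumption made
    only to keep the sketch deterministic."""
    ranked = []
    for name, free in free_by_hccs.items():
        rank = hccs_rank(requested, free)
        if rank is not None:
            ranked.append((rank, name))
    return min(ranked)[1] if ranked else None


if __name__ == "__main__":
    # For a one-processor request, an HCCS with three free processors
    # is preferred over ones with two or four.
    print(pick_hccs(1, {"hccs-a": 4, "hccs-b": 3, "hccs-c": 2}))
```

Keeping the order as data rather than as nested conditions makes it easy to compare the sketch line by line against the list.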

2

Full priority scheduling

Nodes on which Ascend AI Processors have already been allocated are scheduled preferentially to reduce resource fragmentation.
  • If one Ascend AI Processor needs to be allocated, choose a node whose resource capacity is eight and whose number of available Ascend AI Processors in the HCCS is, in order of preference, one, three, two, or four.
  • If two Ascend AI Processors need to be allocated, choose a node whose resource capacity is eight and whose number of available Ascend AI Processors in the HCCS is, in order of preference, two, four, or three.
  • If four Ascend AI Processors need to be allocated, choose a node whose resource capacity is eight and whose number of available Ascend AI Processors in the HCCS is four.
  • If the number of Ascend AI Processors to be allocated is a multiple of eight, select nodes whose resource capacity is eight and on which no Ascend AI Processor is in use.
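The node-level rules above can be sketched as follows. This is an illustrative sketch under stated assumptions, not Volcano's actual code: the `Node` type, the `select_nodes` function, and the modelling of a node as a tuple of per-HCCS free counts are all hypothetical. Request sizes that the list does not cover return an empty result.

```python
from dataclasses import dataclass

# Preferred numbers of available processors in an HCCS, best first,
# keyed by request size (taken from the list above).
PREFERENCE = {1: (1, 3, 2, 4), 2: (2, 4, 3), 4: (4,)}


@dataclass
class Node:
    name: str
    capacity: int          # total Ascend AI Processors on the node
    free_per_hccs: tuple   # assumption: free processors in each HCCS


def select_nodes(requested, nodes):
    """Return the names of the nodes chosen for a request, or [] if the
    rules above cannot be satisfied."""
    eligible = [n for n in nodes if n.capacity == 8]

    # Multiples of eight: take whole nodes on which nothing is in use.
    if requested > 0 and requested % 8 == 0:
        idle = [n for n in eligible if sum(n.free_per_hccs) == n.capacity]
        needed = requested // 8
        return [n.name for n in idle[:needed]] if len(idle) >= needed else []

    order = PREFERENCE.get(requested)
    if order is None:
        return []  # request size not covered by the list above

    # One, two, or four processors: pick the node whose HCCS has the
    # most preferred number of available processors.
    best = None
    for node in eligible:
        for free in node.free_per_hccs:
            if free in order:
                rank = order.index(free)
                if best is None or rank < best[0]:
                    best = (rank, node.name)
    return [best[1]] if best else []


if __name__ == "__main__":
    cluster = [
        Node("node-a", 8, (4, 4)),
        Node("node-b", 8, (1, 0)),
        Node("node-c", 8, (4, 4)),
    ]
    print(select_nodes(1, cluster))   # partly used node-b is preferred
    print(select_nodes(16, cluster))  # two completely idle nodes
```

In this sketch a one-processor request lands on the partly used node rather than on an idle one, which is the fragmentation-reducing behavior the policy describes.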
NOTE:
When a distributed job is delivered, the job may not fully occupy a node as the full priority scheduling policy requires. Details:
  • Symptom: For example, in a cluster with two Atlas 800 training servers (model 9000), if a three-processor job, a four-processor job, and a one-processor job are delivered at the same time, the three-processor and four-processor jobs are scheduled to the same node, and the one-processor job is scheduled to the other node, even though the first node still has one Ascend AI Processor available.
  • Cause analysis: After Volcano schedules a job, Ascend Device Plugin reports the topology of the scheduled Ascend AI Processors to mindx-dl-deviceinfo-${node_name} with a delay. As a result, Volcano cannot verify the number of Ascend AI Processors on the node, and the job is scheduled to another node.

3

Even number priority scheduling

An HCCS that meets policies 1 and 2 is selected preferentially. Among such HCCSs, the one whose number of remaining Ascend AI Processors is even is then preferred.