Affinity Policies in Distributed Scenarios

Distributed Affinity Policies of Atlas Training Products

For a distributed training job, each node can allocate 1, 2, 4, or 8 Ascend AI Processors, and the tasks of the job must be scheduled to different nodes.

  • In versions earlier than MindCluster 5.0.RC1, each node in a distributed training job can allocate only 8 Ascend AI Processors because of restrictions in the underlying layer.

  • In MindCluster 5.0.RC1 and later versions, each node in a distributed training job can allocate 1, 2, 4, or 8 Ascend AI Processors. For details about the affinity policy for a single node, see Affinity Policies in Single-Device Scenarios.
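
The version-dependent rule above can be sketched as a small validity check. This is an illustration only: the function names and the pre_5_0_rc1 flag are hypothetical and are not part of MindCluster.

```python
def allowed_per_node_counts(pre_5_0_rc1: bool) -> frozenset:
    """Per-node processor counts a distributed training job may request.

    pre_5_0_rc1 is a hypothetical flag: True for versions earlier than
    MindCluster 5.0.RC1, False for 5.0.RC1 and later.
    """
    if pre_5_0_rc1:
        # Earlier versions: only a full node of eight processors.
        return frozenset({8})
    # 5.0.RC1 and later: 1, 2, 4, or 8 processors per node.
    return frozenset({1, 2, 4, 8})


def is_valid_request(count: int, pre_5_0_rc1: bool = False) -> bool:
    """Return True if `count` processors per node is an allowed request."""
    return count in allowed_per_node_counts(pre_5_0_rc1)
```

For example, is_valid_request(4) holds for 5.0.RC1 and later, while is_valid_request(4, pre_5_0_rc1=True) does not.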

Distributed Affinity Policies of Atlas 200I/200T A2 Box16 Heterogeneous Subrack

  • For distributed jobs running on the Atlas 200T A2 Box16 heterogeneous subrack and Atlas 200I A2 Box16 heterogeneous subrack, each node can allocate 1 to 8, 10, 12, 14, or 16 Ascend AI Processors.
  • If the number of Ascend AI Processors allocated to a training job is less than or equal to 8, select Ascend AI Processors in the same HCCS ring.
  • If the number of Ascend AI Processors allocated to a training job is 10, 12, or 14, split the required Ascend AI Processors evenly across the two HCCS rings. The relative physical addresses of the processors selected in each ring can differ.
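
The ring rules above can be sketched as a function that returns how many processors to take from each HCCS ring. This is a minimal illustration under stated assumptions: the names ring_split and RING_SIZE are hypothetical, the node is assumed to have two rings of 8 processors each, and a request for 16 (not covered by the bullets) is assumed to take both rings in full.

```python
RING_SIZE = 8  # assumption: two HCCS rings of 8 processors on a 16-processor node
ALLOWED_COUNTS = set(range(1, 9)) | {10, 12, 14, 16}


def ring_split(count: int) -> tuple[int, int]:
    """Return (processors from one ring, processors from the other ring)."""
    if count not in ALLOWED_COUNTS:
        raise ValueError(f"unsupported processor count: {count}")
    if count <= RING_SIZE:
        # 8 or fewer: keep every processor in the same HCCS ring.
        return (count, 0)
    # 10, 12, or 14: split evenly across the two rings.
    # 16 (assumption): the whole node, which is also an even split.
    half = count // 2
    return (half, half)
```

Under these assumptions, ring_split(6) gives (6, 0) and ring_split(12) gives (6, 6). The sketch only decides the per-ring quantity; it does not pick which physical processors are used.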

Distributed Affinity Policies of Atlas 900 A3 SuperPoD

  • For a logical SuperPoD affinity task (the sp-block field in the YAML file is set to the logical SuperPoD size), the number of Ascend AI Processors that can be allocated is 16.
  • If distributed scheduling of more than 16 processors is used, set the huawei.com/schedule_policy field in the YAML file to chip2-node16. The affinity policy is the same as that of Atlas 800T A3 SuperPoD Server. When multiple job pods are scheduled to a single node, collective communication between pods is not supported.

Distributed Affinity Policies of A200T A3 Box8 SuperPoD Server, Atlas 800I A3 SuperPoD Server, and Atlas 800T A3 SuperPoD Server

The number of Ascend AI Processors allocated to a job can be 2, 4, 6, 8, 10, 12, 14, or 16. When multiple job pods are scheduled to a single node, collective communication between pods is not supported.
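
The allowed counts are the even numbers from 2 to 16, which can be checked as follows. The names below are hypothetical and serve only as an illustration.

```python
# Even counts from 2 to 16 inclusive, as listed above.
A3_ALLOWED_COUNTS = frozenset(range(2, 17, 2))


def is_valid_a3_count(count: int) -> bool:
    """Return True if `count` is an allowed allocation on these A3 servers."""
    return count in A3_ALLOWED_COUNTS
```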

Distributed Affinity Policy for an Inference Server (with Atlas 300I Inference Cards)

  • The number of Ascend AI Processors allocated to an inference job cannot be greater than the total number of Ascend AI Processors on a node.
  • If the number of Ascend AI Processors allocated to an inference job is less than or equal to 4, the inference job needs to be scheduled to one Atlas 300I inference card.

Distributed Affinity Policy for an Inference Server (with Atlas 300I Duo Inference Cards)

  • The number of Ascend AI Processors allocated to an inference job cannot be greater than the total number of Ascend AI Processors on a node.
  • If the number of Ascend AI Processors allocated to an inference job is less than or equal to 2, the inference job needs to be scheduled to one Atlas 300I Duo inference card.
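
The inference rules for Atlas 300I and Atlas 300I Duo cards follow the same pattern and differ only in the single-card threshold (4 and 2, respectively). The following sketch illustrates that pattern. The function name, its parameters, and the returned labels are hypothetical, and the total processor count of a node is supplied by the caller because it depends on the server configuration.

```python
def classify_inference_request(requested: int, node_total: int,
                               single_card_limit: int) -> str:
    """Classify an inference request against the two rules above.

    single_card_limit: 4 for Atlas 300I cards, 2 for Atlas 300I Duo cards.
    node_total: total processors on the node (supplied by the caller).
    """
    if requested < 1:
        raise ValueError("at least one processor must be requested")
    if requested > node_total:
        # Rule 1: a job cannot request more processors than the node has.
        raise ValueError("request exceeds the processors on the node")
    if requested <= single_card_limit:
        # Rule 2: small requests must be placed on one inference card.
        return "single-card"
    return "multi-card"
```

For example, a request for 3 processors is classified as "single-card" with a limit of 4 (Atlas 300I) but as "multi-card" with a limit of 2 (Atlas 300I Duo).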