Solution Description

MindCluster provides two versions of switch affinity scheduling to solve downlink traffic conflicts of spine switches in the spine + leaf network architecture. To reduce networking costs, MindCluster supports switch affinity scheduling in single-layer networking. To maximize the use of the UnifiedBus network with higher bandwidth, MindCluster provides affinity scheduling for logical SuperPoDs. In the switch affinity scenario, a leaf switch has multiple nodes, and the system selects the most appropriate node and allocates it to a training job based on the configured switch affinity rules.

  • Switch affinity scheduling 1.0

    Volcano performs affinity scheduling so that training traffic does not cause downlink traffic conflicts on the spine switches. This feature is supported by the Atlas training product and the Atlas A2 training product, with the PyTorch and MindSpore frameworks.

  • Switch affinity scheduling 2.0

    This version uses the Volcano + iMaster NCE-Fabric solution. iMaster NCE-Fabric dynamically configures the network connections for training job communication, so the scheduler no longer has to resolve downlink traffic conflicts on spine switches. In addition, nodes under a switch can be used by multiple cross-switch jobs, which improves cluster resource utilization. The supported product is the Atlas A2 training product, and the supported framework is PyTorch.

  • Single-layer switch affinity scheduling

    Single-layer (leaf-only) networking is supported by the Atlas 800I A2 inference server and the A200I A2 Box heterogeneous component. Single-layer switch affinity scheduling selects the most appropriate nodes for distributed inference jobs.

  • Affinity scheduling of logical SuperPoDs

    This feature is available on the Atlas 900 A3 SuperPoD. When a training job is delivered, the cluster scheduling components divide a physical SuperPoD into several logical SuperPoDs based on the splitting policy, for affinity scheduling of training products.

  • Currently, switch affinity scheduling for training and inference jobs supports only entire NPUs. Static and dynamic vNPU scheduling are not supported.
  • Before using switch affinity scheduling 2.0, learn about the principles and operation guide of the parameter plane networking.

Process of Switch Affinity Scheduling 1.0

Figure 1 describes the scheduling logic of switch affinity scheduling 1.0.

Figure 1 Scheduling workflow

Process description:

  1. Volcano reads the basic-tor-node-cm file to obtain the cluster topology information and prepare for scheduling.
  2. A user delivers a training job from a deep learning platform or CLI.
  3. Volcano schedules the job pod to a proper compute node based on the information obtained from basic-tor-node-cm, and writes the switch status of the current node during pod scheduling into the annotation of the job pod.
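
The steps above can be illustrated with a toy placement routine. This is only a sketch under stated assumptions: the layout of basic-tor-node-cm and the actual affinity rules applied by Volcano are not described here, so the topology dictionary, the function name pick_nodes, and the "fill the leaf switch with the most free nodes first" rule are all hypothetical. The returned mapping pairs each chosen node with its leaf switch, which stands in for the switch status recorded on the pod.

```python
from typing import Dict, List, Set


def pick_nodes(topology: Dict[str, List[str]], busy: Set[str], replicas: int) -> Dict[str, str]:
    """Pick `replicas` free nodes, preferring leaf switches with the most free nodes.

    `topology` maps a leaf switch name to the nodes under it (hypothetical layout).
    Returns a mapping of chosen node -> leaf switch.
    """
    # Free nodes per leaf switch, keeping the original node order.
    free = {sw: [n for n in nodes if n not in busy] for sw, nodes in topology.items()}
    chosen: Dict[str, str] = {}
    # Visit switches with the most free nodes first; break ties by name.
    for sw in sorted(free, key=lambda s: (-len(free[s]), s)):
        for node in free[sw]:
            if len(chosen) == replicas:
                return chosen
            chosen[node] = sw
    if len(chosen) < replicas:
        raise RuntimeError("not enough free nodes for the job")
    return chosen
```

Filling the largest group of free nodes first keeps a job under as few leaf switches as possible, which is one simple way to limit traffic crossing the spine layer.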

Process of Switch Affinity Scheduling 2.0

Figure 2 describes the scheduling logic of switch affinity scheduling 2.0.

Figure 2 Scheduling workflow

Process description:

  1. Volcano reads the basic-tor-node-cm file to obtain the cluster topology information, and reads the annotations of all job pods in the cluster to obtain the status of each switch, in preparation for job scheduling.
  2. A user delivers a training job from a deep learning platform or CLI.
  3. Volcano schedules the job pod to a proper compute node based on the information obtained from basic-tor-node-cm, and writes the switch status of the current node during pod scheduling into the annotation of the job pod.
  4. ClusterD uses the informer mechanism to detect that the job is scheduled to a proper compute node and collects information about all pods of the job.
  5. ClusterD writes the job information to the job-summary-<JobName> ConfigMap.
  6. iMaster NCE-Fabric reads the job information from the job-summary-<JobName> ConfigMap and dynamically configures the network connections for training job communication.
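
Steps 4 to 6 amount to a hand-off through a per-job ConfigMap. The sketch below models that hand-off with plain dictionaries. Only the job-summary-<JobName> naming pattern comes from the description above; the pod fields (name, node, switch), the "placement" key, and the function names are hypothetical and do not reflect the real ConfigMap schema.

```python
import json
from typing import Dict, List


def job_summary_name(job_name: str) -> str:
    # Naming pattern from the process description: job-summary-<JobName>
    return f"job-summary-{job_name}"


def build_job_summary(job_name: str, pods: List[Dict[str, str]]) -> Dict[str, dict]:
    """Collect per-pod placement into one ConfigMap-like dict (hypothetical layout)."""
    placement = [
        {"pod": p["name"], "node": p["node"], "switch": p["switch"]}
        for p in sorted(pods, key=lambda p: p["name"])
    ]
    return {
        "metadata": {"name": job_summary_name(job_name)},
        "data": {"placement": json.dumps(placement)},
    }


def switches_used(config_map: Dict[str, dict]) -> List[str]:
    """Derive the distinct switches a job touches, as a network controller might."""
    placement = json.loads(config_map["data"]["placement"])
    return sorted({entry["switch"] for entry in placement})
```

Keeping the summary in a single per-job object means the consumer only needs the job name to find everything it has to configure.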

Job Description

Switch affinity scheduling selects different scheduling policies for different job types. The job type is determined by the value of the tor-affinity field in the YAML file of the delivered training job, together with the number of job replicas. Different job types have different requirements on the number of job replicas.

Table 1 Job type description

Type                  Tag                 Number of Job Replicas
Common job            normal-schema       No limit
Foundation model job  large-model-schema  ≥ 4
Padding job           large-model-schema  < 4
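
The mapping in Table 1 can be expressed as a small lookup. The following is a minimal sketch, not part of MindCluster: the function name classify_job and the returned type strings are made up for illustration, while the tag values and the threshold of 4 replicas come from the table.

```python
from typing import Optional

LARGE_MODEL_MIN_REPLICAS = 4  # threshold from Table 1


def classify_job(tor_affinity: Optional[str], replicas: int) -> str:
    """Return the Table 1 job type for a tor-affinity tag and a replica count."""
    if replicas < 1:
        raise ValueError("replicas must be a positive integer")
    if tor_affinity == "normal-schema":
        # Common jobs have no limit on the number of replicas.
        return "common job"
    if tor_affinity == "large-model-schema":
        # The same tag covers two job types, separated by the replica count.
        if replicas >= LARGE_MODEL_MIN_REPLICAS:
            return "foundation model job"
        return "padding job"
    raise ValueError(f"unsupported tor-affinity value: {tor_affinity!r}")
```

For example, classify_job("large-model-schema", 3) yields "padding job", while the same tag with 4 or more replicas yields "foundation model job".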