(Optional) Switch Affinity Scheduling of Volcano

Volcano supports switch affinity scheduling. To use this feature, upload the mapping between switches and server nodes so that Volcano can read it.

Currently, switch affinity scheduling is supported only for training and inference jobs that use entire NPUs. Static and dynamic vNPU scheduling are not supported.

Procedure

  1. Prepare the network design LLD document of the deployment environment and upload it to any directory (for example, /home/tor-affinity) on the Kubernetes management node.

    The LLD file name must be lld.xlsx.

  2. Obtain the LLD document parsing script.

    Go to the mindcluster-deploy repository and select the branch that matches your version, as listed in mindcluster-deploy Version Description. Download the lld_to_cm.py file from the samples/utils directory and upload it to the directory on the management node used in Step 1.

  3. Run the lld_to_cm.py script.
    python ./lld_to_cm.py --num 32
    • Use the --num (or -n) option to specify the number of nodes under a switch. If this option is not specified, the default value 4 is used.
    • Use the --level (or -l) option to specify the switch networking type. If this option is not specified, the default value double_layer is used.
      • single_layer: single-layer switch networking
      • double_layer: double-layer switch networking
    • This script requires the openpyxl module. If the module is missing in the installation environment, run the pip install openpyxl command to install it.
  4. Check whether the ConfigMap was created successfully.
    kubectl get cm -n kube-system basic-tor-node-cm

    If the following information is displayed, the creation is successful:

    NAME                DATA   AGE
    basic-tor-node-cm   1      8s
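
The procedure above can be wrapped in a small helper. The following is a minimal sketch, not part of the mindcluster-deploy tooling: the function names preflight_problems, build_command, and run_and_verify are hypothetical, and the sketch assumes that lld_to_cm.py reads lld.xlsx from the directory it is run in. Only the --num and --level options, their defaults, and the kubectl command are taken from the steps above.

```python
import importlib.util
import subprocess
from pathlib import Path

# Networking types and the verification command documented in the procedure.
VALID_LEVELS = ("single_layer", "double_layer")
VERIFY_CMD = ["kubectl", "get", "cm", "-n", "kube-system", "basic-tor-node-cm"]


def preflight_problems(workdir):
    """Return a list of problems found in workdir; an empty list means ready."""
    workdir = Path(workdir)
    problems = []
    # Assumption: both files sit in the same directory, as in Steps 1 and 2.
    for name in ("lld.xlsx", "lld_to_cm.py"):
        if not (workdir / name).is_file():
            problems.append(f"{name} not found in {workdir}")
    if importlib.util.find_spec("openpyxl") is None:
        problems.append("openpyxl is missing; run: pip install openpyxl")
    return problems


def build_command(num=4, level="double_layer"):
    """Build the lld_to_cm.py command line from the two documented options."""
    if not isinstance(num, int) or num < 1:
        raise ValueError("num must be a positive integer")
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}")
    return ["python", "./lld_to_cm.py", "--num", str(num), "--level", level]


def run_and_verify(workdir, num=4, level="double_layer"):
    """Run the script in workdir, then report whether the ConfigMap exists."""
    problems = preflight_problems(workdir)
    if problems:
        raise RuntimeError("; ".join(problems))
    subprocess.run(build_command(num, level), cwd=workdir, check=True)
    # kubectl exits with a non-zero status when the ConfigMap is absent.
    return subprocess.run(VERIFY_CMD, capture_output=True).returncode == 0
```

For the example in Step 3, run_and_verify("/home/tor-affinity", num=32) would run the script with 32 nodes per switch and return True if basic-tor-node-cm exists afterwards.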
    

Configuring Affinity Scheduling for Switches

To configure affinity scheduling for switches, set the tor-affinity parameter in the job YAML file. The following table describes the parameter.

Table 1 Parameter description

Parameter: (.kind=="AscendJob").metadata.labels.tor-affinity

Value:
  • large-model-schema: foundation model job or padding job
  • normal-schema: common job
  • null: switch affinity scheduling not used

  NOTE:
  You need to select a job type based on the number of job replicas. If the number of job replicas is less than 4, the job is a padding job. If the number of job replicas is greater than or equal to 4, the job is a foundation model job. The number of replicas of a common job is not limited.

Description: The default value is null, indicating that switch affinity scheduling is not used. You need to set this parameter based on the job type.
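
The relationship between the tor-affinity value, the replica count, and the job type can be sketched as follows. This is an illustrative helper only: the names job_type and ascendjob_metadata are hypothetical, and writing null as the literal label string "null" is an assumption rather than something the table confirms.

```python
# Values listed in Table 1.
VALID_SCHEMAS = ("large-model-schema", "normal-schema", "null")
# Per the NOTE: fewer than 4 replicas is a padding job, 4 or more a foundation model job.
PADDING_REPLICA_LIMIT = 4


def job_type(schema, replicas):
    """Return the job type implied by a tor-affinity value and replica count."""
    if schema not in VALID_SCHEMAS:
        raise ValueError(f"tor-affinity must be one of {VALID_SCHEMAS}")
    if schema == "null":
        return "switch affinity scheduling not used"
    if schema == "normal-schema":
        # The replica count of a common job is not limited.
        return "common job"
    if replicas < PADDING_REPLICA_LIMIT:
        return "padding job"
    return "foundation model job"


def ascendjob_metadata(schema):
    """Build the part of an AscendJob manifest that carries the label."""
    if schema not in VALID_SCHEMAS:
        raise ValueError(f"tor-affinity must be one of {VALID_SCHEMAS}")
    # Assumption: the value is stored as a plain label string, including "null".
    return {"kind": "AscendJob", "metadata": {"labels": {"tor-affinity": schema}}}
```

Both a padding job and a foundation model job use the same value, large-model-schema; the replica count only determines which of the two the job is treated as.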

NOTE:
  • Switch affinity scheduling 1.0 supports Atlas training products and Atlas A2 training products, with PyTorch and MindSpore.
  • Switch affinity scheduling 2.0 supports Atlas A2 training products with PyTorch.
  • Switch affinity scheduling is supported only for entire NPUs. Static vNPU scheduling is not supported.