Multi-Server Multi-Device Training

Using a dual-server setup (8 devices per node) as an example, this section describes how to configure resource information through environment variables to start training tasks in multi-server multi-device scenarios.

Prerequisites

  • Set the following environment variables. For details, refer to the little_demo configuration file or Environment Variable Configuration.
    CM_CHIEF_IP={host_ip}
    CM_CHIEF_PORT=60000
    CM_CHIEF_DEVICE=0
    CM_WORKER_IP={host_ip}
    CM_WORKER_SIZE=8
  • The little_demo code paths and the names of the network interface cards (NICs) used for IP configuration must be identical across both servers.
  • Cluster networking must be configured. For details, see "Networking" in the Ascend Training Solution Networking Guide.
  • The Rec SDK TensorFlow training image must be created as per Rec SDK TensorFlow Training Image Building.
  • Basic cluster network settings must be verified, such as ensuring NPU device IP addresses across servers can ping each other and that TLS configurations for NPU devices are consistent. If cluster communication fails, refer to HCCL Cluster Communication Failure in Multi-Server Training for troubleshooting.
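Before starting, it can help to confirm that the variables from the first prerequisite are set in the current environment. The following is a minimal Python sketch, not part of little_demo. The helper name missing_cm_vars is hypothetical, and the check covers only presence, not whether the values are correct for your cluster.

```python
import os

# Variable names are taken from the prerequisite list above.
REQUIRED_CM_VARS = (
    "CM_CHIEF_IP",
    "CM_CHIEF_PORT",
    "CM_CHIEF_DEVICE",
    "CM_WORKER_IP",
    "CM_WORKER_SIZE",
)


def missing_cm_vars(env=None):
    """Return the required CM_* variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_CM_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_cm_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All CM_* variables are set.")
```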

Procedure

The primary steps for multi-server training include: downloading the Rec SDK TensorFlow image, creating containers, modifying SSH configurations and starting services, configuring password-free login between physical nodes, and finally configuring the little_demo model to launch the training process. In a dual-server setup, either server can be designated as the primary node.

  1. Create and configure containers on both servers.
    1. Determine the port number for the nodes. All nodes must use the same unoccupied port. Run the following command on the physical machine to check port availability (for example, port 12345):
      ss -tuln | grep 12345

      If the ss command is unavailable, install the iproute package.

      • Empty result: Port is available => Proceed to the next substep (modifying sshd_config).
      • Non-empty: Port in use => Try another port.
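If ss cannot be installed, a similar check can be approximated in Python by trying to bind to the port. This is only a sketch: port_is_free is a hypothetical helper, it tests TCP only (ss -tuln also lists UDP), and a port that is free now may still be taken before sshd starts.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True


if __name__ == "__main__":
    # 12345 is the example port used in this section.
    print("available" if port_is_free(12345) else "in use, try another port")
```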
    2. Modify the sshd_config file in the container.
      vi /etc/ssh/sshd_config
      
      1. Uncomment #Port 22 and change it to the available port (for example, 12345). This allows MPI to access other containers through this port. Ensure the host-side firewall allows traffic on this port.
      2. Optional: Uncomment #ListenAddress 0.0.0.0 and change 0.0.0.0 to the current node's IP address. If deploying through image export/cloning, manually update the SSH listening IP address to the new node's IP after starting the container.

        Skipping this step does not affect cluster training, but sshd then listens on the configured port on all of the node's IP addresses (0.0.0.0). Setting a specific listening address is recommended for improved security.

    3. Restart the SSH service within the container.
      systemctl restart sshd
      

      After restarting, run systemctl status sshd to verify the service status.

      You can run the ss -tuln | grep 12345 command to confirm that sshd is listening on the configured port.

      To stop the SSH service in the container, run:

      kill -9 `ps -ef | grep sshd | grep -v grep | awk '{print $2}'` > /dev/null 2>&1
      
    4. Configure environment variables within the container.

      Add the required variables to ~/.bashrc to ensure they are automatically loaded during password-free SSH sessions initiated by the primary node.

      The following is an example. Configure the variables based on site requirements.

      1. Open the ~/.bashrc file.
        vi ~/.bashrc
        
      2. Append the following environment paths to the end of the file:
        source /etc/profile
        source /usr/local/Ascend/cann/set_env.sh
        source /usr/local/Ascend/driver/bin/setenv.bash
        export PATH=/usr/local/openmpi/bin:$PATH
        export PATH=/usr/local/python3.7.5/bin:$PATH
        export PATH=/usr/local/gcc7.3.0/bin:$PATH
        export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
        export LD_LIBRARY_PATH=/usr/local/gcc7.3.0/lib64:$LD_LIBRARY_PATH
        export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
        
      3. Press Esc, type :wq!, and press Enter to save the changes and exit.
      4. Run source ~/.bashrc to apply the changes.
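After the environment is reloaded, a quick way to confirm that the PATH changes took effect is to look up the launch tools. The sketch below assumes that mpirun (from Open MPI) and horovodrun (used later in run.sh) are the commands of interest. missing_tools is a hypothetical helper; adjust the tool names to your image.

```python
import shutil


def missing_tools(tools=("mpirun", "horovodrun")):
    """Return the tool names that cannot be found on the current PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All tools found on PATH.")
```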
  2. Configure the little_demo model.
    1. Run the following to check the device IP addresses for the 8-device setup:
      for i in {0..7}; do hccn_tool -i $i -ip -g ; done
      
    2. Configure resource information.
      • Rank table mode: All nodes require an identical hccl_json_16p_2_host.json file. The following is an example of the hccl_json_16p_2_host.json file configuration.

        In this dual-server example, the primary node's device information must be listed first. Replace each {device_N_ip} and {host_N_ip} placeholder with the actual device or host IP address. Ensure rank IDs are in ascending order.

        {
            "server_count":"2",
            "server_list":[
                {
                    "device":[
                        { "device_id":"0", "device_ip":"{device_0_ip}", "rank_id":"0" },
                        { "device_id":"1", "device_ip":"{device_1_ip}", "rank_id":"1" },
                        { "device_id":"2", "device_ip":"{device_2_ip}", "rank_id":"2" },
                        { "device_id":"3", "device_ip":"{device_3_ip}", "rank_id":"3" },
                        { "device_id":"4", "device_ip":"{device_4_ip}", "rank_id":"4" },
                        { "device_id":"5", "device_ip":"{device_5_ip}", "rank_id":"5" },
                        { "device_id":"6", "device_ip":"{device_6_ip}", "rank_id":"6" },
                        { "device_id":"7", "device_ip":"{device_7_ip}", "rank_id":"7" }
                    ],
                    "server_id":"{host_1_ip}"
                },
                {
                    "device":[
                        { "device_id":"0", "device_ip":"{device_8_ip}", "rank_id":"8" },
                        { "device_id":"1", "device_ip":"{device_9_ip}", "rank_id":"9" },
                        { "device_id":"2", "device_ip":"{device_10_ip}", "rank_id":"10" },
                        { "device_id":"3", "device_ip":"{device_11_ip}", "rank_id":"11" },
                        { "device_id":"4", "device_ip":"{device_12_ip}", "rank_id":"12" },
                        { "device_id":"5", "device_ip":"{device_13_ip}", "rank_id":"13" },
                        { "device_id":"6", "device_ip":"{device_14_ip}", "rank_id":"14" },
                        { "device_id":"7", "device_ip":"{device_15_ip}", "rank_id":"15" }
                    ],
                    "server_id":"{host_2_ip}"
                }
            ],
            "status":"completed",
            "version":"1.0"
        }
      • No-rank-table mode: No JSON file is needed, but environment variables must be set manually in main.py on all worker nodes.
        Add the following code on the line after import os at the top of main.py:
        # If no ranktable is used, set the CM_WORKER_IP environment variable to the IP address of the current node. {host_ip} indicates the IP address of the current node.
        if not os.getenv("RANK_TABLE_FILE", ""):
            os.environ['CM_WORKER_IP'] = "{host_ip}"
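For rank table mode, writing 16 device entries by hand is error-prone, so the file can also be generated. The sketch below builds the same structure as the hccl_json_16p_2_host.json example above. build_rank_table is a hypothetical helper, and the addresses in the demo are documentation placeholders, not real cluster values.

```python
import json


def build_rank_table(servers):
    """Build a rank table dict (hypothetical helper).

    servers: list of (host_ip, [device_ip, ...]) tuples, primary node first.
    """
    server_list = []
    rank_id = 0
    for host_ip, device_ips in servers:
        devices = []
        for device_id, device_ip in enumerate(device_ips):
            devices.append({
                "device_id": str(device_id),  # restarts at 0 on every server
                "device_ip": device_ip,
                "rank_id": str(rank_id),      # keeps ascending across servers
            })
            rank_id += 1
        server_list.append({"device": devices, "server_id": host_ip})
    return {
        "server_count": str(len(server_list)),
        "server_list": server_list,
        "status": "completed",
        "version": "1.0",
    }


if __name__ == "__main__":
    # Placeholder addresses for illustration only; replace with real values.
    primary = ("192.0.2.1", ["198.51.100.%d" % i for i in range(1, 9)])
    worker = ("192.0.2.2", ["198.51.100.%d" % i for i in range(9, 17)])
    print(json.dumps(build_rank_table([primary, worker]), indent=4))
```

The device IP lists can be filled in from the hccn_tool output collected on each server. Copy the resulting file to every node so that all nodes use identical content.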
    3. Modify the run.sh script (only on the primary node).
      1. Change the value of num_server to the actual number of nodes, for example, 2.
      2. Delete the -mca btl_tcp_if_exclude docker0 string from the value of mpi_args.
      3. Change the value of interface to the name of the NIC configured with the current host IP address. You can run the ip addr command to query the value.
      4. In rank table mode, also change the JSON file name in the export RANK_TABLE_FILE line to match the rank table file configured in the preceding step. For example:
        Change
        export RANK_TABLE_FILE="${cur_path}/hccl_json_${local_rank_size}p.json"
        

        to:

        export RANK_TABLE_FILE="${cur_path}/hccl_json_16p_2_host.json"  
        
      5. Specify the port number and modify the host parameter in the horovodrun startup command as follows:

        At the end of the run.sh script, change

        xxx --mpi-args "${mpi_args}" --mpi -H localhost:${local_rank_size} 
        

        to:

        xxx --mpi-args "${mpi_args}" -p 12345 --mpi -H {host_1_ip}:8,{host_2_ip}:8
        • -p 12345: listening port number of the SSH server in the container.
        • 8: number of devices involved in training on a single node.
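The value passed to -H is a comma-separated list of host:slots pairs. As a small illustration of that format, the following sketch joins a list of node IP addresses into it. horovod_hosts is a hypothetical helper, and the addresses are placeholders.

```python
def horovod_hosts(host_ips, devices_per_node=8):
    """Join host IPs into the "ip:slots,ip:slots" form used by the -H option."""
    return ",".join("%s:%d" % (ip, devices_per_node) for ip in host_ips)


if __name__ == "__main__":
    # Primary node first, then the worker node (placeholder addresses).
    print(horovod_hosts(["192.0.2.1", "192.0.2.2"]))
```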
  3. Set password-free login on each node.
    1. In the container on each node, run the following command to set password-free login. {target_host_user} is the user name on the peer node, and {target_host_ip} is the IP address of the peer node.
      ssh-copy-id -i ~/.ssh/id_rsa.pub {target_host_user}@{target_host_ip}

      By default, this command appends the public key to the ~/.ssh/authorized_keys file on the host machine of the peer node. You need to copy the authorized_keys file on the host machine to the ~/.ssh directory in the container.

      If the system reports that id_rsa.pub does not exist, generate it in the container with the command below. For security purposes, you are advised to change the current umask value to 0077 before running the command and restore the original value afterward. When the message "Enter passphrase" is displayed, you are also advised to enter a key password that meets the complexity requirements.

      ssh-keygen -t rsa -b 3072 -f ~/.ssh/id_rsa
      

      The preceding is an example. Pay attention to the risks of using and keeping the SSH key and key password, especially the risks when the key is not encrypted. You need to perform related configurations according to the security policies of your organization, such as password complexity requirements and security configurations (protocols, cipher suites, key lengths, and whether ssh-keygen can be used).

    2. Set the SSH agent to manage SSH keys.

      Run the following commands to set the SSH agent:

      a. Start the bash process of the SSH agent.

      ssh-agent bash

      When this command is executed, the environment variables in the container will be reset. You are advised to save necessary environment variables to the ~/.bashrc file in the container and run the source ~/.bashrc command to reconfigure the environment variables after the command is executed.

      b. Add a private key to the SSH agent.

      ssh-add ~/.ssh/id_rsa

      When the message "Enter passphrase for /root/.ssh/id_rsa:" is displayed after the preceding command is executed, enter the password that was set when the id_rsa private key was generated.

      c. Check whether the private key is successfully added.

      ssh-add -l 
    3. Check whether the password-free login is successfully configured.
      ssh-keygen -R {target_host_ip} # Delete the host key cache of the target IP address from the current node.
      ssh-keygen -R "[{target_host_ip}]:12345" # Delete the host key cache of the port corresponding to the target IP address from the current host.
      ssh {target_host_user}@{target_host_ip} # Verify the password-free login to the target IP address.
      ssh {target_host_user}@{target_host_ip} -p 12345 # Verify the password-free login to the specified port of the target IP address.
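The same check can be scripted so that it fails instead of prompting for a password. The sketch below relies on the standard OpenSSH options BatchMode and ConnectTimeout. The function names are hypothetical. It assumes the peer's host key has already been accepted (for example, through the interactive commands above), because in batch mode ssh also refuses to prompt for an unknown host key.

```python
import subprocess


def ssh_check_command(user, host, port=12345):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",     # never prompt; fail if key login does not work
        "-o", "ConnectTimeout=5",  # give up after 5 seconds
        "-p", str(port),
        "%s@%s" % (user, host),
        "true",                    # remote no-op command
    ]


def passwordless_login_works(user, host, port=12345):
    """Return True if the ssh command exits with status 0."""
    result = subprocess.run(
        ssh_check_command(user, host, port),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Run it inside the ssh-agent bash process so that the key added with ssh-add is available to ssh.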
  4. Start the model on the primary node. You only need to start the training job on the primary node; MPI automatically launches the job on the worker node, so no manual start is required there.
    • Rank table mode:
      bash run.sh main.py
      
    • No-rank-table mode: {host_1_ip} indicates the IP address of the primary node.
      bash run.sh main.py {host_1_ip}

      After the training job is started, run the exit command to exit the bash process of ssh-agent to prevent security risks.