Execution

Prerequisites

  • The firewall has been disabled in the operating environment.
  • The number of concurrent connections that the master node can process is limited by the Linux kernel parameters somaxconn and tcp_max_syn_backlog. In large-scale cluster networking, small values of these parameters may cause some clients to exit abnormally, leading to cluster initialization failures.

    For large-scale cluster networking, it is recommended to adjust the values of somaxconn and tcp_max_syn_backlog on the master node based on the cluster scale, for example:

    sysctl -w net.core.somaxconn=65535 
    sysctl -w net.ipv4.tcp_max_syn_backlog=65535
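    Before raising the limits, you can inspect the current values. The sketch below reads them through /proc, which works without root on standard Linux systems:

    ```shell
    # Read the current limits via /proc (equivalent to "sysctl net.core.somaxconn"
    # and "sysctl net.ipv4.tcp_max_syn_backlog"); no root privileges required.
    cat /proc/sys/net/core/somaxconn
    cat /proc/sys/net/ipv4/tcp_max_syn_backlog
    ```

    Note that values set with sysctl -w do not survive a reboot; to persist them, also add the settings to /etc/sysctl.conf and reload with sysctl -p.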

Procedure

  1. Configure the environment variables on which the HCCL Performance Tester depends.
    • In the MPICH installation scenario:
      export INSTALL_DIR=/usr/local/Ascend/ascend-toolkit/latest
      export PATH=/usr/local/mpich/bin:$PATH
      export LD_LIBRARY_PATH=/usr/local/mpich/lib:${INSTALL_DIR}/lib64:$LD_LIBRARY_PATH
    • In the Open MPI installation scenario:
      export INSTALL_DIR=/usr/local/Ascend/ascend-toolkit/latest
      export PATH=/usr/local/openmpi/bin:$PATH
      export LD_LIBRARY_PATH=/usr/local/openmpi/lib:${INSTALL_DIR}/lib64:$LD_LIBRARY_PATH

    INSTALL_DIR is the CANN installation directory; /usr/local/Ascend is the default installation path for the root user. If CANN was installed by a non-root user or to a custom path, replace it with the actual path.

    /usr/local/mpich and /usr/local/openmpi are the MPI installation paths. Replace them with actual paths.

    If the preceding environment variables are already configured, you do not need to set them again.

  2. Configure HCCL environment variables.
    1. Configure environment variables for initializing the root communication NIC on the node that launches the training process.
      Configure the IP version and the NIC name used for initializing the HCCL root communication NIC. HCCL obtains the host IP address from the configured NIC name and uses it to create a communicator.
      # IP version used by the initial root communication NIC of the HCCL. AF_INET indicates that IPv4 is used. AF_INET6 indicates that IPv6 is used.
      export HCCL_SOCKET_FAMILY=AF_INET
      
      # The following NIC name formats are supported. (Choose one of the four formats. To configure multiple NICs in the environment variable, separate them with commas (,); the first matched NIC is used as the root NIC.)
      
      # Exact match of the NIC
      export HCCL_SOCKET_IFNAME==eth0,enp0   # Use the specified eth0 or enp0 NIC.
      export HCCL_SOCKET_IFNAME=^=eth0,enp0     # Do not use the eth0 or enp0 NIC.
      
      # Fuzzy match of the NIC
      export HCCL_SOCKET_IFNAME=eth,enp       # Use all NICs prefixed with eth or enp.
      export HCCL_SOCKET_IFNAME=^eth,enp      # Do not use any NIC prefixed with eth or enp.

      Notes:

      When the MPI tool is executed, environment variables are synchronized to all nodes. If the NIC names of different nodes involved in collective communication are different, for example, the NIC name of node 1 is eth1 and that of node 2 is eth2, you need to configure environment variables using fuzzy match.
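      To decide between exact and fuzzy match, you can first list the NIC names on each node. The sketch below uses /sys/class/net, which is standard on Linux; choose a prefix shared by all nodes (for example eth) for fuzzy match, or a single exact name for exact match:

      ```shell
      # List the NIC names present on this node. If nodes expose different
      # names (e.g. eth1 vs. eth2), use the common prefix with fuzzy match.
      ls /sys/class/net
      ```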

    2. Adjust the timeout for socket connection establishment.

      In the collective communication scenario, the default timeout for establishing a socket connection between devices is 120s. When the master node must establish many connections and process a large amount of data, the default value may be insufficient and needs to be increased.

      For example, if the number of NICs in a cluster is 3,000, change the timeout to 240s. If the number of NICs in a cluster is 5,000, change the timeout to 600s.

      export HCCL_CONNECT_TIMEOUT=600
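      The examples above can be sketched as a small sizing rule. The thresholds below are taken directly from those examples and are assumptions for illustration, not documented limits:

      ```shell
      # Hypothetical sizing helper: pick HCCL_CONNECT_TIMEOUT (in seconds)
      # from the number of NICs, using the thresholds from the examples above.
      nics=5000
      if [ "$nics" -ge 5000 ]; then
        timeout=600
      elif [ "$nics" -ge 3000 ]; then
        timeout=240
      else
        timeout=120   # the default
      fi
      echo "$timeout"
      ```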
    3. Adjust the size of the shared buffer between NPUs.

      The default size of the shared buffer between two NPUs is 200 MB, which can be adjusted by using the environment variable HCCL_BUFFSIZE. The unit is MB, and the value must be greater than or equal to 1 MB.

      In a collective communication network, each HCCL communicator occupies a buffer of HCCL_BUFFSIZE. If a cluster network contains many HCCL communicators, total buffer usage grows and may leave insufficient memory for model data; in this case, reduce HCCL_BUFFSIZE to shrink the buffer occupied by each communicator. Conversely, if the model data of the service is small but the communication data is large, increase HCCL_BUFFSIZE to improve data communication efficiency.

      When the hccl_test tool is used to perform a performance test, the communication data size is large. In this case, you can increase the value of HCCL_BUFFSIZE to improve data communication efficiency. For the collective communication operator, when the test data size exceeds the value of HCCL_BUFFSIZE, the performance may deteriorate. It is recommended that the value of HCCL_BUFFSIZE be greater than the test data size.

      Example:

      export HCCL_BUFFSIZE=2048

      For more environment variables, see "Collective Communication" in Environment Variables.
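      As a quick check of the sizing guidance above, the following computes the minimum HCCL_BUFFSIZE (in MB) for the largest test data size used later in this section (64M, from the -e 64M test option); the round-up rule is an assumption for illustration:

      ```shell
      # Round the maximum test data size up to whole MB; HCCL_BUFFSIZE should
      # be at least this value ("-e 64M" is 64 * 1024 * 1024 bytes here).
      maxbytes=$((64 * 1024 * 1024))
      mb=$((1024 * 1024))
      min_buffsize=$(( (maxbytes + mb - 1) / mb ))
      echo "$min_buffsize"
      ```

      Any value of at least 64 satisfies the recommendation; the example export HCCL_BUFFSIZE=2048 leaves ample headroom.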

  3. Configure the hostfile file.

    The hostfile file specifies the nodes on which the communication processes are to be started. It is a plain-text file that you create yourself.

    • When MPICH is installed, only IPv4 communication is supported. The content format is as follows:
      {Node IP address}:{Number of processes on each node}

      For example, define the hostfile file name as hostfile. The content is as follows:

      10.10.130.22:8
      10.10.130.21:8
    • When Open MPI is installed, both IPv4 and IPv6 communications are supported. The content format is as follows:
      {Node name} slots={Number of processes on the node}

      For example, define the hostfile file name as hostfile. The content is as follows:

      node3 slots=8
      node4 slots=8
    • There is no need to configure the hostfile file for single-server scenarios.
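    For clusters with many nodes, the hostfile can be generated rather than hand-written. The sketch below emits the MPICH-style format for the example IPs above; the IP addresses and slot count are illustrative:

    ```shell
    # Write one "{IP}:{processes}" line per node in MPICH hostfile format.
    slots=8
    for ip in 10.10.130.22 10.10.130.21; do
      printf '%s:%s\n' "$ip" "$slots"
    done > hostfile

    cat hostfile
    ```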
  4. Execute the HCCL Performance Tester.

    Execute the HCCL Performance Tester in the hccl_test directory.

    • In the MPICH installation scenario, the command format is as follows:
      mpirun [-f hostfile] [-n number] ./bin/<executable_file> [-p npus] [-b minbytes] [-e maxbytes] [-f stepfactor] [-o operator] [-r root] [-d datatype] [-n iters] [-w warmup_iters] [-c <0/1>]

      Command example:

      mpirun -f hostfile -n 16 ./bin/all_reduce_test -p 8 -b 8K -e 64M -f 2 -d fp32 -o sum
      • mpirun is followed by MPI options.
      • ./bin/<executable_file> is followed by the options of the HCCL Performance Tester.

      For details about the options of the MPICH and collective communication test commands, see Description of Command-Line Options.

    • In the Open MPI installation scenario, the command format is as follows:
      mpirun [-hostfile hostfile] [-n number] [-x environment_variable_name] [--allow-run-as-root] [--mca key value] ./bin/<executable_file> [-p npus] [-b minbytes] [-e maxbytes] [-f stepfactor] [-o operator] [-r root] [-d datatype] [-n iters] [-w warmup_iters] [-c <0/1>]

      Command example:

      mpirun -hostfile hostfile -x LD_LIBRARY_PATH -x HCCL_SOCKET_FAMILY -x HCCL_SOCKET_IFNAME -x HCCL_CONNECT_TIMEOUT -x HCCL_BUFFSIZE --allow-run-as-root --mca btl_tcp_if_include eth0 --mca opal_set_max_sys_limits 1 -n 16 ./bin/all_reduce_test -p 16 -b 8K -e 64M -i 0 -o sum -d fp32 -w 3 -n 3
      • mpirun is followed by MPI options.
      • ./bin/<executable_file> is followed by the options of the HCCL Performance Tester.

      For details about the options of the Open MPI and collective communication test commands, see Description of Command-Line Options.