HCCL_RDMA_QP_PORT_CONFIG_PATH

Description

By default, RDMA communication between two ranks uses a single queue pair (QP) for data transfer. Set this environment variable to use multiple QPs between two ranks and to specify the source port number used by each QP.

The variable specifies the directory that stores the configuration file mapping each <srcIP,dstIP> pair to ports. When multiple port numbers are configured for a <srcIP,dstIP> pair, the system enables multi-QP communication and uses each configured port number as the source port of one QP.

Environment variable configuration (example):

export HCCL_RDMA_QP_PORT_CONFIG_PATH=/home/tmp

/home/tmp is the directory that stores the configuration file MultiQpSrcPort.cfg, which maps <srcIP,dstIP> pairs to ports. The path can be absolute or relative and can contain a maximum of 4096 characters.

You must create the MultiQpSrcPort.cfg file yourself, and the file name must be exactly MultiQpSrcPort.cfg. The configuration format is as follows:

srcIP1,dstIP1=srcPort0,srcPort1,...,srcPortN
srcIPN,dstIPN=srcPort0,srcPort1,...,srcPortN

  • The file can contain a maximum of 131072 (128 × 1024) lines.
  • Each <srcIP,dstIP> address pair supports a maximum of 32 ports. However, configuring 8 or fewer ports per address pair is recommended. With more than 8 QPs, a performance gain is not guaranteed and the service may fail to run due to excessive memory usage.
  • Each <srcIP,dstIP> address pair can appear only once in the file.
  • srcIP and dstIP must be IPv4 addresses. IPv6 addresses are not supported.
  • srcIP and dstIP can be set to 0.0.0.0, which indicates all IP addresses.
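
The rules above can be checked before a job is launched. The following is a minimal sketch in Python, not part of HCCL: the function name validate_multi_qp_cfg and its messages are hypothetical. It also assumes that blank lines are skipped, that whitespace around fields is tolerated, and that ports must fit the 16-bit range 0 to 65535; none of these points is stated in this document, so HCCL's own parser may behave differently.

```python
import ipaddress

MAX_LINES = 128 * 1024      # at most 131072 lines in the file
MAX_PORTS = 32              # hard limit per <srcIP,dstIP> pair
RECOMMENDED_PORTS = 8       # recommended upper bound per pair


def validate_multi_qp_cfg(text):
    """Check MultiQpSrcPort.cfg content; return (mapping, errors, warnings)."""
    mapping, errors, warnings = {}, [], []
    entries = 0
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are skipped
        entries += 1
        pair, sep, ports = line.partition("=")
        ips = [ip.strip() for ip in pair.split(",")]
        if not sep or len(ips) != 2:
            errors.append(f"line {n}: expected srcIP,dstIP=srcPort0,...")
            continue
        try:
            # IPv4Address rejects IPv6 and malformed addresses
            key = tuple(str(ipaddress.IPv4Address(ip)) for ip in ips)
        except ipaddress.AddressValueError:
            errors.append(f"line {n}: srcIP and dstIP must be IPv4 addresses")
            continue
        if key in mapping:
            errors.append(f"line {n}: address pair {key} appears more than once")
            continue
        try:
            port_list = [int(p) for p in ports.split(",")]
        except ValueError:
            errors.append(f"line {n}: ports must be comma-separated integers")
            continue
        if any(p < 0 or p > 65535 for p in port_list):
            # assumption: ports are limited to the 16-bit range
            errors.append(f"line {n}: port outside the range 0-65535")
            continue
        if len(port_list) > MAX_PORTS:
            errors.append(f"line {n}: more than {MAX_PORTS} ports")
            continue
        if len(port_list) > RECOMMENDED_PORTS:
            warnings.append(f"line {n}: more than {RECOMMENDED_PORTS} ports")
        mapping[key] = port_list
    if entries > MAX_LINES:
        errors.append(f"too many lines: {entries} > {MAX_LINES}")
    return mapping, errors, warnings
```

Passing the file content to validate_multi_qp_cfg returns the parsed mapping together with a list of rule violations and a list of warnings for pairs that exceed the recommended 8 ports.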

The following is a configuration example of the MultiQpSrcPort.cfg file:

192.168.100.2,192.168.100.3=61100,61101,61102
192.168.100.4,192.168.100.5=61100,61101,61102,61104
0.0.0.0,192.168.100.122=65515,65516,65513
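
Because the number of QPs used for an address pair equals the number of source ports configured for it (see Restrictions), the example above results in 3, 4, and 3 QPs for its three entries. The following illustrative Python sketch derives these counts from the file content. The helper name qp_count_per_pair is hypothetical, and the sketch does not model how HCCL matches a 0.0.0.0 entry against concrete addresses.

```python
EXAMPLE_CFG = """\
192.168.100.2,192.168.100.3=61100,61101,61102
192.168.100.4,192.168.100.5=61100,61101,61102,61104
0.0.0.0,192.168.100.122=65515,65516,65513
"""


def qp_count_per_pair(text):
    """Map each (srcIP, dstIP) pair to the number of configured source ports."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        pair, _, ports = line.partition("=")
        src, dst = (ip.strip() for ip in pair.split(","))
        # One QP per configured source port
        counts[(src, dst)] = len(ports.split(","))
    return counts


counts = qp_count_per_pair(EXAMPLE_CFG)
```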

Example

export HCCL_RDMA_QP_PORT_CONFIG_PATH=/home/tmp
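
When a job is started from a Python launcher script, the same setup can be sketched in-process. The helper prepare_multi_qp_config below is hypothetical and not an HCCL API. The sketch assumes that the variable must be present in the process environment before HCCL is initialized, so the helper should be called before any communication setup.

```python
import os
from pathlib import Path

CFG_NAME = "MultiQpSrcPort.cfg"   # file name required by this document
MAX_PATH_CHARS = 4096             # maximum path length stated above


def prepare_multi_qp_config(cfg_dir, entries):
    """Write MultiQpSrcPort.cfg into cfg_dir and set the environment variable.

    entries maps (srcIP, dstIP) tuples to lists of source ports.
    """
    cfg_dir = Path(cfg_dir)
    if len(str(cfg_dir)) > MAX_PATH_CHARS:
        raise ValueError("path exceeds 4096 characters")
    cfg_dir.mkdir(parents=True, exist_ok=True)

    lines = []
    for (src, dst), ports in entries.items():
        port_str = ",".join(str(p) for p in ports)
        lines.append(f"{src},{dst}={port_str}")

    cfg_file = cfg_dir / CFG_NAME
    cfg_file.write_text("\n".join(lines) + "\n")

    # The variable points to the directory, not to the file itself
    os.environ["HCCL_RDMA_QP_PORT_CONFIG_PATH"] = str(cfg_dir)
    return cfg_file
```

For example, prepare_multi_qp_config("/home/tmp", {("192.168.100.2", "192.168.100.3"): [61100, 61101, 61102]}) writes one mapping line and sets the variable to /home/tmp.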

Restrictions

  • This environment variable supports only the single-operator calling mode. Static graph mode is not supported.
  • This environment variable takes priority over the environment variable HCCL_RDMA_QPS_PER_CONNECTION. When it is set, the number of QPs used for communication between two ranks is determined by the number of source ports configured in the MultiQpSrcPort.cfg file.
  • The QP configuration priority, from highest to lowest, is as follows:

    Multi-QP configuration on the management plane (configured using the -s multi_qp parameter of hccn_tool) > QP configuration of the NSLB (configured using the -t nslb-dp parameter of hccn_tool) > Environment variable HCCL_RDMA_QP_PORT_CONFIG_PATH > Environment variable HCCL_RDMA_QPS_PER_CONNECTION
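
    The priority chain can be read as a first-match lookup. The following Python sketch only illustrates that ordering. The labels and the function effective_qp_source are hypothetical, and the sketch does not query hccn_tool or HCCL.

```python
# Sources ordered from highest to lowest priority, as listed above
QP_CONFIG_PRIORITY = [
    "hccn_tool -s multi_qp",            # multi-QP configuration on the management plane
    "hccn_tool -t nslb-dp",             # QP configuration of the NSLB
    "HCCL_RDMA_QP_PORT_CONFIG_PATH",
    "HCCL_RDMA_QPS_PER_CONNECTION",
]


def effective_qp_source(configured):
    """Return the highest-priority source present in `configured`.

    Returns None when nothing is configured, in which case the
    default of one QP per rank pair applies.
    """
    for source in QP_CONFIG_PRIORITY:
        if source in configured:
            return source
    return None
```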

Applicability

Atlas A2 training products/Atlas A2 inference products (only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack are supported)

Atlas A3 training products/Atlas A3 inference products