HCCL_HOST_SOCKET_PORT_RANGE

Description

Specifies the host-side communication ports that HCCL uses when a communicator is created from root node information.

This environment variable can be set to a specific port, port range, or the string auto.
  • If a specific port number or port range is specified, plan at least as many ports as there are HCCL processes on a single NPU. Valid port numbers are in the range [1, 65535]. Ports [1, 1023] are reserved for the system, so avoid using them. Ensure that the specified ports are not occupied by other processes.

    Port numbers and port ranges can be combined in one value, separated by commas (,). The comma-separated entries must not overlap. For configuration details, see Example.

  • If this environment variable is set to auto, the host communication port used by HCCL is dynamically allocated by the OS.
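
The rules above (valid range, no overlapping entries, enough ports for the processes on one NPU) can be checked before a job is launched. The following is a minimal sketch, not part of HCCL: check_port_spec and is_port are hypothetical helper names, and rejecting leading zeros, reversed ranges, and embedded whitespace is an assumption about acceptable input that HCCL's own parser may not share. The sketch does not check whether a port is currently occupied.

```shell
# Hypothetical helpers, not provided by HCCL.

# Succeeds if $1 is a decimal port number in [1, 65535] without leading zeros.
is_port() {
    case "$1" in
        ''|*[!0-9]*|0*|??????*) return 1 ;;
    esac
    [ "$1" -le 65535 ]
}

# Validates a candidate HCCL_HOST_SOCKET_PORT_RANGE value.
# Prints "auto", or the total number of ports covered; returns non-zero on error.
# The body runs in a subshell so IFS and variable changes stay local.
check_port_spec() (
    spec=$1
    if [ "$spec" = "auto" ]; then echo "auto"; exit 0; fi
    case "$spec" in
        ''|,*|*,|*,,*) echo "error: empty entry in value" >&2; exit 1 ;;
    esac
    seen=""     # accepted entries, stored as space-separated "lo:hi" words
    total=0
    set -f      # no filename expansion while splitting
    IFS=','
    for item in $spec; do
        case "$item" in
            *-*) lo=${item%%-*}; hi=${item#*-} ;;
            *)   lo=$item; hi=$item ;;
        esac
        if ! is_port "$lo" || ! is_port "$hi" || [ "$lo" -gt "$hi" ]; then
            echo "error: invalid entry '$item'" >&2; exit 1
        fi
        if [ "$lo" -le 1023 ]; then
            echo "warning: '$item' includes system-reserved ports [1, 1023]" >&2
        fi
        # Reject any overlap with the entries accepted so far.
        IFS=' '
        for prev in $seen; do
            if [ "$lo" -le "${prev#*:}" ] && [ "${prev%:*}" -le "$hi" ]; then
                echo "error: '$item' overlaps an earlier entry" >&2; exit 1
            fi
        done
        IFS=','
        seen="$seen $lo:$hi"
        total=$((total + hi - lo + 1))
    done
    echo "$total"
)
```

For instance, check_port_spec "60000,60050-60100,60150-60160" prints 63, which can then be compared with the number of HCCL processes planned for one NPU.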

Example

# Method 1: Set this environment variable to a port range.
export HCCL_HOST_SOCKET_PORT_RANGE="60000-60050"
# Method 2: Combine a specific port number with port ranges, separated by commas (,).
export HCCL_HOST_SOCKET_PORT_RANGE="60000,60050-60100,60150-60160"
# Method 3: Specify individual port numbers, separated by commas (,).
export HCCL_HOST_SOCKET_PORT_RANGE="56000,56005,56007,56008,56100,56105,56107,56108"
# Method 4: Let the OS allocate port numbers dynamically.
export HCCL_HOST_SOCKET_PORT_RANGE="auto"

Restrictions

  • If multiple service processes share one NPU, you are advised to configure this environment variable. Otherwise, the service may fail to run because of port conflicts. Note, however, that sharing an NPU among multiple processes affects resource overhead and communication performance.
  • This environment variable takes precedence over HCCL_IF_BASE_PORT. If it is configured, HCCL uses the host communication ports that it specifies.
  • For the Atlas A2 training products / Atlas A2 inference products, if the network contains MC² operators (such as AllGatherMatmul, MatmulReduceScatter, and AlltoAllAllGatherBatchMatMul), this environment variable cannot be configured.
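
The precedence rule and the MC² restriction can be illustrated with a short configuration sketch. The value given to HCCL_IF_BASE_PORT here is only an assumed placeholder; see that variable's own description for its format.

```shell
# Case 1: both variables are set. Per the precedence rule above, the host-side
# ports come from HCCL_HOST_SOCKET_PORT_RANGE.
export HCCL_IF_BASE_PORT=50000                    # assumed example value
export HCCL_HOST_SOCKET_PORT_RANGE="60000-60050"

# Case 2: Atlas A2 products running a network that contains MC² operators.
# The variable must not be configured, so clear any inherited value.
unset HCCL_HOST_SOCKET_PORT_RANGE
```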

Applicability

Atlas A2 training products / Atlas A2 inference products (only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack are supported)

Atlas A3 training products / Atlas A3 inference products