HCCL_MULTI_QP_THRESHOLD

Description

Sets the minimum amount of data that each queue pair (QP) carries when ranks communicate over RDMA using multiple QPs.

The value must be an integer from 1 to 8192, in KB. The default value is 512.

  • If (data size of a single communication between ranks / configured value of HCCL_RDMA_QPS_PER_CONNECTION) is less than the configured value of HCCL_MULTI_QP_THRESHOLD, HCCL automatically reduces the number of QPs during execution so that the data size carried by each QP is greater than or equal to the value of HCCL_MULTI_QP_THRESHOLD. For example:

    Assume that the data size of a single communication between ranks is 1 MB, HCCL_RDMA_QPS_PER_CONNECTION is set to 4, and HCCL_MULTI_QP_THRESHOLD is set to 512, so each QP must carry at least 512 KB of data. With 4 QPs, each QP would carry only 256 KB (1 MB / 4), which is below the threshold. HCCL therefore reduces the number of QPs to 2 during execution, that is, only two QPs are used for data transmission between ranks, each carrying 512 KB.

  • If the data size of a single communication between ranks is less than HCCL_MULTI_QP_THRESHOLD, single-QP data transmission is used.
  • If each QP carries more than 512 KB of data and RDMA traffic is measured with the HCCL Test tool (inter-device traffic only, without the HCCS link), the delivery scheduling overhead in the multi-QP scenario is less than 3% worse than in the single-QP scenario.
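
The reduction rule in the list above can be sketched as a small shell function. This is only an illustration that is consistent with the documented example, not the HCCL implementation: the function name effective_qps and its three arguments (data size in KB, configured QPs per connection, threshold in KB) are hypothetical, and the use of integer division is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCCL): estimates how many QPs stay in use.
# Arguments: data size in KB, configured QPs per connection, threshold in KB.
effective_qps() {
    data_kb=$1
    configured_qps=$2
    threshold_kb=$3

    # Largest QP count for which each QP still carries at least threshold_kb.
    # Assumption: integer division, rounding down.
    max_qps=$((data_kb / threshold_kb))

    # Data smaller than the threshold falls back to a single QP.
    if [ "$max_qps" -lt 1 ]; then
        max_qps=1
    fi

    # Never use more QPs than were configured.
    if [ "$max_qps" -gt "$configured_qps" ]; then
        max_qps=$configured_qps
    fi

    echo "$max_qps"
}

effective_qps 1024 4 512   # documented example (1 MB, 4 QPs, 512 KB): prints 2
effective_qps 256 4 512    # data below the threshold: prints 1
effective_qps 8192 4 512   # enough data for all configured QPs: prints 4
```

The first call reproduces the example from the text. The other two calls cover the single-QP fallback and the case in which no reduction is needed.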

You can use the environment variable HCCL_RDMA_QPS_PER_CONNECTION or HCCL_RDMA_QP_PORT_CONFIG_PATH to enable multi-QP communication.

Example

export HCCL_MULTI_QP_THRESHOLD=512
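
Because multi-QP communication is enabled through a separate variable, the threshold would typically be set together with it. The following sketch reuses the values from the example in the Description (4 QPs per connection, 512 KB threshold). The values are illustrative only, not a tuning recommendation.

```shell
# Enable multi-QP communication with 4 QPs per connection (illustrative value).
export HCCL_RDMA_QPS_PER_CONNECTION=4
# Require each QP to carry at least 512 KB (the default value).
export HCCL_MULTI_QP_THRESHOLD=512
```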

Restrictions

This environment variable supports only the single-operator calling mode and does not support the static graph mode.

Applicability

Atlas A2 training products/Atlas A2 inference products (only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack are supported)

Atlas A3 training products/Atlas A3 inference products