HCCL_RDMA_TIMEOUT

Description

Configures the retransmission timeout of the RDMA NIC.

The minimum retransmission timeout of the RDMA NIC is 4.096 μs × 2 ^ timeout, where timeout is the value of this environment variable. The actual retransmission timeout also depends on the state of the user's network.

  • For the Atlas A3 training products/Atlas A3 inference products, set this environment variable to an integer ranging from 5 to 20. The default value is 20.
  • For the Atlas A2 training products/Atlas A2 inference products, set this environment variable to an integer ranging from 5 to 20. The default value is 20.
  • For the Atlas training products, set this environment variable to an integer ranging from 5 to 24. The default value is 20.
  • For the Atlas inference products, set this environment variable to an integer ranging from 5 to 24. The default value is 20.
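As a rough illustration of the formula above, the following sketch computes the minimum retransmission timeout for the boundary values of the ranges and for the default. The function name hccl_rdma_min_timeout_us is hypothetical and is not part of HCCL; it only evaluates 4.096 μs × 2 ^ timeout with integer arithmetic and assumes a shell with 64-bit arithmetic.

```shell
# Hypothetical helper (not part of HCCL): prints the minimum retransmission
# timeout in microseconds for a given HCCL_RDMA_TIMEOUT value.
# 4.096 us equals 4096 ns, so the result in nanoseconds is 4096 * 2^timeout.
hccl_rdma_min_timeout_us() {
    printf '%d.%03d\n' "$(( (4096 << $1) / 1000 ))" "$(( (4096 << $1) % 1000 ))"
}

# Lower bound, the value used in the example, the default, and the upper
# bound for Atlas training/inference products.
for t in 5 6 20 24; do
    echo "timeout=$t -> $(hccl_rdma_min_timeout_us "$t") us"
done
# timeout=5 -> 131.072 us
# timeout=6 -> 262.144 us
# timeout=20 -> 4294967.296 us
# timeout=24 -> 68719476.736 us
```

Under this formula, the default value of 20 corresponds to a minimum retransmission timeout of about 4.3 seconds, and the maximum value of 24 to about 68.7 seconds.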

Example

# Setting this environment variable to 6 gives a minimum retransmission timeout of 4.096 μs × 2 ^ 6 = 262.144 μs when the RDMA function is enabled on the NIC.
export HCCL_RDMA_TIMEOUT=6
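
Because the valid range differs by product, a launch script may want to check the value before exporting it. The following is a minimal sketch under that assumption. The function set_hccl_rdma_timeout and its second argument (the product-specific upper bound) are hypothetical and not part of HCCL.

```shell
# Hypothetical helper (not part of HCCL): exports HCCL_RDMA_TIMEOUT only if
# the value is an integer in the range [5, max].
# Usage: set_hccl_rdma_timeout <value> <max>
#   max=20 for Atlas A3 and Atlas A2 products
#   max=24 for Atlas training products and Atlas inference products
set_hccl_rdma_timeout() {
    # Reject empty input and anything containing a non-digit character.
    case "$1" in
        ''|*[!0-9]*) echo "not a non-negative integer: $1" >&2; return 1 ;;
    esac
    # Reject values outside the documented range.
    if [ "$1" -lt 5 ] || [ "$1" -gt "$2" ]; then
        echo "out of range [5, $2]: $1" >&2
        return 1
    fi
    export HCCL_RDMA_TIMEOUT="$1"
}

set_hccl_rdma_timeout 6 20 && echo "HCCL_RDMA_TIMEOUT=$HCCL_RDMA_TIMEOUT"
```

On failure the helper returns a non-zero status and leaves any existing HCCL_RDMA_TIMEOUT value unchanged.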

Restrictions

None

Applicability

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products (only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack are supported)

Atlas training products

Atlas inference products (only the Atlas 300I Duo inference card is supported)