HCCL_CONNECT_TIMEOUT
Description
Sets the timeout for establishing socket connections between devices in distributed training or inference scenarios. Devices progress through collective communication initialization at different rates, so this environment variable defines how long a device waits for the other devices to finish establishing their socket connections.
The value must be an integer from 120 to 7200, in seconds. The default value is 120.
Note: The actual timeout interval for socket establishment is the value of this environment variable plus 20 seconds. For example, if this environment variable is set to 150 seconds, the actual timeout interval is 170 seconds. The extra 20 seconds are used to notify each node of the cause of the communicator initialization failure.
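The arithmetic in the note above can be sketched in shell. The function name hccl_effective_timeout is a hypothetical helper introduced only for illustration and is not part of HCCL; the fixed 20-second margin and the default of 120 are taken from this description.

```shell
# Hypothetical helper (not part of HCCL): prints the actual socket
# establishment timeout, in seconds, for a configured value.
hccl_effective_timeout() {
    # Use the documented default of 120 when no argument is given,
    # then add the fixed 20 seconds described in the note.
    echo $(( ${1:-120} + 20 ))
}

hccl_effective_timeout 150   # prints 170
```

Called without an argument, the helper falls back to the documented default of 120 seconds.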
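The value of this environment variable also determines how quickly a connection fault is reported: the larger the value, the longer it takes for an exception to be reported when a connection cannot be established.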
Example
export HCCL_CONNECT_TIMEOUT=200
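A launch script may want to check the value against the documented range before exporting it. The following is a minimal sketch under that assumption; hccl_timeout_is_valid and the requested variable are hypothetical names used for illustration, and only the 120 to 7200 range comes from the description above.

```shell
# Hypothetical helper (not part of HCCL): succeeds only if $1 is an
# integer within the documented range of 120 to 7200 seconds.
hccl_timeout_is_valid() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 120 ] && [ "$1" -le 7200 ]
}

requested=200
if hccl_timeout_is_valid "$requested"; then
    export HCCL_CONNECT_TIMEOUT="$requested"
else
    echo "HCCL_CONNECT_TIMEOUT must be an integer from 120 to 7200" >&2
fi
```

Checking the value in the script reports a bad setting before the job starts, rather than leaving it to be discovered during communicator initialization.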
Restrictions
None