HCCL_EXEC_TIMEOUT
Description
During distributed training or inference, different device processes may not all execute the same tasks. For example, only specific processes may save checkpoint data. This environment variable sets the synchronization wait time between devices during task execution. Within the configured time, each device process waits for the other devices to reach communication synchronization.
- For the Atlas Training Series Product, the value range is (0, 17340], in seconds. The default value is 1836.
  Note: For the Atlas Training Series Product, the actual timeout interval set in the system = Value of this environment variable // 68 x 68 (unit: s), where // is integer division (the remainder is discarded). If the value of the environment variable is smaller than 68, 68s is used by default. For example, if HCCL_EXEC_TIMEOUT is set to 600, the actual timeout interval is 600 // 68 x 68 = 8 x 68 = 544s.
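The rounding rule in the note above can be checked with a small shell sketch before a value is exported. The function name effective_hccl_exec_timeout is hypothetical and is not part of HCCL. The sketch only restates the arithmetic given for the Atlas Training Series Product. Rejecting values outside (0, 17340] is an assumption, because this section does not say how HCCL itself handles them.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of HCCL): estimates the timeout that takes
# effect on the Atlas Training Series Product for a requested value.
effective_hccl_exec_timeout() {
    local requested=$1
    local step=68   # granularity in seconds, taken from the note above

    # Assumption: values outside (0, 17340] are rejected by this sketch.
    # How HCCL treats them is not stated in this section.
    if (( requested <= 0 || requested > 17340 )); then
        echo "value out of range (0, 17340]: ${requested}" >&2
        return 1
    fi

    # Values smaller than 68 fall back to 68s.
    if (( requested < step )); then
        echo "${step}"
        return 0
    fi

    # Integer division discards the remainder, then multiply back by 68.
    echo $(( requested / step * step ))
}

effective_hccl_exec_timeout 600    # prints 544
effective_hccl_exec_timeout 1800   # prints 1768
effective_hccl_exec_timeout 30     # prints 68
```

Because the value is rounded down to a multiple of 68, choosing a multiple of 68 (such as the default, 1836 = 27 x 68) makes the configured value and the effective value the same.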
In most cases, keep the default value. If the default value does not allow enough time for communication synchronization between devices, set this environment variable to a larger value to lengthen the synchronization wait period.
Example
export HCCL_EXEC_TIMEOUT=1800
Restrictions
None