HCCL_EXEC_TIMEOUT
Description
During distributed training or inference, different device processes may execute different tasks; for example, only specific processes may save checkpoint data. This environment variable controls how long devices wait for one another to synchronize during task execution. Within the configured time, each device process waits for the other devices to perform communication synchronization.
- For the Atlas A3 training products/Atlas A3 inference products, the value range is [0, 2147483647], in seconds. The default value is 1836. The value 0 indicates that the wait never times out. If the orchestration expansion position of the algorithm is set to AIV (for details, see HCCL_OP_EXPANSION_MODE), the value range of this environment variable is [0, 1091] and the default value is 1091. In this case, if the value is 0 or exceeds the maximum value 1091, the value 1091 is used.
- For the Atlas A2 training products/Atlas A2 inference products, the value range is [0, 2147483647], in seconds. The default value is 1836. The value 0 indicates that the wait never times out. If the orchestration expansion position of the algorithm is set to AIV (for details, see HCCL_OP_EXPANSION_MODE), the value range of this environment variable is [0, 1091] and the default value is 1091. In this case, if the value is 0 or exceeds the maximum value 1091, the value 1091 is used.
- For the Atlas training products, the value range is (0, 17340], in seconds. The default value is 1836.
  Note: For the Atlas training products, actual timeout interval set in the system = Value of this environment variable // 68 × 68 (unit: s), where // denotes integer division. If the value of the environment variable is smaller than 68, 68s is used by default. For example, if HCCL_EXEC_TIMEOUT is set to 600, the actual timeout interval is 600 // 68 × 68 = 8 × 68 = 544s.
- For the Atlas inference products, the value range is (0, 17340], in seconds. The default value is 1836.
  Note: For the Atlas inference products, actual timeout interval set in the system = Value of this environment variable // 68 × 68 (unit: s), where // denotes integer division. If the value of the environment variable is smaller than 68, 68s is used by default. For example, if HCCL_EXEC_TIMEOUT is set to 600, the actual timeout interval is 600 // 68 × 68 = 8 × 68 = 544s.
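The value handling described in the list above can be checked with a short shell sketch. The helper names hccl_timeout_rounded and hccl_timeout_aiv are hypothetical and are not part of HCCL; they only reproduce the documented arithmetic so that you can estimate the effective wait time before setting the variable. Input validation (for example, rejecting values outside the documented ranges) is left out.

```shell
# Hypothetical helper (not an HCCL command): reproduces the documented rule for
# the Atlas training products and Atlas inference products.
hccl_timeout_rounded() {
    value=$1
    if [ "$value" -lt 68 ]; then
        # Values smaller than 68 fall back to 68s.
        echo 68
    else
        # Integer division drops the remainder, so the result is a multiple of 68.
        echo $(( value / 68 * 68 ))
    fi
}

# Hypothetical helper (not an HCCL command): reproduces the documented rule for
# the Atlas A2/A3 products when the expansion position is set to AIV.
hccl_timeout_aiv() {
    value=$1
    if [ "$value" -eq 0 ] || [ "$value" -gt 1091 ]; then
        # 0 and values above the maximum are replaced by 1091.
        echo 1091
    else
        echo "$value"
    fi
}

hccl_timeout_rounded 600    # prints 544
hccl_timeout_rounded 50     # prints 68
hccl_timeout_aiv 0          # prints 1091
hccl_timeout_aiv 300        # prints 300
```

Under the rounding rule, a configured value that is already a multiple of 68, such as the default 1836 (27 × 68), is left unchanged.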
In normal cases, retain the default configuration. If the default value cannot meet the requirements for communication synchronization between devices, set this environment variable to increase the synchronization wait time between devices.
Example
export HCCL_EXEC_TIMEOUT=1800
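The export above applies to the current shell and every process started from it. If only one job needs a longer wait time, standard shell syntax lets you scope the variable to a single command instead. The following is a minimal sketch: the value 3600 is an arbitrary illustration, and sh -c is only a stand-in for your actual launch command.

```shell
# Scope the override to one command. The stand-in `sh -c` prints the value that
# the child process sees; replace it with your actual launch command.
child_view=$(HCCL_EXEC_TIMEOUT=3600 sh -c 'echo "$HCCL_EXEC_TIMEOUT"')
echo "child process sees HCCL_EXEC_TIMEOUT=${child_view}"
```

The parent shell's own setting (or lack of one) is not changed by this form.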
Restrictions
If you initialize a communicator with specific configurations through the HCCL C APIs and set the synchronization wait time between devices through the hcclExecTimeOut parameter in HcclCommConfig, the communicator configuration is used instead of this environment variable.