HCCL_EXEC_TIMEOUT

Description

During distributed training or inference, device processes do not always execute the same tasks. For example, only some processes may save checkpoint data. This environment variable sets the synchronization wait time between devices during task execution. Within the configured time, each device process waits for the other devices to perform communication synchronization.

  • For Atlas A3 training products / Atlas A3 inference products, the value range is [0, 2147483647], in seconds. The default value is 1836. The value 0 indicates that the wait never times out.

    If the orchestration expansion position of the algorithm is set to AIV (for details, see HCCL_OP_EXPANSION_MODE), the value range of this environment variable is [0, 1091] and the default value is 1091. A value of 0 or a value greater than 1091 is treated as 1091.

  • For Atlas A2 training products / Atlas A2 inference products, the value range is [0, 2147483647], in seconds. The default value is 1836. The value 0 indicates that the wait never times out.

    If the orchestration expansion position of the algorithm is set to AIV (for details, see HCCL_OP_EXPANSION_MODE), the value range of this environment variable is [0, 1091] and the default value is 1091. A value of 0 or a value greater than 1091 is treated as 1091.

  • For Atlas training products, the value range is (0, 17340], in seconds. The default value is 1836.

    Note: For Atlas training products, the actual timeout interval set in the system = (value of this environment variable // 68) × 68, in seconds, where // denotes integer division. If the value of the environment variable is smaller than 68, 68s is used.

    For example, if HCCL_EXEC_TIMEOUT is set to 600, the actual timeout interval is 600 // 68 × 68 = 8 × 68 = 544s.

  • For Atlas inference products, the value range is (0, 17340], in seconds. The default value is 1836.

    Note: For Atlas inference products, the actual timeout interval set in the system = (value of this environment variable // 68) × 68, in seconds, where // denotes integer division. If the value of the environment variable is smaller than 68, 68s is used.

    For example, if HCCL_EXEC_TIMEOUT is set to 600, the actual timeout interval is 600 // 68 × 68 = 8 × 68 = 544s.
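The value handling described in the list above can be sketched in shell to predict which timeout a given setting is likely to produce. This is only an illustration: HCCL applies these rules internally, the function names (is_uint, effective_timeout_rounded, effective_timeout_aiv) are hypothetical, and rejecting out-of-range input is an assumption, because the text above does not say how such values are treated.

```shell
# Hypothetical helpers that mirror the documented rules; not part of HCCL.

# Accept only plain non-negative decimal integers of at most 10 digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*|0[0-9]*) return 1 ;;
    esac
    [ "${#1}" -le 10 ]
}

# Atlas training products / Atlas inference products:
# range (0, 17340], rounded down to a multiple of 68, with a minimum of 68.
effective_timeout_rounded() {
    local value="$1"
    is_uint "$value" || return 1
    # Assumption: values outside (0, 17340] are rejected here.
    if [ "$value" -le 0 ] || [ "$value" -gt 17340 ]; then
        return 1
    fi
    if [ "$value" -lt 68 ]; then
        echo 68
    else
        echo $(( value / 68 * 68 ))
    fi
}

# Atlas A2/A3 products with the expansion position set to AIV:
# range [0, 1091]; 0 or any value above 1091 is treated as 1091.
effective_timeout_aiv() {
    local value="${1:-1091}"   # no argument: documented default
    is_uint "$value" || return 1
    if [ "$value" -eq 0 ] || [ "$value" -gt 1091 ]; then
        echo 1091
    else
        echo "$value"
    fi
}

effective_timeout_rounded 600   # prints 544
effective_timeout_aiv 2000      # prints 1091
```

Because of the rounding, a value that is not a multiple of 68 yields a shorter wait than requested on Atlas training products and Atlas inference products, so choosing a multiple of 68 avoids surprises.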

In most cases, keep the default configuration. If the default value does not leave enough time for communication synchronization between devices, set this environment variable to a larger value to lengthen the synchronization wait period.

Example

export HCCL_EXEC_TIMEOUT=1800
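
A launch script may want to validate the value before exporting it, so that a typo is caught before the job starts. The following is a minimal sketch, assuming the Atlas A2/A3 range [0, 2147483647]; the helper name set_hccl_exec_timeout is hypothetical and is not provided by HCCL.

```shell
# Hypothetical helper: export HCCL_EXEC_TIMEOUT only if the value is valid.
# Assumes the Atlas A2/A3 range [0, 2147483647]; adjust for other products.
set_hccl_exec_timeout() {
    local value="$1"
    case "$value" in
        ''|*[!0-9]*|0[0-9]*)
            echo "HCCL_EXEC_TIMEOUT must be a non-negative integer: '$value'" >&2
            return 1
            ;;
    esac
    if [ "${#value}" -gt 10 ] || [ "$value" -gt 2147483647 ]; then
        echo "HCCL_EXEC_TIMEOUT out of range: '$value'" >&2
        return 1
    fi
    # On Atlas A2/A3 products, 0 means the wait never times out.
    export HCCL_EXEC_TIMEOUT="$value"
}

set_hccl_exec_timeout 3600 && echo "HCCL_EXEC_TIMEOUT=$HCCL_EXEC_TIMEOUT"
```

The helper leaves any previously exported value untouched when validation fails.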

Restrictions

If you call the HCCL C APIs to initialize a communicator with specific configurations and set the hcclExecTimeOut parameter in HcclCommConfig, the value configured for the communicator is used as the synchronization wait period between devices instead of this environment variable.

Applicability

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products (Only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack are supported.)

Atlas training products

Atlas inference products (Only the Atlas 300I Duo inference card is supported.)