HCCL_OP_RETRY_PARAMS

Description

Sets the retry parameters that apply once the HCCL operator retry feature has been enabled through HCCL_OP_RETRY_ENABLE: the maximum number of retries, the wait period before the first retry, and the interval between two consecutive retries.

The configuration method is as follows:

export HCCL_OP_RETRY_PARAMS="MaxCnt:3,HoldTime:5000,IntervalTime:1000"
  • MaxCnt: maximum number of retries. The value is of the uint32 type, in the range [1,10]. The default value is 1.
  • HoldTime: wait period from the time when a communication operator execution failure is detected to the first retry of that operator. The value is of the uint32 type, in the range [0,60000], in milliseconds. The default value is 5000.
  • IntervalTime: interval between two consecutive retries of the same communication operator. The value is of the uint32 type, in the range [0,60000], in milliseconds. The default value is 1000.
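The ranges listed above can be checked before the variable is exported. The following is a minimal Bash sketch, not part of HCCL: the function name validate_hccl_retry_params is hypothetical, and the helper assumes that a key omitted from the string falls back to its documented default, which this page does not state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an HCCL tool): checks a candidate value for
# HCCL_OP_RETRY_PARAMS against the documented ranges.
validate_hccl_retry_params() {
    local params="$1"
    # Documented defaults; treating omitted keys as defaults is an assumption.
    local max_cnt=1 hold_time=5000 interval_time=1000
    local pair key value
    local -a pairs

    [[ -n "$params" ]] || { echo "empty value" >&2; return 1; }

    IFS=',' read -r -a pairs <<< "$params"
    for pair in "${pairs[@]}"; do
        key="${pair%%:*}"
        value="${pair#*:}"
        # Each entry must look like Key:<digits>; 5 digits cover the largest bound (60000).
        if [[ "$pair" != *:* || ! "$value" =~ ^[0-9]{1,5}$ ]]; then
            echo "invalid entry: $pair" >&2
            return 1
        fi
        # 10# forces base 10 so that a value such as 08 is not read as octal.
        case "$key" in
            MaxCnt)       max_cnt=$((10#$value)) ;;
            HoldTime)     hold_time=$((10#$value)) ;;
            IntervalTime) interval_time=$((10#$value)) ;;
            *) echo "unknown key: $key" >&2; return 1 ;;
        esac
    done

    if (( max_cnt < 1 || max_cnt > 10 )); then
        echo "MaxCnt out of range [1,10]: $max_cnt" >&2
        return 1
    fi
    if (( hold_time > 60000 )); then
        echo "HoldTime out of range [0,60000]: $hold_time" >&2
        return 1
    fi
    if (( interval_time > 60000 )); then
        echo "IntervalTime out of range [0,60000]: $interval_time" >&2
        return 1
    fi
    return 0
}

# Usage: export the variable only if the candidate passes the range checks.
candidate="MaxCnt:3,HoldTime:5000,IntervalTime:1000"
if validate_hccl_retry_params "$candidate"; then
    export HCCL_OP_RETRY_PARAMS="$candidate"
fi
```

A check of this kind catches out-of-range values in the launch script itself, independently of how HCCL handles an invalid setting.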

Example

export HCCL_OP_RETRY_PARAMS="MaxCnt:5,HoldTime:5000,IntervalTime:5000"
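To see what the example setting implies, the configured wait periods can be added up. The sketch below assumes that every retry fails, that HoldTime is applied once before the first retry, and that IntervalTime is applied before each later retry. It counts only the configured waits, not the time each attempt spends running, so it is a rough lower bound rather than a figure reported by HCCL. The function name retry_wait_budget_ms is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sums the configured waits for a given setting,
# assuming HoldTime precedes the first retry and IntervalTime precedes
# each of the remaining (MaxCnt - 1) retries.
retry_wait_budget_ms() {
    local max_cnt="$1" hold_time_ms="$2" interval_time_ms="$3"
    echo $(( hold_time_ms + (max_cnt - 1) * interval_time_ms ))
}

# Values from the example above: MaxCnt:5, HoldTime:5000, IntervalTime:5000.
retry_wait_budget_ms 5 5000 5000   # prints 25000
```

Under these assumptions, the example setting waits 25 seconds in total before the fifth retry starts, compared with 5 seconds for the default values (MaxCnt:1, HoldTime:5000).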

Restrictions

  • This environment variable takes effect only when the HCCL retry feature is enabled at any level through HCCL_OP_RETRY_ENABLE.
  • If you call the HCCL C APIs to initialize a communicator with specific configurations and set the waiting time for the first retry using the hcclRetryParams parameter of HcclCommConfig, the communicator configuration takes precedence over this environment variable.

Applicability

Atlas A3 training products / Atlas A3 inference products