Configuring Operator-Level Online Recovery

This section describes how to configure operator-level online recovery. For details about its features, restrictions, supported products, and working principles, see Operator-Level Online Recovery.

Configuring Environment Variables

Before enabling operator-level online recovery, set HCCL_OP_RETRY_ENABLE and HCCL_OP_RETRY_PARAMS in the training startup script. For details about these environment variables, see CANN Environment Variable Reference. Configuration example:

# Whether to enable HCCL operator re-execution.
export HCCL_OP_RETRY_ENABLE="L0:0, L1:1, L2:1"
# Parameters for HCCL operator re-execution: the maximum number of re-executions (MaxCnt),
# the waiting time before the first re-execution (HoldTime), and the interval between two re-executions (IntervalTime).
export HCCL_OP_RETRY_PARAMS="MaxCnt:3, HoldTime:5000, IntervalTime:1000"
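
A typo in either variable is easy to miss, so it may help to sanity-check them in the startup script before the training command runs. The following is a minimal sketch, not part of CANN or HCCL: check_hccl_retry_env is a hypothetical helper name, and the check assumes that all three keys (MaxCnt, HoldTime, IntervalTime) are given explicitly with integer values, as in the example above. The CANN Environment Variable Reference remains the authority on which keys are optional and which values are valid.

```shell
#!/bin/sh
# Hypothetical helper (not provided by CANN or HCCL): checks that the two
# re-execution variables are set and look like the example configuration.
check_hccl_retry_env() {
    # Both variables must be set and non-empty.
    if [ -z "${HCCL_OP_RETRY_ENABLE:-}" ] || [ -z "${HCCL_OP_RETRY_PARAMS:-}" ]; then
        echo "HCCL_OP_RETRY_ENABLE and HCCL_OP_RETRY_PARAMS must both be set" >&2
        return 1
    fi
    # Assumption: every key from the example is present with an integer value.
    for key in MaxCnt HoldTime IntervalTime; do
        if ! printf '%s\n' "$HCCL_OP_RETRY_PARAMS" \
            | grep -Eq "(^|,) *${key}:[0-9]+ *(,|\$)"; then
            echo "HCCL_OP_RETRY_PARAMS: missing or non-numeric ${key}" >&2
            return 1
        fi
    done
    return 0
}

export HCCL_OP_RETRY_ENABLE="L0:0, L1:1, L2:1"
export HCCL_OP_RETRY_PARAMS="MaxCnt:3, HoldTime:5000, IntervalTime:1000"

# Report success only if the check passes; a real script might exit here instead.
check_hccl_retry_env && echo "retry settings look well-formed"
```

Placing the check immediately after the export lines means a malformed value is reported before any training process starts.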