Configuring Operator-Level Online Recovery
This section describes how to configure operator-level online recovery. For details about its features, restrictions, supported products, and working principles, see Operator-Level Online Recovery.
Configuring Environment Variables
Before enabling operator-level online recovery, configure the HCCL_OP_RETRY_ENABLE and HCCL_OP_RETRY_PARAMS environment variables in the training startup script. For details about these environment variables, see the CANN Environment Variable Reference. Configuration example:
# Whether to enable HCCL operator re-execution.
export HCCL_OP_RETRY_ENABLE="L0:0, L1:1, L2:1"
# Parameters for HCCL operator re-execution: the maximum number of re-executions (MaxCnt), the waiting time before the first re-execution (HoldTime), and the interval between two re-executions (IntervalTime).
export HCCL_OP_RETRY_PARAMS="MaxCnt:3, HoldTime:5000, IntervalTime:1000"
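Because both variables must be exported in the startup script to reach the training process, it can help to verify them before launching. The following is a minimal sketch, not part of the product: the check_retry_env helper and the commented-out launch command are hypothetical, and the values simply repeat the configuration example above.

```shell
#!/bin/sh
# Hypothetical excerpt of a training startup script.
# Values are copied from the configuration example; adjust them per the
# CANN Environment Variable Reference.
export HCCL_OP_RETRY_ENABLE="L0:0, L1:1, L2:1"
export HCCL_OP_RETRY_PARAMS="MaxCnt:3, HoldTime:5000, IntervalTime:1000"

# Hypothetical helper: returns non-zero if either variable is missing from
# the exported environment (printenv only sees exported variables).
check_retry_env() {
    for var in HCCL_OP_RETRY_ENABLE HCCL_OP_RETRY_PARAMS; do
        if [ -z "$(printenv "$var")" ]; then
            echo "error: $var is not set or not exported" >&2
            return 1
        fi
    done
    return 0
}

# Stop before training starts if the configuration is incomplete.
check_retry_env || exit 1

# python train.py   # hypothetical launch command, replace with your own
```

Running the check before the launch command surfaces a missing export immediately, rather than leaving the training job to start without the intended re-execution settings.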