Feature
Re-execution
Re-execution can be enabled for MC² operators by configuring a compilation macro and an environment variable. Once it is enabled, when the AI CPU detects an environment exception, it notifies the AI Core to re-deliver the communication tasks using the mechanism shown in the following figure. This prevents communication interruptions caused by intermittent hardware disconnections in the environment where the communication task runs, and improves communication stability.
Currently, the support for this capability is as follows:
The re-execution conditions are as follows:
- The output memory address of the MC² operator is different from the input memory address.
- The compilation macro AICORE_EXCEPTION_RESTART is configured during operator compilation. For details about the compilation macro configuration phase and method, see Supported Customization Options.
add_ops_compile_options(ALL OPTIONS -DAICORE_EXCEPTION_RESTART)
- The HCCL_OP_RETRY_ENABLE environment variable is configured to enable the detection and reporting capability used by HCCL re-execution. For details about the environment variable, see "Collective Communication" > "HCCL_OP_RETRY_ENABLE" in Environment Variables. Set the environment variable before operator execution as follows:
# Set L0 (within a server) and L1 (between servers) to 1. Re-execution is not supported across supernodes, so set L2 to 0.
export HCCL_OP_RETRY_ENABLE="L0:1, L1:1, L2:0"
Note that after re-execution is enabled, if communication is interrupted after the AI Core first delivers a communication task, re-execution is performed only once by default. To change the re-execution count or the retransmission interval, see "Collective Communication" > "HCCL_OP_RETRY_PARAMS" in Environment Variables.
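Because the environment variable must be set before the operator runs, a launch script may want to confirm the setting first. The following is a minimal sketch of such a check, not part of HCCL or the operator toolchain: the helper name hccl_retry_level is hypothetical, and the parsing assumes the comma-separated "level:value" format used in the example above.

```shell
# Illustrative pre-launch check. hccl_retry_level is a hypothetical helper,
# not an HCCL command; it only reads the variable set by the user.
export HCCL_OP_RETRY_ENABLE="L0:1, L1:1, L2:0"

# Print the value configured for one level (for example L0).
# Prints nothing if the level is absent or the variable is unset.
hccl_retry_level() {
    printf '%s\n' "${HCCL_OP_RETRY_ENABLE:-}" \
        | tr ',' '\n' \
        | tr -d ' ' \
        | awk -F: -v lvl="$1" '$1 == lvl { print $2 }'
}

# Warn if re-execution is not enabled at either supported level.
if [ "$(hccl_retry_level L0)" != "1" ] && [ "$(hccl_retry_level L1)" != "1" ]; then
    echo "warning: HCCL re-execution is not enabled at L0 or L1" >&2
fi
```

The check only inspects the shell environment. It does not verify that the operator was compiled with AICORE_EXCEPTION_RESTART or that the input and output memory addresses differ, so the other conditions still have to be confirmed separately.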