HCCL_OP_RETRY_ENABLE

Description

Enables or disables the retry feature of the HCCL operator. HCCL operator retry is based on the communicator. If an SDMA or RDMA CQE error is reported during the execution of a communication operator, HCCL attempts to retry the communication operator.

In a cluster environment, intermittent hardware disconnections can occur and cause a communication operator to report an error during execution. Enabling the HCCL retry feature through this environment variable keeps such intermittent disconnections from interrupting communication and improves communication stability. HCCL operator retry is a best-effort fault recovery mechanism at the software layer. The following figure shows the retry process.

Figure 1 Retry process

The workflow consists of the following three steps:

Fault detection: The AI CPU detects a fault signal and notifies the host to start the retry process.
Cluster management: The host exchanges information through the host socket and determines whether the faulty operator meets a series of retry conditions. For details, see Precautions for Using the Retry Feature.
Re-delivery: The AI CPU kernel is instructed to re-deliver the SQE and WQE to retry the HCCL operator.

Configuration

This environment variable enables or disables the retry feature separately for communicators at two physical layers: between servers (L1) and between supernodes (L2). Each layer supports two states: enabled and disabled.

The configuration method is as follows:

export HCCL_OP_RETRY_ENABLE="L1:0,L2:0"

L1 covers communication between servers in the communicator. 0 (default value) disables the retry feature for inter-server communication tasks, and 1 enables it.
L2 covers communication between supernodes in the communicator. 0 (default value) disables the retry feature for inter-supernode communication tasks, and 1 enables it.
If L2 is set to 1 and a device NIC is faulty during inter-supernode communication, the standby device NIC is used for communication during retry. This is called "link-failover communication". The standby NIC is the NIC of another die on the same NPU. For details about the conditions for normal link-failover communication and the impact of link-failover communication, see Precautions for Using Link-Failover Communication.
- If the communicator is created based on the rank table file, you need to configure the backup NIC using the backup_device_ip parameter in the rank table file.
- If the communicator is created based on root node information, the two dies on the same NPU automatically work as the backup NICs for each other and do not need to be manually configured.
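
As a minimal sketch of the configuration described above, the following enables retry at both layers before the training or inference process is launched. The helper `check_retry_enable` is hypothetical (it is not part of HCCL) and only checks the `L1:x,L2:x` form shown in this section; HCCL itself may accept other forms.

```shell
# Enable retry for inter-server (L1) and inter-supernode (L2) communication.
export HCCL_OP_RETRY_ENABLE="L1:1,L2:1"

# Hypothetical helper (not an HCCL tool): succeed only if the value has
# the "L1:<0|1>,L2:<0|1>" form used in this section.
check_retry_enable() {
    case "$1" in
        L1:[01],L2:[01]) return 0 ;;
        *) return 1 ;;
    esac
}

if check_retry_enable "$HCCL_OP_RETRY_ENABLE"; then
    echo "HCCL_OP_RETRY_ENABLE=$HCCL_OP_RETRY_ENABLE"
else
    echo "unexpected HCCL_OP_RETRY_ENABLE value: $HCCL_OP_RETRY_ENABLE" >&2
fi
```

Running the same check on every node before launch is one way to catch a value that differs between supernodes.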

In addition, you can use the environment variable HCCL_OP_RETRY_PARAMS to configure the waiting time for the first retry, maximum number of retries, and interval between two retries.

Recommended configuration:

Enabling the retry feature causes some performance loss. For the Atlas A3 training products / Atlas A3 inference products, servers and supernodes are connected through the optical interconnection domain, which is relatively unstable. Therefore, you are advised to enable the HCCL retry feature.
The configuration of this environment variable on each supernode must be the same. Otherwise, the link setup between supernodes will time out.

Precautions for Using the Retry Feature

When the HCCL retry feature is enabled, the following constraints must be met. Otherwise, the retry will fail.

The orchestration of the communication algorithm must be expanded on the AI CPU computing unit of the device. The retry feature takes effect only in AI CPU scheduling mode, which is set through HCCL_OP_EXPANSION_MODE. In any other scheduling mode, the process without retry is used.
```
export HCCL_OP_EXPANSION_MODE="AI_CPU"
```
In the scenario where the communicator is created based on the rank table, the host_ip field in the rank table must be configured. Otherwise, the retry feature does not take effect and the process without retry will be used.
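
The two constraints above can be checked before a job is launched. The sketch below is illustrative only: `precheck_retry` is a hypothetical helper, and it assumes that the rank table is a JSON file in which the field appears as the quoted key `"host_ip"`. Pass an empty argument when the communicator is not created from a rank table.

```shell
# Hypothetical pre-launch check; not an HCCL tool.
# $1: path to the rank table file, or empty if no rank table is used.
precheck_retry() {
    ranktable=$1
    # Retry only takes effect in AI CPU scheduling mode.
    if [ "${HCCL_OP_EXPANSION_MODE:-}" != "AI_CPU" ]; then
        echo "HCCL_OP_EXPANSION_MODE is not AI_CPU: the process without retry will be used"
        return 1
    fi
    # Assumption: the rank table is JSON and the field appears as "host_ip".
    if [ -n "$ranktable" ] && ! grep -q '"host_ip"' "$ranktable" 2>/dev/null; then
        echo "host_ip not found in $ranktable: the process without retry will be used"
        return 1
    fi
    echo "retry preconditions look satisfied"
}

export HCCL_OP_EXPANSION_MODE="AI_CPU"
precheck_retry ""
```

The check only looks for the presence of the field; it does not verify that the configured address is reachable.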
The input memory of the communication operator must not be at risk of being corrupted during execution.
A collective communication operator is a combination of a series of tasks. HCCL retry is based on the communication operator. The series of tasks of a communication operator are retried starting from the operator's input memory. If the input memory of the communication operator may be corrupted during execution, the retry may fail and the system reports an error and exits.
The following are scenarios where the input memory may be corrupted:
- Scenario where the zero-copy function is enabled: After the zero-copy function is enabled, the ReduceScatter and AllReduce operators modify the input memory of the user. Therefore, these two types of operators do not support retry.
- Scenario where the in-place operation is included: In this scenario, the input and output of the operator share the same memory, for example, the ReduceScatter/AllGather operator of PyTorch. Therefore, the scenario where the in-place operation is included does not support retry.
- Graph mode scenario: In graph mode, communication can be directly performed on the input and output of the operator. For example, the input parameter tensor of the AllReduce operator of PyTorch is used as the input and output of the operator. During the communication of the operator, the tensor content changes after part of the result is written. If the operator is executed again on the corrupted input, the computing result will be incorrect. Therefore, this scenario also does not support retry.

When the fault occurs, all ranks in the communicator must stop at the same communication operator. If different ranks stop at different communication operators, operator retry is not supported.

The time when a fault occurs is unpredictable. When a fault occurs, the status of each rank in the communicator is related to the retry success rate. The following figure shows a communicator that contains three ranks. Table 1 lists the retry status when a fault occurs at different time points.

Figure 2 Communicator fault diagram 1

**Table 1** Communicator fault retry status
| Fault Occurrence Time | Retry Supported | Retried Operator |
| --- | --- | --- |
| A | Yes | HCCL OP1. The compute operator cannot detect the link fault. When the communication operator HCCL OP1 is executed and detects the link fault, all the three ranks stop at HCCL OP1. In this case, the retry conditions are met and the retry is started. |
| B | Yes | HCCL OP1. Rank 0 and rank 2 continue to be executed until the communication operator HCCL OP1 is executed. Rank 1 also stops at HCCL OP1. In this case, the retry conditions are met and the retry is started. |
| C | Yes | HCCL OP1. |
| D | No | HCCL OP1 of rank 0 and rank 1 has been executed. When a fault occurs at time D, the execution continues to HCCL OP2, but rank 2 still stops at HCCL OP1. In this case, the retry conditions are not met. |
| E | Yes | HCCL OP3. All the three ranks continue to be executed and finally stop at HCCL OP3. In this case, the retry conditions are met and the retry is started. |
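
The rule behind Table 1 can be modelled in a few lines. This is only an illustration of the condition, not how HCCL negotiates it internally: `all_ranks_at_same_op` and `report` are hypothetical functions, and the arguments are the operators at which each rank stopped.

```shell
# Hypothetical model: succeed only if every rank stopped at the same operator.
all_ranks_at_same_op() {
    [ "$#" -gt 0 ] || return 1
    first=$1
    for op in "$@"; do
        [ "$op" = "$first" ] || return 1
    done
    return 0
}

# Print the outcome for one fault time; $1 is the label, the rest are
# the operators at which rank 0, rank 1, and rank 2 stopped.
report() {
    label=$1
    shift
    if all_ranks_at_same_op "$@"; then
        echo "time $label: retry $1"
    else
        echo "time $label: retry not supported"
    fi
}

report A OP1 OP1 OP1
report D OP2 OP2 OP1
report E OP3 OP3 OP3
```

Time D fails the check because rank 0 and rank 1 have moved on to HCCL OP2 while rank 2 is still at HCCL OP1.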

The following uses the common algorithm Recursive Halving-Doubling (RHD) of collective communication as an example to describe why collective communication cannot ensure that the execution stops at the same communication operator when a fault occurs.

Figure 3 Communicator fault diagram 2

Assume that there are four AI servers, each hosting one rank, forming a communicator with four ranks. If a fault occurs after the first step of data exchange in the RHD algorithm, the following situations may occur:

Rank 2 and rank 3 can run properly, but rank 0 and rank 1 cannot. The subsequent compute or communication operators of rank 2 and rank 3 may use any memory, and the corresponding context information cannot be found on rank 2 and rank 3 when the operators are executed again. Therefore, if the fault occurs at the time shown in the preceding figure, retry cannot be performed.

Check whether the socket network communication on the host is normal. During retry, the socket communication on the host is used to negotiate the status of each device in the communicator. If the socket network is faulty, retry cannot be performed.
Ensure that the faulty link is recovered. For example, route convergence is successful, the optical module is rectified from an intermittent disconnection, or the communication is restored by using the standby NIC. If the faulty link cannot be recovered, the communication task still fails to be executed. When the number of retries exceeds the maximum value (which is configured using HCCL_OP_RETRY_PARAMS), the operator fails to be retried.

If the debug log on the host contains the error information with the keyword "[OpRetry]...timeout", the socket communication on the host is abnormal during HCCL retry. In this case, you can collect logs of all nodes in the communicator to further locate the fault.
If the debug log on the host contains the error information with the keyword "can not retry", the HCCL retry conditions are not met in the current scenario.

The default path for storing debug logs generated by applications on the host is $HOME/ascend/log/debug/plog/.
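
A quick way to look for the two keywords above is to search the host debug logs. The following is a sketch under stated assumptions: `scan_retry_logs` is a hypothetical helper, and the `...` in the first keyword is treated as arbitrary text on the same line.

```shell
# Default host debug log directory; override PLOG_DIR if logs are stored elsewhere.
PLOG_DIR="${PLOG_DIR:-$HOME/ascend/log/debug/plog}"

# Hypothetical helper: report which retry-related keywords appear under a directory.
scan_retry_logs() {
    dir=$1
    # "[OpRetry]...timeout": host socket communication was abnormal during retry.
    if grep -rqE '\[OpRetry\].*timeout' "$dir" 2>/dev/null; then
        echo "OpRetry timeout found: collect logs from all nodes in the communicator"
    fi
    # "can not retry": the retry conditions were not met.
    if grep -rqF 'can not retry' "$dir" 2>/dev/null; then
        echo "can not retry found: retry conditions not met"
    fi
}

scan_retry_logs "$PLOG_DIR"
```

Replacing `-q` with `-n` in either `grep` call prints the matching lines with their file names and line numbers.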

Precautions for Using Link-Failover Communication

To ensure that the link-failover communication function can be properly executed, the following conditions must be met:
- The communication link of the standby NIC is normal.
- Both devices in active/standby mode must be visible to services.
  For example, if NPU 1 contains two dies, device 0 and device 1, which work in active/standby mode, and only device 0 is visible to services (specified by the environment variable ASCEND_RT_VISIBLE_DEVICES), the link failover function cannot be used.
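
The visibility condition can be checked with a short script. This sketch uses the device 0 and device 1 pair from the example above; `both_dies_visible` is a hypothetical helper, and it assumes that ASCEND_RT_VISIBLE_DEVICES is a comma-separated list of device IDs and that an unset or empty value leaves all devices visible.

```shell
# Make both dies of the NPU visible to the service.
export ASCEND_RT_VISIBLE_DEVICES="0,1"

# Hypothetical helper: succeed only if both device IDs are visible.
# Assumption: an unset or empty variable means every device is visible.
both_dies_visible() {
    visible=${ASCEND_RT_VISIBLE_DEVICES:-}
    if [ -z "$visible" ]; then
        return 0
    fi
    for id in "$1" "$2"; do
        case ",$visible," in
            *",$id,"*) ;;
            *) return 1 ;;
        esac
    done
    return 0
}

if both_dies_visible 0 1; then
    echo "device 0 and device 1 are both visible"
else
    echo "link failover cannot be used: one die is not visible"
fi
```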
If link failover occurs during communication (for example, the die 0 NIC of an NPU is faulty and the standby die 1 NIC takes over), the traffic of the original die 0 NIC is also sent and received through the die 1 NIC. As a result, traffic on the die 1 NIC increases, and overall performance deteriorates because the physical bandwidth is halved and port conflicts occur.
When link-failover communication is enabled and the die 0 NIC of NPU 0 is faulty, the standby die 1 NIC is used. Because communication between two NPUs requires the local and peer ends to switch to their standby NICs at the same time, NPU 1 also switches from die 0 to die 1, as shown in the part labeled "Figure 2" of Figure 4. However, if there is a communication task between die 0 and die 1, the link failover function cannot be used.
Figure 4 Example of link-failover communication switchover
When the link-failover communication function is enabled, it is recommended that two dies of an NPU be allocated to the same training or inference task.
If two dies of the same NPU are allocated to two different training or inference tasks and one task is faulty, the NIC of the other task is used. As a result, the performance of both tasks deteriorates to some extent.
On the same NPU, link failover can occur only once, and switchback is not supported.
In Figure 5, the communication link between NPU 0 and NPU 1 becomes faulty in the part labeled "Figure 1". Because link failover is enabled, the standby link is used and communication remains normal. If a fault occurs again, as in the part labeled "Figure 2", link failover is no longer available, an error is reported, and the system exits.
Figure 5 Example of link failover in the same NPU

Troubleshooting

If the "[OpRetryConnection][RecvAckTag] Recv unmatched ack" error occurs after the retry feature is enabled, the default port used for HCCL communication may be occupied. As a result, HCCL connects to an incorrect server. To solve this problem, perform the following steps:

Run the sysctl -w net.ipv4.ip_local_reserved_ports command to reserve the default ports 60000 to 60015 used by HCCL so that the operating system does not randomly allocate them to other processes.
```
sysctl -w net.ipv4.ip_local_reserved_ports=60000-60015
```

If the error persists, change the default port used by HCCL through HCCL_IF_BASE_PORT and run the sysctl -w net.ipv4.ip_local_reserved_ports command to reserve the specified ports.

# Make HCCL use 16 consecutive ports starting from port 17777.
export HCCL_IF_BASE_PORT=17777
# Reserve ports 17777 to 17792.
sysctl -w net.ipv4.ip_local_reserved_ports=17777-17792
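
The range to reserve follows from the base port: 16 consecutive ports means the last port is the base port plus 15. The sketch below computes the range for a given base; `hccl_port_range` is a hypothetical helper, and the script only prints the `sysctl` command rather than running it, because changing the setting requires root privileges.

```shell
# Hypothetical helper: print the 16-port range that starts at the given base port.
hccl_port_range() {
    base=$1
    last=$((base + 15))
    echo "${base}-${last}"
}

export HCCL_IF_BASE_PORT=17777
range=$(hccl_port_range "$HCCL_IF_BASE_PORT")

# Print the command to run as root.
echo "sysctl -w net.ipv4.ip_local_reserved_ports=${range}"
```

Note that `sysctl -w` replaces the whole reserved list, so any ranges that are already reserved need to be included in the new value.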

Other Constraints

If you call the HCCL C APIs to initialize a communicator with specific configurations and use the hcclRetryEnable parameter of HcclCommConfig to specify whether to enable the HCCL operator retry feature, the communicator-level configuration takes precedence over this environment variable.

Relationship Between Retry and Overall Network Performance

For details, see Relationship Between Communication Operator Retry and Overall Network Performance.

Applicability

Atlas A3 training products / Atlas A3 inference products
