EI0002 Communication Operation Timeout

Symptom

The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [%]. base information: %s task information: %s

Possible Cause

1. An exception occurs during execution on one or more NPUs in the cluster, causing the collective communication operation to fail.

2. One or more NPUs in the cluster run too slowly to complete a communication operation within the timeout interval [%d seconds]. (You can set the interval with the HCCL_EXEC_TIMEOUT environment variable.)

3. The number of training samples is inconsistent across NPUs.

4. Packet loss or other connectivity problems occur on the communication link.
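
Cause 3 can be illustrated with a minimal sketch. It assumes that each rank processes its own shard of the dataset and issues one collective operation per batch; the sharding scheme and all function names here are hypothetical and not part of HCCL. Under these assumptions, uneven shards give different step counts, so a rank with an extra step waits for peers that have already finished.

```python
import math


def shard_sizes(num_samples: int, world_size: int) -> list[int]:
    """Split num_samples across ranks without padding (hypothetical scheme)."""
    base, extra = divmod(num_samples, world_size)
    # The first `extra` ranks each receive one additional sample.
    return [base + (1 if rank < extra else 0) for rank in range(world_size)]


def steps_per_rank(sizes: list[int], batch_size: int) -> list[int]:
    """Number of batches, and thus collective calls, each rank would issue."""
    return [math.ceil(n / batch_size) for n in sizes]


def is_consistent(steps: list[int]) -> bool:
    """True when every rank issues the same number of steps."""
    return len(set(steps)) <= 1


# Illustrative values only: 1001 samples do not divide evenly over 8 ranks.
sizes = shard_sizes(num_samples=1001, world_size=8)
steps = steps_per_rank(sizes, batch_size=125)
print(sizes, steps, is_consistent(steps))
```

In this example rank 0 receives one extra sample and therefore runs one more step than the other ranks. Padding or truncating the dataset so that it divides evenly removes the mismatch.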

Solution

1. If this error is reported on only some ranks, check whether the other ranks reported a different error earlier.

2. If this error is reported on all ranks, check whether the error reporting times are consistent (the maximum difference must not exceed 600s). If they are not, locate the cause or adjust the threshold with the HCCL_EXEC_TIMEOUT environment variable.

3. Check whether the plog contains a completion queue element (CQE) error. If so, check the network connection status. (For details, see the TLS command and HCCN connectivity check examples.)

4. Ensure that every NPU processes the same number of training samples.
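
The time comparison in step 2 can be sketched as follows. This is a minimal example that assumes the error reporting time of each rank has already been extracted from its log as a string; the timestamp format, the sample values, and the function names are illustrative assumptions, not the actual plog format.

```python
from datetime import datetime

MAX_SPREAD_SECONDS = 600  # maximum allowed difference stated in step 2


def report_time_spread(report_times: dict[int, str],
                       fmt: str = "%Y-%m-%d %H:%M:%S") -> float:
    """Seconds between the earliest and latest error report across ranks."""
    parsed = [datetime.strptime(t, fmt) for t in report_times.values()]
    return (max(parsed) - min(parsed)).total_seconds()


def is_time_consistent(report_times: dict[int, str]) -> bool:
    """True when all ranks reported the error within the allowed spread."""
    return report_time_spread(report_times) <= MAX_SPREAD_SECONDS


# Made-up timestamps for three ranks, keyed by rank ID.
times = {
    0: "2024-01-01 10:00:00",
    1: "2024-01-01 10:03:20",
    2: "2024-01-01 10:12:30",
}
print(report_time_spread(times), is_time_consistent(times))
```

Here the spread is larger than 600s, which suggests that rank 2 lagged behind the others; that rank would be the first place to look before raising HCCL_EXEC_TIMEOUT.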