Analyzing Cluster Data to Determine Whether Communication Retransmission Occurs
In distributed training, if a communication operator takes more than 4 seconds (the typical threshold for communication retransmission) and fast and slow card issues have been ruled out, packet retransmission is a likely cause. The two cases are difficult to distinguish from the performance data of a single card, so use MindStudio Insight to locate the fault at the cluster level.
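The 4-second check described above can be sketched as a simple filter over operator durations. This is a minimal illustration, not part of MindStudio Insight: the function name, the operator names, and the duration values are all hypothetical, and the durations are assumed to have been exported from the profile data in seconds.

```python
# Typical retransmission threshold quoted in the text (seconds).
RETRANSMISSION_THRESHOLD_S = 4.0


def suspect_retransmission(op_durations_s, threshold_s=RETRANSMISSION_THRESHOLD_S):
    """Return the names of communication operators that exceed the threshold.

    op_durations_s maps an operator name to its duration in seconds.
    A hit is only a hint: fast and slow card issues must be ruled out first.
    """
    return [name for name, dur in op_durations_s.items() if dur > threshold_s]


# Hypothetical durations, for illustration only.
durations = {
    "comm_op_allreduce": 0.012,
    "comm_op_send": 4.73,
    "comm_op_receive": 0.020,
}
print(suspect_retransmission(durations))
```

Only operators strictly above the threshold are reported, so a value of exactly 4 seconds is not flagged in this sketch.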
Communication retransmission typically shows up as performance jitter in cluster training. To confirm the problem, use MindStudio Insight to compare the profile data of a normal step with that of an abnormal step.
- On the MindStudio Insight cluster summary page, you can view the time consumption distribution of different steps, as shown in Figure 1.
- Compare the overview results of two steps. The computing time and free time of each card are close, indicating that there is no obvious fast and slow card problem. The main difference lies in the communication time (overlapped communication time + non-overlapped communication time).
- Compare all cards at the same time. For every card, the communication time differs by about 4.7 seconds between steps 11 and 12.
- As shown in Figure 2, the operator LinearWithGradAccumulationAndAsyncCommunication is used for synchronization. The subsequent task (AscendCL@aclnnTopK) is blocked during delivery.
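The comparison in the steps above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's own logic: the `CardStepSummary` type, the `compare_steps` helper, the 0.5-second tolerance for "close" computing and free time, and all sample values are hypothetical. Applying the 4-second threshold to the per-card communication time difference is also a simplification of the per-operator rule.

```python
from dataclasses import dataclass


@dataclass
class CardStepSummary:
    """Per-card time breakdown for one step, in seconds (hypothetical layout)."""
    compute_s: float
    free_s: float
    comm_overlapped_s: float
    comm_not_overlapped_s: float

    @property
    def comm_s(self):
        # Communication time = overlapped + non-overlapped communication time.
        return self.comm_overlapped_s + self.comm_not_overlapped_s


def compare_steps(normal, abnormal, tolerance_s=0.5, retrans_threshold_s=4.0):
    """Compare a normal and an abnormal step for every card present in both.

    A card is flagged when its computing and free time are close in both
    steps (within tolerance_s, an assumed value) while its communication
    time grew by more than retrans_threshold_s.
    """
    report = {}
    for card in sorted(normal.keys() & abnormal.keys()):
        n, a = normal[card], abnormal[card]
        stable = (abs(a.compute_s - n.compute_s) <= tolerance_s
                  and abs(a.free_s - n.free_s) <= tolerance_s)
        comm_delta = a.comm_s - n.comm_s
        report[card] = {
            "comm_delta_s": round(comm_delta, 3),
            "suspect_retransmission": stable and comm_delta > retrans_threshold_s,
        }
    return report


# Hypothetical values for two cards in a normal and an abnormal step.
normal_step = {
    0: CardStepSummary(5.00, 0.30, 1.0, 0.5),
    1: CardStepSummary(5.01, 0.29, 1.0, 0.5),
}
abnormal_step = {
    0: CardStepSummary(5.02, 0.31, 1.0, 5.2),
    1: CardStepSummary(5.00, 0.30, 1.0, 5.2),
}
for card, result in compare_steps(normal_step, abnormal_step).items():
    print(card, result)
```

If computing or free time also differs noticeably between the two steps, the card is not flagged, because a fast and slow card problem has not been excluded.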
According to the data analysis results, the NPU (computing) and CPU (scheduling) operate normally. The main difference in communication time between cards is caused by point-to-point communication timeouts, which are likely due to packet retransmissions triggered by network exceptions.
- After communication retransmissions are detected, check whether the switch network is correctly configured. For example, verify whether Priority Flow Control (PFC) is enabled. If PFC is not enabled, network congestion can cause packet loss, which leads to communication retransmissions.

