Overview

A communication issue typically shows up as cluster communication time that significantly exceeds the computing time, as shown in Figure 1. Alternatively, individual communication operators may run abnormally long. For example, in Figure 2, the reduceScatter operators marked ① and ② take much longer than the work on the computing stream.

Figure 1 Communication issue 1: longer communication time than computing
Figure 2 Typical communication issue 2: long-duration communication operators

Note that long communication time can be caused either by slow data transmission or by cards waiting for slower cards in the cluster (referred to here as fast and slow cards).

To check whether the issue is caused by transmission or by fast and slow cards, perform the following operations:

  • Go to the Communication Duration Analysis tab page on MindStudio Insight. If the Transmit Time proportion of a card is high, the problem lies in data transmission. If the Synchronization Time proportion is high, the cluster contains fast and slow cards. For details, see Communication.
    Figure 3 Communication duration analysis
  • Compare the computing, communication, and free time of multiple cards on the Summary tab page to check whether the problem is caused by fast and slow cards. For details, see Summary. Figure 4 shows a typical example. If the free time of each card is negatively correlated with its communication time (that is, cards with longer free time have shorter communication time, and cards with shorter free time have longer communication time), the cluster very likely contains fast and slow cards caused by delivery performance fluctuation. Likewise, a negative correlation between computing time and communication time also points to fast and slow cards.
    Figure 4 Fast and slow card issue
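
The negative-correlation check described for the Summary tab page can also be done numerically once the per-card times have been copied out of the tool. The following is a minimal sketch, not a MindStudio Insight API: the function names, the sample values, and the -0.8 cutoff are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # No variation across cards: treat as "no correlation".
        return 0.0
    return cov / sqrt(var_x * var_y)


def looks_like_fast_slow_cards(free_ms, comm_ms, threshold=-0.8):
    """Return True if free time and communication time are strongly
    negatively correlated across cards.

    threshold is an assumed cutoff, not a value taken from the tool.
    """
    return pearson(free_ms, comm_ms) <= threshold


# Hypothetical per-card values (ms) read manually from the Summary tab page.
free_ms = [120, 40, 35, 110]
comm_ms = [300, 380, 390, 310]
print(looks_like_fast_slow_cards(free_ms, comm_ms))
```

The same function can be applied to computing time and communication time. A strongly negative coefficient is only a hint; confirm it against the Communication Duration Analysis tab page before drawing conclusions.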

If there are fast and slow cards, locate the cause by referring to Fast and Slow Card Troubleshooting.

If the issue is not caused by fast and slow cards, investigate the communication itself. The possible causes are Small Communication Packets, Communication Retransmission, Byte Misalignment Between the Source and Destination Addresses, Profiling overhead, or Computing and Communication Bandwidth Contention.
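
The triage in this section can be summarized as a small decision function. This is only a sketch under stated assumptions: the function name and return labels are hypothetical, the inputs are values read manually from the Communication Duration Analysis tab page, and the 0.5 cutoff for a "high" proportion is an assumption because this section does not give a numeric threshold.

```python
def classify_communication(transmit_ms, sync_ms, total_ms, high=0.5):
    """Classify one card's communication time.

    transmit_ms, sync_ms: Transmit Time and Synchronization Time of the card.
    total_ms: total communication time of the card.
    high: assumed cutoff for a "high" proportion (not defined by the tool).
    """
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    transmit_share = transmit_ms / total_ms
    sync_share = sync_ms / total_ms
    if transmit_share >= high and transmit_share >= sync_share:
        # Transmission dominates: check packet size, retransmission,
        # address alignment, profiling overhead, bandwidth contention.
        return "transmission"
    if sync_share >= high:
        # Synchronization dominates: the card is waiting for slower cards.
        return "fast_slow_cards"
    return "inconclusive"


# Hypothetical readings for three cards (ms).
print(classify_communication(80, 10, 100))
print(classify_communication(10, 85, 100))
print(classify_communication(30, 30, 100))
```

An "inconclusive" result means neither proportion stands out, in which case the Summary tab page comparison is the next place to look.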