Communication

The Communication tab page breaks down communication metrics by communication domain. If the Summary tab page shows that communication takes too long, use the Communication tab page to check for slow cards or slow links. You can switch between the Communication Matrix and Communication Duration Analysis views, as shown in the red box in Figure 1.

Figure 1 Switching between the Communication Matrix and Communication Duration Analysis views

Procedure

  1. On the Communication page, select Communication Duration Analysis to view the Visualized Communication Duration chart, and check whether Transmit Time accounts for a large proportion of the communication time, as shown in Figure 2.
    If Transmit Time dominates, the likely cause is a slow link. If Synchronization Time dominates, the likely cause is a slow card.
    Figure 2 Visualized Communication Duration
  2. If the Transmit Time proportion is too high, select Communication Matrix and check whether the transmission bandwidth is far below the empirical bandwidth, as shown in Figure 3. If the transmitted data volume is sufficient but the bandwidth is still below the expected level, tuning is required. Common causes of slow links include communication retransmission, small communication packets, and data packets that are not byte-aligned. For relevant cases, see Communication Tuning Solutions.
    Figure 3 Communication matrix
  3. If the Transmit Time proportion is low but the Wait Time or Synchronization Time proportion is high, some cards are running faster than others (a fast and slow card issue). In this case, select Communication Duration Analysis and view the Communication operator thumbnail to identify the slow cards, as shown in Figure 4. A slow card reaches the collective operation late, so it spends little time waiting and shows a short duration, whereas a fast card arrives early and waits for the others. For the hcom allGather collective communication operators shown in green, cards 4, 5, and 13 have short durations and are therefore the slow cards, and cards with long durations (such as cards 11 and 14) are relatively fast cards. Next, analyze what the slow cards are doing while the fast cards wait. Go to the Timeline page to compare the cards. For details, see Fast and Slow Card Locating Case on the Timeline.
    Figure 4 Locating the slow cards
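
The triage logic in the procedure above can be sketched in Python. This is a minimal illustration, not part of the tool: the CardComm record, the function names, the 0.5 thresholds, and the assumption that the communication time is the sum of Transmit Time, Wait Time, and Synchronization Time are all hypothetical. Substitute values read from your own profiling data and an empirical bandwidth that matches your hardware.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class CardComm:
    """Hypothetical per-card record of communication time components (ms)."""
    card: int
    transmit_ms: float  # Transmit Time
    wait_ms: float      # Wait Time
    sync_ms: float      # Synchronization Time

    @property
    def total_ms(self) -> float:
        # Assumption: the three components add up to the communication time.
        return self.transmit_ms + self.wait_ms + self.sync_ms


def classify(c: CardComm, high_ratio: float = 0.5) -> str:
    """Steps 1 to 3: decide which branch of the procedure to follow."""
    total = c.total_ms
    if total <= 0:
        return "no communication"
    if c.transmit_ms / total >= high_ratio:
        return "suspect slow link"        # go to Communication Matrix
    if (c.wait_ms + c.sync_ms) / total >= high_ratio:
        return "suspect fast/slow cards"  # inspect the operator thumbnail
    return "no dominant component"


def bandwidth_gb_per_s(size_mb: float, transmit_ms: float) -> float:
    """Step 2: achieved bandwidth, to compare with the empirical bandwidth."""
    return (size_mb / 1024) / (transmit_ms / 1000)


def slow_cards(durations_ms: dict[int, float], factor: float = 0.5) -> list[int]:
    """Step 3: in one collective operator, slow cards show SHORT durations.

    A card is flagged when its duration is below `factor` times the median
    duration of all cards in the same communication domain.
    """
    med = median(durations_ms.values())
    return sorted(card for card, d in durations_ms.items() if d < factor * med)


if __name__ == "__main__":
    print(classify(CardComm(card=0, transmit_ms=80, wait_ms=10, sync_ms=10)))
    print(classify(CardComm(card=1, transmit_ms=10, wait_ms=50, sync_ms=40)))
    print(bandwidth_gb_per_s(size_mb=1024, transmit_ms=100))
    # Made-up allGather durations that mirror the pattern described in step 3.
    print(slow_cards({0: 18, 1: 19, 4: 2, 5: 2.5, 11: 20, 13: 3, 14: 22}))
```

The median-based rule in slow_cards is only one possible heuristic. In practice, the thumbnail in Figure 4 lets you spot the short-duration cards visually, and the Timeline page is still needed to find out why those cards are late.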

Switching Between Communication and Timeline Pages

  • You can switch between the Communication and Timeline pages based on communication operators, as shown in Figure 5 and Figure 6.

    If you find an abnormal card or communication operator on the Communication page, you can switch to the Timeline page to further locate the root cause. Conversely, if you find a long-running communication operator on the Timeline page, check the Communication page for slow cards that are causing delays in the same communication domain.

    Figure 5 Switching from the Communication page to the Timeline page by operators
    Figure 6 Switching from the Timeline page to the Communication page by operators