A Typical Case of Communication Retransmission

Symptom

When the LLaMA3-70B model is migrated from a 4-node cluster to a 32-node cluster, the linearity (how closely throughput scales with the number of nodes) deteriorates.
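As a rough illustration of what "linearity" measures here, the sketch below computes the ratio of measured throughput to the throughput expected under ideal linear scaling. The function name and the throughput figures are hypothetical and are not taken from this case.

```python
def linearity(base_nodes, base_throughput, scaled_nodes, scaled_throughput):
    """Return measured throughput divided by ideal linear-scaling throughput.

    A value of 1.0 means perfect scaling; lower values mean the larger
    cluster delivers less than its proportional share.
    """
    if base_nodes <= 0 or scaled_nodes <= 0 or base_throughput <= 0:
        raise ValueError("node counts and base throughput must be positive")
    # Throughput the larger cluster would reach if scaling were perfectly linear.
    ideal = base_throughput * scaled_nodes / base_nodes
    return scaled_throughput / ideal


# Hypothetical throughput values (for example, tokens per second).
ratio = linearity(base_nodes=4, base_throughput=1000.0,
                  scaled_nodes=32, scaled_throughput=6400.0)
print(f"linearity: {ratio:.2f}")
```

A ratio well below 1.0 after scaling out is the symptom that prompts the analysis in the next section.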

Analysis

Use the tool described in Quick Analysis for Model Tuning (msprof-analyze CLI) to analyze the normal 4-node cluster and the abnormal 32-node cluster. The comparison shows that the time difference comes from communication: overall communication time is longer on the 32-node cluster, as shown in Figure 1.
Figure 1 cluster_step_trace_time.csv deliverable (comparison between a normal 4-node cluster and an abnormal 32-node cluster)
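The comparison of the two deliverables can also be scripted. The following is a minimal sketch that averages selected columns over the per-rank rows of each file and reports the difference. The column names (Type, Computing, Communication(Not Overlapped), Free) are assumed to match the cluster_step_trace_time.csv layout and should be checked against the actual file; the inline sample values are invented for illustration.

```python
import csv
import io
from statistics import mean

# Assumed column names; verify them against the real deliverable.
COLUMNS = ["Computing", "Communication(Not Overlapped)", "Free"]

# Invented sample data standing in for the two CSV files.
NORMAL_CSV = """
Step,Type,Index,Computing,Communication(Not Overlapped),Free
1,rank,0,900.0,100.0,20.0
1,rank,1,905.0,98.0,21.0
"""
ABNORMAL_CSV = """
Step,Type,Index,Computing,Communication(Not Overlapped),Free
1,rank,0,902.0,400.0,22.0
1,rank,1,903.0,410.0,20.0
"""


def mean_per_column(csv_text, columns, row_type="rank"):
    """Average each requested column over the rows of the given Type."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    rows = [row for row in reader if row["Type"] == row_type]
    return {col: mean(float(row[col]) for row in rows) for col in columns}


def compare_clusters(normal_csv, abnormal_csv, columns):
    """Return abnormal-minus-normal mean time for each column."""
    normal = mean_per_column(normal_csv, columns)
    abnormal = mean_per_column(abnormal_csv, columns)
    return {col: abnormal[col] - normal[col] for col in columns}


deltas = compare_clusters(NORMAL_CSV, ABNORMAL_CSV, COLUMNS)
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:+.1f}")
```

With real files, replace the inline strings with the file contents. The column with the largest positive difference indicates where the extra time is spent.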
Take card 0 as an example and compare its timeline in the normal cluster with that in the abnormal cluster. The problem occurs in the allReduce and broadcast operators at the end of the iteration, as shown in Figure 2.
Figure 2 Timeline of card 0 in the abnormal cluster
In the selection details of the communication operator, check the dst rank (the target card, which generally indicates the slow card in a Notify Wait communication event). Jump to that card's timeline and repeat the check until the chain ends at the card that causes the delay. In this case, card 440 is found to slow down the other cards in the same TP communication domain, as shown in Figure 3.
Figure 3 Timeline of card 440 in the abnormal cluster
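The repeated jump from a waiting card to its dst rank amounts to following a chain until it stops. The sketch below shows that idea on a hand-built mapping. The mapping, the intermediate rank 8, and the function name are hypothetical; in practice each dst rank is read from the Notify Wait event details in the timeline.

```python
def trace_to_slow_rank(start_rank, dst_rank_of):
    """Follow dst rank links from start_rank until the chain ends.

    dst_rank_of maps a waiting rank to the dst rank shown in its
    Notify Wait event. Returns the last rank reached and the path taken.
    """
    path = [start_rank]
    visited = {start_rank}
    current = start_rank
    while current in dst_rank_of:
        nxt = dst_rank_of[current]
        if nxt in visited:
            # A cycle or self-reference: stop instead of looping forever.
            break
        path.append(nxt)
        visited.add(nxt)
        current = nxt
    return current, path


# Hypothetical chain: card 0 waits on card 8, which waits on card 440.
waits = {0: 8, 8: 440}
slow_rank, hops = trace_to_slow_rank(0, waits)
print(slow_rank, hops)
```

The rank at the end of the chain is only a candidate. Confirm it with the communication duration and bandwidth analysis that follow.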
On the Communication tab page, select Communication Duration Analysis and find the corresponding communication domain. Sort the cards by Wait Time in ascending order and look for cards that have a short wait time but a long transmission time. These are the cards that the rest of the communication domain is waiting for. See Figure 4.
Figure 4 Communication duration analysis on the Communication page
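The same filtering can be expressed in a few lines. The sketch below sorts per-card records by wait time and keeps the cards whose wait time is below the domain median while their transmission time is well above it. The field names, the sample values, and the factor of 2 are assumptions chosen for illustration, not values from the tool.

```python
from statistics import median


def find_suspect_ranks(records, transit_factor=2.0):
    """Return ranks with short wait time and long transmission time.

    records: list of dicts with keys "rank", "wait_ms", "transit_ms".
    transit_factor: how many times the median transmission time counts
    as "long" (a hypothetical threshold).
    """
    wait_median = median(r["wait_ms"] for r in records)
    transit_median = median(r["transit_ms"] for r in records)
    # Ascending by wait time, mirroring the sort order used in the UI.
    ordered = sorted(records, key=lambda r: r["wait_ms"])
    return [
        r["rank"]
        for r in ordered
        if r["wait_ms"] < wait_median
        and r["transit_ms"] > transit_factor * transit_median
    ]


# Invented values for four cards in one communication domain.
domain = [
    {"rank": 440, "wait_ms": 2.0, "transit_ms": 90.0},
    {"rank": 441, "wait_ms": 85.0, "transit_ms": 5.0},
    {"rank": 442, "wait_ms": 86.0, "transit_ms": 4.0},
    {"rank": 443, "wait_ms": 84.0, "transit_ms": 6.0},
]
suspects = find_suspect_ranks(domain)
print(suspects)
```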
Check Bandwidth Analysis for the cards with long transmission time. Figure 5 shows a large number of RDMA communication packets transmitted at extremely low bandwidth, which suggests that the network transmission may be faulty. Check the network configuration.
Figure 5 Bandwidth analysis
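To make "extremely low bandwidth" concrete, the following sketch derives bandwidth from transfer size and duration and lists the RDMA transfers that fall below a threshold. The record fields, the sample values, and the 1 GB/s threshold are hypothetical; choose a threshold based on the expected link rate of the actual network.

```python
def bandwidth_gb_per_s(size_mb, time_ms):
    """Bandwidth in GB/s; MB per ms equals GB per s in decimal units."""
    if time_ms <= 0:
        raise ValueError("time_ms must be positive")
    return size_mb / time_ms


def low_bandwidth_rdma(records, min_gb_per_s):
    """Return the RDMA records whose bandwidth is below min_gb_per_s."""
    return [
        r for r in records
        if r["transport"] == "RDMA"
        and bandwidth_gb_per_s(r["size_mb"], r["time_ms"]) < min_gb_per_s
    ]


# Invented transfer records for one card.
transfers = [
    {"transport": "RDMA", "size_mb": 64.0, "time_ms": 800.0},  # 0.08 GB/s
    {"transport": "RDMA", "size_mb": 64.0, "time_ms": 4.0},    # 16 GB/s
    {"transport": "SDMA", "size_mb": 64.0, "time_ms": 900.0},  # not RDMA
]
slow = low_bandwidth_rdma(transfers, min_gb_per_s=1.0)
print(len(slow), "slow RDMA transfer(s)")
```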

Troubleshooting

Analysis of the network configuration reveals that traffic between the switch and the compute node servers passes through a congestion control queue on which PFC (priority-based flow control) is not enabled, which causes heavy packet loss and, in turn, RDMA packet retransmissions. Setting the related environment variables correctly eliminates the problem.
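This case does not name the environment variables involved. As an assumption, the sketch below uses HCCL_RDMA_TC (RDMA traffic class) and HCCL_RDMA_SL (RDMA service level), which are commonly used to steer HCCL RDMA traffic into the switch queue that has PFC enabled. Confirm the variable names against the HCCL documentation for the installed version. The DSCP value 33 and service level 4 are placeholders; the real values must match the switch and NIC configuration.

```python
def rdma_qos_env(dscp, service_level):
    """Build environment settings for RDMA QoS (variable names assumed).

    The traffic class byte carries the 6-bit DSCP in its upper bits,
    so traffic class = DSCP << 2.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("dscp must be in the range 0..63")
    if not 0 <= service_level <= 7:
        raise ValueError("service_level must be in the range 0..7")
    traffic_class = dscp << 2
    return {
        "HCCL_RDMA_TC": str(traffic_class),  # assumed variable name
        "HCCL_RDMA_SL": str(service_level),  # assumed variable name
    }


# Placeholder values; use the ones configured on the PFC-enabled queue.
env = rdma_qos_env(dscp=33, service_level=4)
for key, value in env.items():
    print(f"export {key}={value}")
```

The printed lines can be added to the training launch script on every node so that all cards use the same queue.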