Summary
The Summary page provides the following functions: parallel strategy analysis, pipeline parallel analysis, comparison of multi-card computing, communication, and scheduling, and MoE expert load balancing analysis.
Preliminary Demarcation
Compare the computing, communication, and scheduling time across cards to determine whether any component accounts for an unusually high proportion. Also check for serious desynchronization between cards or large fluctuations in communication time between cards, either of which may indicate fast and slow cards, as shown in Figure 1.
The common operations are as follows:
- Configure a correct parallel strategy and make sure the parameter settings are consistent with those used in model training and inference. For detailed parallel parameter configurations, consult the model development personnel.
- If the number of devices is small, it is recommended to use the full-dimension view (that is, DP + PP + TP).
- Select the required performance metrics to generate a heat map for quick horizontal comparison. During fast and slow card analysis, focus on the communication time in specific parallel domains.
- View the parallel strategy layout diagram. The heat map rendering effect enables efficient horizontal comparison.
- If the parallel strategy is correctly configured, slow-card advice is provided.
- On the Computation/Communication Overview page in the lower part of the page, view the comparison of the computing, communication, and scheduling time of each card to preliminarily determine whether there are issues related to computing, communication, delivery, and slow cards.
Typical Cases
- Typical case 1: As shown in Figure 2, the communication time between cards fluctuates significantly and the cards are not synchronized. The proportions of computing and free (delivery) time are inversely related to communication time. The card with a low communication time proportion but high computing and free (delivery) time proportions is identified as a slow card. This indicates that the cluster has a fast- and slow-card issue. For details about how to further locate the fast and slow card issues, see Fast and Slow Card Locating Case on the Timeline.
- Typical case 2: As shown in Figure 3, the free time proportion is high, indicating that the cluster has a delivery bottleneck. In this case, locate and optimize the cluster by referring to Host Bound Troubleshooting. The communication time proportion is also high, and the communication time of each card fluctuates. In this case, locate and optimize the cluster by referring to Communication Tuning Solutions.
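The slow-card pattern in typical case 1 can be approximated in a few lines. A slow card tends to show a low communication proportion because the faster cards spend the waiting time inside their communication operations. The following is a minimal sketch, not the tool's own algorithm: the function name `find_slow_cards`, the input layout, and the 20% margin are assumptions chosen for illustration.

```python
from statistics import median


def find_slow_cards(times, rel_margin=0.2):
    """Flag cards whose communication share is unusually low while their
    computing + free (delivery) share is above the median.

    times: {card_id: (computing, communication, free)} in one consistent unit.
    rel_margin: hypothetical threshold; how far below the median communication
                share a card must fall before it is flagged.
    """
    shares = {}
    for card, (computing, communication, free) in times.items():
        total = computing + communication + free
        shares[card] = (communication / total, (computing + free) / total)

    median_comm = median(s[0] for s in shares.values())
    median_busy = median(s[1] for s in shares.values())

    return sorted(
        card
        for card, (comm_share, busy_share) in shares.items()
        if comm_share < median_comm * (1 - rel_margin) and busy_share > median_busy
    )


# Illustrative numbers only: card 3 computes longer and communicates less.
sample = {card: (50.0, 40.0, 10.0) for card in range(8)}
sample[3] = (70.0, 10.0, 20.0)
slow_cards = find_slow_cards(sample)
```

A card flagged this way is only a candidate. Confirm it on the Timeline as described in the fast and slow card locating case.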
If there are a large number of cards, the full data set is too large to view and analyze conveniently, as shown in Figure 4. Split the data appropriately to make the direction of analysis clearer.
Simplification method 1: Click a communication domain connection line (① in the figure) to view that communication domain on its own. The overview broken down by communication domain is displayed. Clicking a box has the same effect. Each line represents a communication domain, and each box represents a parallel group.
Figure 5 Computation/Communication overview of a communication domain
Simplification method 2: Use the folded view and locate the fault from the overall picture down to the details. Take a 512-card cluster whose parallel strategy is DP8, PP8, and TP8 as an example. The full view (that is, the DP + PP + TP dimension) contains 512 cards. After the TP dimension is folded (DP + PP), every eight TP-domain cards are combined into one node, resulting in 64 folded nodes. You can first select the DP + PP dimension to identify the slow group, and then use the full DP + PP + TP view to pinpoint the slow card.
- In the DP + PP dimension, select DP-Communication for Performance Metric. As shown in Figure 6, there are slow groups whose DP indexes are 4 and 7.
- Take the parallel group whose DP index is 4 as an example. Right-click 4 and click Expand to go to the DP + PP + TP dimension, as shown in Figure 7. In this case, set Performance Metric to TP-Communication. Card 38 is identified as the slow card. This card impacts the TP domain (32–39) and further affects DP Indexes 0 to 7 in Figure 6.
- After the slow card is located, right-click the connection line of a communication domain (for example, the green line indicates the TP communication domain in the figure) to view the communication duration analysis. Go to the Communication page to further analyze the communication process of the slow card, as shown in Figure 8 and Figure 9.
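The folding arithmetic in the 512-card example can be checked with a short sketch. It assumes that the cards of one TP domain have contiguous IDs, which matches the example above (card 38 belongs to TP domain 32–39) but depends on how the training framework lays out ranks. The helper names `fold_tp` and `tp_domain` are hypothetical.

```python
def fold_tp(world_size, tp_size):
    """Group contiguous card IDs into TP domains (one folded node each).

    Assumption: TP is the innermost dimension, so a TP domain is a run of
    tp_size consecutive card IDs.
    """
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]


def tp_domain(card, tp_size):
    """Return the card IDs that share a TP domain with the given card."""
    start = (card // tp_size) * tp_size
    return list(range(start, start + tp_size))


folded_nodes = fold_tp(world_size=512, tp_size=8)   # DP8 x PP8 x TP8
slow_card_domain = tp_domain(card=38, tp_size=8)

print(len(folded_nodes))      # number of nodes in the DP + PP view
print(slow_card_domain)       # cards affected together with card 38
```

With these inputs, the folded view has 64 nodes and card 38 falls into the domain covering cards 32 to 39, which is why a single slow card shows up as a slow group in the DP + PP view.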







