Overview
Fast and slow cards are relative concepts. Within a cluster, a fast card is one that completes a computing task earlier than other cards, while a slow card finishes the same task later. Collective communication requires coordination between cards, so if cards complete their tasks at different times, the fast cards must wait for the slow cards before communication can proceed. This waiting degrades overall cluster performance.
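The waiting effect described above can be illustrated with a minimal sketch. The card names and completion times below are hypothetical values chosen for illustration, not measurements from MindStudio Insight, and the function `wait_times` is an assumed helper, not part of any tool's API.

```python
def wait_times(compute_end_ms):
    """Return how long each card idles before the collective can start.

    Assumes the collective begins only when the slowest card has
    finished its compute task (a simplification of real behavior).
    """
    slowest = max(compute_end_ms.values())
    return {card: slowest - end for card, end in compute_end_ms.items()}


# Hypothetical per-card compute completion times, in milliseconds.
compute_end_ms = {"card0": 100.0, "card1": 102.0, "card2": 135.0, "card3": 101.0}

waits = wait_times(compute_end_ms)
slowest_card = max(compute_end_ms, key=compute_end_ms.get)

for card, wait in waits.items():
    print(f"{card}: waits {wait:.1f} ms")
print(f"slow card: {slowest_card}")
```

In this sketch the slow card has zero wait time, while every fast card idles until the slow card finishes, so a single slow card sets the pace for the whole step.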
Fast and slow cards can have various causes. The general troubleshooting approach is precise divergence point analysis: compare the fast and slow cards on the Timeline tab of MindStudio Insight to identify how they differ, and from that determine the specific root cause.
Common causes include load imbalance and performance fluctuations in computing, host delivery, and data loading.
This section covers the following topics:
- Precise Divergence Point Analysis for Fast and Slow Cards: General roadmap for fast and slow card analysis
- Fast and Slow Card Locating Case on the Timeline: Using the Timeline tab of MindStudio Insight to locate fast and slow card issues
- Case on Operator Comparison for Locating Fast and Slow Cards: Using the operator comparison function to locate fast and slow card issues
- Fast and Slow Card Cases: More typical cases of fast and slow cards