Overview

Fast and slow cards are relative concepts. A fast card is one that completes a given computing task earlier than the other cards in a cluster, while a slow card finishes the same task later. Collective communication requires all participating cards to coordinate, so if cards complete their tasks at different times, the fast cards must wait for the slow cards before communication can proceed. This waiting degrades the performance of the whole cluster.
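The waiting effect described above can be illustrated with a minimal sketch. It assumes a simplified model in which the collective starts only once every card has finished computing, so each card idles for the difference between the slowest card's time and its own. The function name and the per-card timings are hypothetical and chosen only for illustration; they are not taken from MindStudio Insight.

```python
def wait_times(compute_ms):
    """Return how long each card idles before a synchronizing collective can start.

    Simplified model (an assumption): the collective begins when the slowest
    card finishes, so idle time = slowest compute time - own compute time.
    """
    slowest = max(compute_ms)
    return [slowest - t for t in compute_ms]


# Hypothetical per-card compute times in milliseconds for one training step.
compute_ms = [100.0, 102.0, 101.0, 130.0]

waits = wait_times(compute_ms)
for card, (t, w) in enumerate(zip(compute_ms, waits)):
    print(f"card {card}: compute {t:.1f} ms, waits {w:.1f} ms")

# Under this model the compute phase of the step is bounded by the slowest card.
print(f"compute phase ends at {max(compute_ms):.1f} ms")
```

In this example a single slow card (card 3) forces every other card to idle for roughly 30 ms, which is why one slow card can slow down the entire cluster.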

Fast and slow cards can have many different causes. The general troubleshooting approach is precise divergence point analysis: locate the point at which the fast and slow cards start to differ, then compare the two on the Timeline tab of MindStudio Insight to determine the specific root cause.

Common causes include load imbalance and performance fluctuations in computing, host delivery, and data loading.

The main contents are as follows: