Case on Operator Comparison for Locating Fast and Slow Cards

If the Summary page of MindStudio Insight shows that computing time fluctuates across cards, so that the cluster contains slow and fast cards, you can compare operator time consumption between a slow card and a fast card to quickly locate the source of the difference. This complements the method described in Precise Divergence Point Analysis for Fast and Slow Cards.

Locating the Fast and Slow Cards on the Summary Page

Go to the Computation/Communication Overview area on the Summary page of MindStudio Insight. In this example, cards 0 to 7 are slow computing cards (long computing time and short communication time), and cards 8 to 15 are fast computing cards (short computing time and long communication time). The fast cards show a long communication time because they finish computing earlier and then wait for the slow cards.

Figure 1 Computation/Communication Overview page
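If the per-card computing times are available outside the GUI, the same grouping can be reproduced with a few lines of Python. The following is a minimal sketch, not a MindStudio Insight API: the function name split_fast_slow, the input dictionary, the 5% tolerance, and all timing values are assumptions made up for illustration.

```python
from statistics import median

def split_fast_slow(compute_ms, tolerance=0.05):
    """Split cards into slow and fast groups by computing time.

    compute_ms: dict mapping card ID -> computing time in ms (hypothetical input).
    tolerance: relative deviation from the median that a card must exceed
               to be counted as slow or fast (assumed value, tune as needed).
    """
    mid = median(compute_ms.values())
    slow = sorted(c for c, t in compute_ms.items() if t > mid * (1 + tolerance))
    fast = sorted(c for c, t in compute_ms.items() if t < mid * (1 - tolerance))
    return slow, fast

# Made-up numbers that mirror the example: cards 0-7 compute longer than cards 8-15.
compute_ms = {card: (120.0 if card < 8 else 90.0) for card in range(16)}
slow, fast = split_fast_slow(compute_ms)
print("slow cards:", slow)
print("fast cards:", fast)
```

Picking one card from each group (for example, cards 7 and 8) gives a suitable pair for the operator comparison.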

Comparing Operator Differences

As described in Operator, you can quickly locate the operators that cause the time consumption difference. As shown in Figure 2, select cards 7 and 8 (one slow card and one fast card) in inter-card comparison mode and sort the operators by total time consumption in ascending order. Then check the results in two steps:

If the number of operators differs greatly between the fast and slow cards, the computing load is unbalanced. In this case, confirm with the model developers whether the load imbalance can be avoided.

If the number of operators of a given type is the same but the average time consumption differs, contact the owner of that operator, or use the method in Precise Divergence Point Analysis for Fast and Slow Cards to further locate the root cause on the Timeline.

Figure 2 Operator comparison between cards
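The two checks on operator count and average time consumption can also be expressed as a small script. The following is a minimal sketch under stated assumptions: the per-operator statistics are hypothetical hand-written dictionaries rather than an actual MindStudio Insight export, and the function name compare_operators and the 10% threshold are illustrative choices.

```python
def rel_diff(a, b):
    """Relative difference between two non-negative numbers (0.0 if both are 0)."""
    largest = max(a, b)
    return abs(a - b) / largest if largest > 0 else 0.0

def compare_operators(slow_ops, fast_ops, threshold=0.1):
    """Classify operator types that differ between a slow card and a fast card.

    slow_ops / fast_ops: dict mapping operator type -> (count, total_time_us).
    Returns a dict mapping operator type -> "count" (possible load imbalance)
    or "avg_time" (same count, different average time per call).
    """
    findings = {}
    for op in sorted(set(slow_ops) | set(fast_ops)):
        s_count, s_total = slow_ops.get(op, (0, 0.0))
        f_count, f_total = fast_ops.get(op, (0, 0.0))
        if rel_diff(s_count, f_count) > threshold:
            findings[op] = "count"
            continue
        if s_count == 0 or f_count == 0:
            continue  # nothing to average
        if rel_diff(s_total / s_count, f_total / f_count) > threshold:
            findings[op] = "avg_time"
    return findings

# Made-up statistics for card 7 (slow) and card 8 (fast).
slow_ops = {"MatMul": (100, 5000.0), "Add": (200, 800.0), "Cast": (50, 100.0)}
fast_ops = {"MatMul": (100, 3500.0), "Add": (120, 480.0), "Cast": (50, 101.0)}
findings = compare_operators(slow_ops, fast_ops)
print(findings)
```

With these sample values, Add is reported under "count" (a question for the model developers) and MatMul under "avg_time" (a question for the operator owner or for Timeline analysis), while Cast is within the threshold and is not reported.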

Alternatively, you can open the KernelCompare comparison page of the compare tool and analyze operator differences there. For details, see Quick Analysis for Model Tuning (msprof-analyze CLI).

Figure 3 KernelCompare comparison page of the compare tool