Fast and Slow Card Locating Case on the Timeline

This section describes how to locate fast and slow card faults using MindStudio Insight.

If a fast and slow card issue has been preliminarily identified in the cluster (for details, see Summary), and the transmit time is low while the wait or synchronization time is high (for details, see Communication), the cluster has the issue. In this case, view Communication Time Analysis, where the communication operators are tiled horizontally, to determine which card is slow.
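
The check described above can be sketched in a few lines of Python. This is only an illustration: the `CommTime` layout, the field names, the 0.5 threshold, and the sample values are all assumptions made for this example and are not part of MindStudio Insight or its export format.

```python
from dataclasses import dataclass


@dataclass
class CommTime:
    """Per-card communication time breakdown (hypothetical layout)."""
    card: int
    transmit_ms: float
    wait_ms: float
    sync_ms: float


def wait_share(t: CommTime) -> float:
    """Fraction of communication time spent waiting or synchronizing."""
    total = t.transmit_ms + t.wait_ms + t.sync_ms
    return 0.0 if total == 0 else (t.wait_ms + t.sync_ms) / total


def cards_dominated_by_waiting(times, threshold=0.5):
    """Return cards whose wait share exceeds the (assumed) threshold."""
    return sorted(t.card for t in times if wait_share(t) > threshold)


# Illustrative values only, not measured data.
sample = [
    CommTime(card=13, transmit_ms=8.0, wait_ms=1.0, sync_ms=1.0),
    CommTime(card=14, transmit_ms=8.0, wait_ms=30.0, sync_ms=2.0),
]
flagged = cards_dominated_by_waiting(sample)
```

A card flagged here spends most of its communication time waiting, which suggests that it is waiting for a slower peer rather than being slow itself.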

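As shown in Figure 1, for the hcom allGather collective communication operators in green, cards 4, 5, and 13 with short duration are slow cards, and cards (such as cards 11 and 14) with long duration are relatively fast cards. A slow card reaches the collective communication last and therefore spends little time waiting, whereas a fast card arrives early and must wait for the others, which lengthens its communication operator.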
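
The rule of thumb above (short operator duration means slow card, long duration means fast card) can be expressed as a small Python sketch. The function name, the median-based cut-off, the `rel_tol` parameter, and the sample durations are assumptions chosen for illustration; they are not values read from Figure 1.

```python
from statistics import median


def split_fast_slow(durations_us, rel_tol=0.5):
    """Split cards by the duration of the same collective operator.

    Cards far below the median duration are slow-card candidates
    (they arrive last and barely wait); cards far above it are
    fast-card candidates (they arrive early and wait).
    """
    mid = median(durations_us.values())
    slow = sorted(c for c, d in durations_us.items() if d < mid * (1 - rel_tol))
    fast = sorted(c for c, d in durations_us.items() if d > mid * (1 + rel_tol))
    return slow, fast


# Hypothetical allGather durations in microseconds, keyed by card ID.
sample = {0: 500.0, 1: 520.0, 2: 480.0, 4: 120.0, 5: 130.0,
          11: 900.0, 13: 110.0, 14: 950.0}
slow, fast = split_fast_slow(sample)
print("slow candidates:", slow)
print("fast candidates:", fast)
```

With these made-up numbers, cards 4, 5, and 13 come out as slow-card candidates and cards 11 and 14 as fast-card candidates, mirroring the pattern described for Figure 1.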

The next step is to analyze what the slow card is doing while the other cards are waiting for it. Go to the Timeline module to view the specific differences.

Figure 1 Communication operator thumbnail
  1. Right-click a communication operator and choose Find in Timeline from the shortcut menu. On the Timeline view displayed, mark the approximate range with a flag to facilitate subsequent locating.
    Figure 2 Switching from the communication operator to the Timeline view
    Figure 3 Marking the approximate range with a flag
  2. Select a step and click Fit to screen. Then pin the slow card (card 13) and the fast card (card 14) to the top and compare their Overlap Analysis lanes to determine the source of the difference. (The Overlap Analysis lane shows the computing and communication tasks at the Ascend Hardware layer in a unified manner, making the comparison clearer.)
    Figure 4 Limiting the troubleshooting area to a step (full-screen display by step)
    Figure 5 Pinning the slow card (card 13) and fast card (card 14) to the top to compare the Overlap Analysis lanes and determine the source of the difference
  3. Check the statistics for the selected area. They show that the slow card lane contains far more computing tasks than the fast card lane.
    Figure 6 Selected area statistics

    The difference mainly comes from the second half, indicating that the problem is caused by load imbalance.

    Figure 7 Operator quantity comparison
  4. Use the async_npu connection lines to determine which Python API delivers the extra operators. (You can select a specific area to display only its connection lines instead of displaying all of them.)
    Figure 8 async_npu delivery connections

    According to the statistics from the selected area at the Ascend Hardware layer, for the same API on the Python side, the fast card delivers 1,218 computing operators and the slow card delivers 3,303 computing operators. Therefore, the slow card results from the unbalanced load delivered by this API on the Python side.

  5. Confirm with the model development personnel whether the load imbalance can be avoided.
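
To put the operator counts from step 4 in perspective, the short Python sketch below computes the gap between the two cards. Only the two counts (1,218 and 3,303) come from this case; the function name and the returned fields are illustrative assumptions.

```python
def compare_operator_counts(fast_count: int, slow_count: int) -> dict:
    """Return the absolute and relative gap in delivered computing operators."""
    if fast_count <= 0:
        raise ValueError("fast_count must be positive")
    return {
        "extra": slow_count - fast_count,
        "ratio": slow_count / fast_count,
    }


# Counts reported for the same Python-side API in this case.
result = compare_operator_counts(fast_count=1218, slow_count=3303)
print(f"extra operators: {result['extra']}, ratio: {result['ratio']:.2f}")
# prints "extra operators: 2085, ratio: 2.71"
```

The slow card therefore delivers roughly 2.7 times as many computing operators as the fast card for this API, which is worth quoting when discussing the load imbalance with the model developers.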