Locating Problems Using MindStudio Insight

MindStudio Insight provides a rich set of tuning and analysis tools that visualize real hardware and software runtime data. It supports multidimensional analysis of profile data, helps identify performance bottlenecks, and can visualize performance analysis for clusters of hundreds or thousands of cards and beyond. Following Importing Data in MindStudio Insight User Guide, import the profile data collected in the previous step into MindStudio Insight, then use its visualization capabilities to analyze the data with the process below.

Data Overview on the Summary Page

You can learn about the details of each module on the Summary page. For details, see Summary in MindStudio Insight User Guide.

  1. In the "Parallel Strategy Analysis" module, different parallel strategy analysis perspectives are provided, such as TP, PP, and DP:
    1. Click the checkbox to select the parallelism strategy, and use the box annotations to identify parallel groups.
    2. Selecting different data types will display the corresponding heatmap. Based on the heatmap, you can identify communicators with performance issues, with darker red indicating worse performance.
    Figure 1 Parallelism strategy analysis
  2. The computation/communication overview area displays the proportion of computation, communication, and idle time for each card under the selected communicator, and provides expert recommendations. As shown in the figure below, the non-overlapped communication time of card 391 is significantly smaller than that of the other cards, so it can be preliminarily inferred that there is a communication issue within the cluster.
    Figure 2 Computation/Communication overview
    Meanings of data metrics are as follows:

    Legend                                 Meaning
    Total computation time                 Total kernel time on the Ascend device.
    Pure computation time                  Pure computation time = Total computation time - Communication time (overlapped)
    Communication time (overlapped)        Overlapped communication time, which refers to the duration during which computation and communication occur simultaneously.
    Communication time (non-overlapped)    Non-overlapped communication time, which refers to the pure communication duration.
    Idle time                              The duration during which neither computation nor communication is performed.
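
The relationships among these metrics can be sketched in a few lines of Python. This is an illustrative sketch, not part of MindStudio Insight: the `CardTimes` record, its field names, and the sample values are hypothetical, and it assumes that a card's step time is the sum of total computation time, non-overlapped communication time, and idle time.

```python
from dataclasses import dataclass


@dataclass
class CardTimes:
    """Per-card durations in microseconds (hypothetical record layout)."""
    total_computation: float             # total kernel time on the device
    overlapped_communication: float      # communication hidden behind computation
    non_overlapped_communication: float  # pure communication time
    idle: float                          # neither computing nor communicating


def pure_computation(t: CardTimes) -> float:
    """Pure computation time = total computation time - overlapped communication."""
    return t.total_computation - t.overlapped_communication


def proportions(t: CardTimes) -> dict:
    """Share of the step taken by computation, communication, and idle time.

    Assumption: overlapped communication is already contained in the total
    computation time, so it is not added to the step time again.
    """
    step = t.total_computation + t.non_overlapped_communication + t.idle
    return {
        "computation": t.total_computation / step,
        "communication_non_overlapped": t.non_overlapped_communication / step,
        "idle": t.idle / step,
    }


card = CardTimes(
    total_computation=600.0,
    overlapped_communication=150.0,
    non_overlapped_communication=300.0,
    idle=100.0,
)
print(pure_computation(card))  # 450.0
print(proportions(card))
```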

Different metrics can locate different performance issues:

  • Computation issue: Typically shows up as an excessive difference between the maximum and minimum proportions of total computation time in the communicator. If the computation time of certain cards clearly exceeds the normal range, those cards are likely handling an excessively heavy computation task, such as processing too much data or a highly complex model. It could also be that the cards' performance is limited.
  • Scheduling issue: Typically shows up as an excessive difference between the maximum and minimum proportions of idle time in the communicator. If the idle time of a card is too long, it indicates abnormal task dispatch from the host to the device, which also degrades the performance of the cluster.
  • Communication issue: If the non-overlapped communication time is too long, computation and communication are not overlapping well, which can have several causes. The communication protocol may not be optimized, or unstable network bandwidth may be preventing communication from overlapping with computation.
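
As a rough illustration of these three rules, the sketch below flags each issue type from per-card proportions. It is not how MindStudio Insight produces its expert recommendations: the dictionary keys and both thresholds (`spread_threshold`, `comm_threshold`) are hypothetical values that would need tuning for a real cluster.

```python
def spread(values):
    """Difference between the largest and smallest value."""
    values = list(values)
    return max(values) - min(values)


def diagnose(cards, spread_threshold=0.10, comm_threshold=0.25):
    """Return the issue types suggested by per-card time proportions.

    cards maps a card ID to its proportions of computation,
    non-overlapped communication, and idle time (each between 0 and 1).
    Both thresholds are hypothetical.
    """
    computation = [c["computation"] for c in cards.values()]
    idle = [c["idle"] for c in cards.values()]
    comm = [c["communication_non_overlapped"] for c in cards.values()]

    findings = []
    # Large max-min gap in computation share: some cards carry more work.
    if spread(computation) > spread_threshold:
        findings.append("computation")
    # Large max-min gap in idle share: host dispatch is likely abnormal.
    if spread(idle) > spread_threshold:
        findings.append("scheduling")
    # Long non-overlapped communication: poor computation/communication overlap.
    if max(comm) > comm_threshold:
        findings.append("communication")
    return findings


cards = {
    0: {"computation": 0.60, "communication_non_overlapped": 0.30, "idle": 0.10},
    1: {"computation": 0.80, "communication_non_overlapped": 0.12, "idle": 0.08},
}
print(diagnose(cards))  # ['computation', 'communication']
```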

Computation Issues

When the data metrics indicate a computation issue, you can directly view the operator data of the abnormal card and compare it with that of the normal cards using MindStudio Insight's inter-card performance comparison feature. Follow the Instructions in MindStudio Insight User Guide to set the two cards into comparison mode and view the results on the operator page. The following figure is an example of a computation issue. Although the operator counts are equal, the average execution time of MatMul operators is significantly higher on one card, which causes the computation time difference between the two cards.

Figure 3 Computation operator type

Based on experience, MatMul operators are likely to degrade under specific shapes. You can switch the grouping method to "Computing Operator Name and Input Shape", and sort by total execution time to further identify which shapes cause the most severe degradation in MatMul operators. After locating the operator issue, you can consult the operator developers to further confirm the cause of the issue.

Figure 4 Computation operator name and input shape
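
The same grouping and sorting can be reproduced offline on exported operator data. The following is a minimal sketch under stated assumptions: the `(name, input_shape, duration_us)` record layout and the sample values are hypothetical and do not reflect the actual export format of MindStudio Insight.

```python
from collections import defaultdict


def total_time_by_name_and_shape(records):
    """Sum operator durations per (name, input shape), longest first.

    records is an iterable of (name, input_shape, duration_us) tuples,
    a hypothetical layout used only for this illustration.
    """
    totals = defaultdict(float)
    for name, input_shape, duration_us in records:
        totals[(name, input_shape)] += duration_us
    # Sort by total execution time in descending order.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


records = [
    ("MatMul", "4096,1024;1024,4096", 120.0),
    ("MatMul", "4096,1024;1024,4096", 130.0),
    ("MatMul", "512,512;512,512", 10.0),
    ("Add", "4096,4096;4096,4096", 15.0),
]

ranked = total_time_by_name_and_shape(records)
for (name, input_shape), total_us in ranked:
    print(name, input_shape, total_us)
```

Running the same function on the records of a normal card and comparing the two rankings shows which shapes account for the difference.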

Scheduling Issues

When the data metrics indicate a scheduling issue, you need to go to the timeline page to compare the abnormal and normal cards, and further identify the problematic operators. You can refer to Timeline in MindStudio Insight User Guide to understand the details. Access the timeline page and select the HostToDevice connection type. The HostToDevice page displays the delivery and execution relationships from CANN operators to AscendHardware operators and from CANN operators to HCCL communication operators, which are used to locate scheduling issues.

The HostToDevice connections usually take one of two forms: diagonal or vertical. The following figure is an example of a scheduling issue. If the HostToDevice connection is diagonal, as shown on the left, task scheduling during this period is reasonable, and the Ascend device is operating at full capacity, performing both computation and communication tasks. If the HostToDevice connection is vertical, as shown on the right, the Ascend device quickly completes the tasks dispatched by the CPU and is not fully loaded with computation and communication tasks. This generally indicates a scheduling issue. In this case, you can tune performance by increasing the batch size, binding cores, or replacing operators with fused ones.

Figure 5 Scheduling issue
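
One way to reason about the two shapes is the delay between the moment the host dispatches a task and the moment the device starts executing it: a long delay draws a diagonal line, and a near-zero delay draws a vertical one. The sketch below is only an illustration of that idea. The `(host_launch_us, device_start_us)` pairs, the threshold, and the assumption that host and device timestamps share one clock are all hypothetical.

```python
from statistics import median


def connection_shape(pairs, vertical_threshold_us=10.0):
    """Classify HostToDevice connections as 'diagonal' or 'vertical'.

    pairs holds (host_launch_us, device_start_us) timestamps for each task.
    The threshold is a hypothetical value, not a MindStudio Insight setting.
    """
    delays = [device_start - host_launch for host_launch, device_start in pairs]
    # A small delay means the device was already waiting for work.
    return "vertical" if median(delays) < vertical_threshold_us else "diagonal"


# Device still busy with queued tasks when new ones are dispatched.
healthy = [(0.0, 500.0), (20.0, 560.0), (40.0, 620.0)]
# Device runs each task as soon as it arrives, then waits for the host.
starved = [(0.0, 3.0), (200.0, 204.0), (400.0, 402.0)]

print(connection_shape(healthy))  # diagonal
print(connection_shape(starved))  # vertical
```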

Communication Issues

When the data metrics indicate a communication issue, you need to access the communication page for further analysis. The communication page is used to display the network link performance across the cluster and the communication performance of all nodes. By analyzing the overlapped duration between cluster communication and computation, slow hosts or nodes in the cluster training can be identified. Typically, we analyze performance issues based on key metrics such as the communication matrix and communication duration.

Communication Matrix

Figure 6 Communication matrix

The above figure is the visualized communication matrix on MindStudio Insight, where you can obtain information such as bandwidth, transfer size, link type, and transfer duration between cards for each communicator.

  1. You can first check the transfer size to analyze whether there are any differences in the transfer volume of each card in this collective communication, and whether there is any uneven distribution.
  2. Next, check the transfer duration and bandwidth. If the transfer duration or bandwidth between certain cards is abnormal or differs greatly from that of the other cards, there are abnormal links within the communicator.
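
The two checks above can be approximated on exported link data as follows. This is a hedged sketch: the `links` layout, the sample values, and the `ratio` cut-off are hypothetical, and a real analysis should compare only links of the same link type.

```python
from statistics import median


def bandwidth_gb_per_s(size_bytes, duration_us):
    """Bandwidth in GB/s: 1 byte per microsecond equals 1 MB/s."""
    return size_bytes / duration_us / 1e3


def slow_links(links, ratio=0.5):
    """Return links whose bandwidth is below ratio * median bandwidth.

    links maps (source card, destination card) to
    (transfer size in bytes, transfer duration in microseconds).
    """
    bandwidth = {
        pair: bandwidth_gb_per_s(size, duration)
        for pair, (size, duration) in links.items()
    }
    typical = median(bandwidth.values())
    return sorted(pair for pair, value in bandwidth.items() if value < ratio * typical)


# Equal transfer sizes, but one link takes about five times longer.
links = {
    (0, 1): (100_000_000, 5_000.0),
    (1, 2): (100_000_000, 5_200.0),
    (2, 3): (100_000_000, 25_000.0),
    (3, 0): (100_000_000, 5_100.0),
}
print(slow_links(links))  # [(2, 3)]
```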

Communication Duration

Communication duration refers to the time spent on communication between compute cards. Many factors can lead to excessive communication time, such as incorrect communication protocol configurations and excessive data transfer volume. Only by identifying the links with excessive communication time and resolving the underlying issues can data be transmitted smoothly between compute cards, thereby improving the overall performance of the cluster.

After selecting a specific communicator, you can view a summary of the time spent by each compute card in the communicator, along with the timing diagram and communication duration distribution diagram of each communication operator. This gives a quick view of the relative positions of the communication operators and of the detailed communication data.

Figure 7 Communication duration analysis

The above figure is an analysis of communication duration within a specific communicator from a profile data set. From the communication duration data, we found that card 4 has the longest synchronization time, card 0 has the shortest synchronization time, and there is a noticeable difference in synchronization time. A longer synchronization time generally means that this card is waiting for other cards, while a shorter synchronization time generally means that other cards are waiting for this card. Based on this, it can be preliminarily determined that card 4 is a fast card and card 0 is a slow card.
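
The fast-card and slow-card reasoning can be written down as a short sketch. The `sync_time_us` mapping and its values are hypothetical sample data, not taken from the figure.

```python
def fast_and_slow_cards(sync_time_us):
    """Return (fast card, slow card) from per-card synchronization time.

    The card with the longest synchronization time spends the most time
    waiting for others, so it is the fast card. The card with the shortest
    synchronization time is the one others wait for, so it is the slow card.
    """
    fast = max(sync_time_us, key=sync_time_us.get)
    slow = min(sync_time_us, key=sync_time_us.get)
    return fast, slow


# Hypothetical synchronization time per card, in microseconds.
sync_time_us = {0: 120.0, 1: 900.0, 2: 950.0, 4: 1800.0}
print(fast_and_slow_cards(sync_time_us))  # (4, 0)
```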