MindStudio Insight Analysis and Locating

Load all the collected data into MindStudio Insight to locate the fault.

  1. On the summary page, communication time accounts for a relatively high proportion of the total time on every card in this communicator. The total computation time (pure computation time + overlapped communication time) accounts for only one-third of the total time, so the problem can be identified as a communication issue.
    Figure 1 Summary page
  2. Switch to the communication page. A large number of cards are out of sync (highlighted in red in the boxed area), which indicates that many operators spend extended periods waiting. The most obviously slow card (card 12) is selected to analyze the cause in detail.
    Figure 2 Communication page
  3. Switch to the timeline page. Card 12 clearly has a large amount of idle time, while many events on the AscendCL side are occupying resources. The preliminary conclusion is that the idle time is caused by excessive memory usage on this card: when memory is requested for new data, the memory must be reorganized, which leads to extended idle periods. Setting export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" addresses the memory fragmentation issue and improves memory utilization. After this setting was applied and verified through debugging, the performance problem was resolved.
    Figure 3 Timeline page
    Figure 4 Timeline page
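
The setting from step 3 can also be applied from inside a Python launch script instead of the shell. The following is a minimal sketch, not an official API: the helper name enable_expandable_segments is hypothetical, and it assumes that PYTORCH_NPU_ALLOC_CONF is read as a comma-separated list of key:value options, so any option that is already set is kept rather than overwritten.

```python
import os


def enable_expandable_segments(env=None):
    """Set expandable_segments:True in PYTORCH_NPU_ALLOC_CONF.

    Hypothetical helper. Assumes the variable holds comma-separated
    key:value options; options that are already present are preserved.
    """
    env = os.environ if env is None else env
    current = env.get("PYTORCH_NPU_ALLOC_CONF", "")

    # Parse the existing options, skipping empty or malformed entries.
    options = {}
    for item in current.split(","):
        if ":" in item:
            key, value = item.split(":", 1)
            options[key.strip()] = value.strip()

    # Turn on expandable segments, replacing any earlier value for this key.
    options["expandable_segments"] = "True"

    env["PYTORCH_NPU_ALLOC_CONF"] = ",".join(
        f"{key}:{value}" for key, value in options.items()
    )
    return env["PYTORCH_NPU_ALLOC_CONF"]


# Typical use: call enable_expandable_segments() at the very top of the
# training script, before torch and torch_npu are imported, so that the
# allocator sees the setting when it initializes.
```

Merging into the existing value, rather than assigning the string directly, avoids silently discarding other allocator options that the job may already rely on.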