Detailed Troubleshooting

Detailed troubleshooting focuses on three common types of issues: delivery, communication, and computing. For details about how to use the performance tools, see Model Tuning Tools.

Figure 1 Detailed troubleshooting flowchart

Delivery Issues

Delivery issues refer to abnormally long time consumption during operator delivery. Graph Engine (GE) sends the operator execution request to Runtime, which identifies the operator's task type and forwards the request to the device for execution based on that task type. For details, see Workflow for Operator Building and Running in TBE & AI CPU Operator Development Guide.

Under normal conditions, the computing pipeline on the NPU runs continuously without waiting for the CPU. If delivery is delayed, however, the pipeline stalls and AI Core utilization drops. This situation is identified as a delivery issue. Use the MindStudio Insight tool to identify delivery issues by referring to Timeline. If any of the following symptoms occurs, analyze the issue by referring to Task Dispatch Anomaly Analysis.

  • Excessive Free time: In Overlap Analysis, the Free proportion is far greater than the Computing and Communication proportions. Ideally, the Free proportion is less than 10%.
    Figure 2 Viewing free time
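  • Near-vertical HostToDevice lines: The HostToDevice connection lines in the red box are almost vertical, which suggests that tasks start on the device almost as soon as the host delivers them, and the Device lines in the blue box are relatively idle.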
    Figure 3 Viewing connections
  • Frequent HostToDevice copying interrupts the asynchronous pipeline, causing a delivery bottleneck.
    Figure 4 Locating the delivery bottleneck
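
The Free time check above can be sketched as a small calculation. This is only an illustration: it assumes you have read the Computing, Communication, and Free durations from Overlap Analysis as non-overlapping values, and the function names and sample numbers are hypothetical, not part of MindStudio Insight.

```python
def free_proportion(computing_us, communication_us, free_us):
    """Return the share of the total time that the device spent idle (Free)."""
    total = computing_us + communication_us + free_us
    if total <= 0:
        raise ValueError("total duration must be positive")
    return free_us / total


def looks_delivery_bound(computing_us, communication_us, free_us, threshold=0.10):
    """Flag a profile whose Free share exceeds the 10% guideline."""
    return free_proportion(computing_us, communication_us, free_us) > threshold


# Hypothetical durations in microseconds, not taken from a real profile.
step = {"computing_us": 420_000.0, "communication_us": 80_000.0, "free_us": 500_000.0}
print(f"Free proportion: {free_proportion(**step):.1%}")
print("Possible delivery issue:", looks_delivery_bound(**step))
```

A high Free proportion alone only indicates that the device is idle. Confirm a delivery issue with the timeline symptoms listed above before tuning.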

Communication Issues

Communication issues generally refer to abnormal communication between NPUs. Typical symptoms are the presence of fast and slow cards, or communication bandwidth that is far lower than expected. For details, see Cluster Performance Analysis of MindStudio Insight. For details about the solution, see Communication Tuning Solutions.

In large cluster scenarios, the data volume of full cluster profiling may be excessive, making analysis cumbersome. You are advised to run the cluster_analyze tool on the full cluster profiling data and import the resulting cluster_analysis_output directory into MindStudio Insight. For details about the tool, see Quick Analysis for Model Tuning (msprof-analyze CLI). Then examine the output to identify any cards that are noticeably faster or slower, as well as potential communication or transmission issues, and select the profiling data of a few cards for single-card analysis.

  • Figure 5 demonstrates the Communication Duration Analysis function on the Communication page of MindStudio Insight. Each color block represents a collective communication operator, and its length represents the execution time of that operator. When the execution time of one collective communication operator varies significantly between cards, the card with the shortest execution time is the slow card: it reaches the communication last, so it spends little time waiting, while the execution times of the other cards include the time they spend waiting for it.
    Figure 5 Using the Communication Duration Analysis of MindStudio Insight to locate slow cards
  • The timeline shows obvious inter-card waiting. As highlighted in the red box in Figure 6, fast card rank 6 has completed its computation and is waiting for slow card rank 5 to finish.
    Figure 6 Timeline comparison of Profiling results between fast and slow cards (at Ascend hardware layer)
  • As shown in Figure 7, the hcom_allGather communication operator takes longer on fast card rank 6 than on slow card rank 5, and the extra duration is mainly caused by synchronization wait.
    Figure 7 Timeline comparison of Profiling results between fast and slow cards (communication unit)
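
The slow card rule described above can be sketched as follows. This is a minimal illustration, assuming you have collected the duration of one collective communication operator on each rank. The function name, the 0.5 ratio, and the sample durations are hypothetical and are not produced by cluster_analyze or MindStudio Insight.

```python
from statistics import median


def find_slow_rank_candidates(durations_us, ratio=0.5):
    """Return ranks whose duration for one collective operator is far below the median.

    A rank that arrives last at a collective spends little time waiting, so its
    operator duration is short, while the ranks waiting for it show long durations.
    """
    if not durations_us:
        return []
    typical = median(durations_us.values())
    return sorted(rank for rank, d in durations_us.items() if d < typical * ratio)


# Hypothetical per-rank durations (microseconds) of a single hcom_allGather call.
all_gather_us = {0: 980.0, 1: 1010.0, 2: 995.0, 3: 1002.0,
                 4: 990.0, 5: 120.0, 6: 1005.0, 7: 998.0}
print(find_slow_rank_candidates(all_gather_us))
```

The result is only a list of candidates. Check the timeline of each candidate for the inter-card waiting pattern before concluding that the card is slow.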

Computing Issues

Computing issues refer to operator performance issues, a key challenge in deep learning models. Specifically, some basic compute units execute inefficiently, which slows down the entire model and wastes resources. Such issues need to be solved by using dedicated analysis tools and code tuning techniques. For example, when evaluating the performance of a fused operator, you can compare metrics such as computing time and memory usage under different configurations. For details about how to locate and rectify such issues, see Operator Performance Tuning Solutions.

Figure 8 Operator performance issue locating
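
The configuration comparison mentioned for fused operators can be sketched as below. This is a hedged illustration only: OpMeasurement and compare_to_baseline are hypothetical names, and the time and memory values are placeholders that you would replace with numbers from your own profiling results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpMeasurement:
    """One measured configuration of an operator (hypothetical structure)."""
    config: str
    time_us: float
    memory_mb: float


def compare_to_baseline(baseline, candidate):
    """Return the relative change in time and memory; negative means improvement."""
    return {
        "time_change": (candidate.time_us - baseline.time_us) / baseline.time_us,
        "memory_change": (candidate.memory_mb - baseline.memory_mb) / baseline.memory_mb,
    }


# Placeholder measurements, not real profiling data.
unfused = OpMeasurement(config="unfused", time_us=200.0, memory_mb=64.0)
fused = OpMeasurement(config="fused", time_us=150.0, memory_mb=48.0)

change = compare_to_baseline(unfused, fused)
print(f"{fused.config}: time {change['time_change']:+.0%}, memory {change['memory_change']:+.0%}")
```

Comparing relative changes against a fixed baseline keeps results readable when several configurations are measured.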