Detailed Troubleshooting
Detailed troubleshooting covers three common types of issues: delivery, communication, and computing. For details about how to use the performance tools, see Model Tuning Tools.
- The msprof-analyze tool performs preliminary analysis, identifies performance issues at a fine-grained level, and provides clear direction for further in-depth analysis. For details, see Quick Analysis for Model Tuning (msprof-analyze CLI).
- The MindStudio Insight tool identifies bottlenecks and analyzes root causes. For details, see In-depth Analysis for Model Tuning (MindStudio Insight).
Delivery Issues
Delivery issues are abnormally long delays during operator delivery. Graph Engine (GE) sends the operator execution request to Runtime, which identifies the operator's task type and forwards the request to the device for execution based on that task type. For details, see Workflow for Operator Building and Running in TBE & AI CPU Operator Development Guide.
Under normal conditions, the computing pipeline on the NPU runs continuously without waiting for the CPU. However, if the delivery is delayed, the pipeline is blocked, resulting in low computing power utilization of AI Cores. In such cases, a delivery issue is identified. Use the MindStudio Insight tool to identify the delivery issue by referring to Timeline. If any of the following symptoms occurs, analyze the issue by referring to Task Dispatch Anomaly Analysis.
- Excessive Free time: The Free proportion in Overlap Analysis is far greater than that of Computing and Communication. The ideal Free Time proportion is less than 10%.
Figure 2 Viewing free time
- The HostToDevice connection lines in the red box are almost vertical, and the Device lines in the blue box are relatively idle.
Figure 3 Viewing connections
- Frequent HostToDevice copying interrupts the asynchronous pipeline, causing a delivery bottleneck.
Figure 4 Locating the delivery bottleneck
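The free-time symptom above can be checked numerically once the three Overlap Analysis durations are known. The following is a minimal sketch, not a feature of MindStudio Insight: the function names are hypothetical, the durations are assumed to be copied by hand from the Overlap Analysis view, and the three categories are assumed not to overlap, so that they sum to the step time.

```python
IDEAL_FREE_SHARE = 0.10  # from the guideline above: ideal Free proportion is below 10%


def overlap_shares(computing_us, communication_us, free_us):
    """Return each category's share of the total time (hypothetical helper)."""
    total = computing_us + communication_us + free_us
    if total <= 0:
        raise ValueError("total duration must be positive")
    return {
        "computing": computing_us / total,
        "communication": communication_us / total,
        "free": free_us / total,
    }


def free_time_symptoms(computing_us, communication_us, free_us):
    """Flag the free-time symptom for one set of durations."""
    shares = overlap_shares(computing_us, communication_us, free_us)
    return {
        "free_share": shares["free"],
        # Free is at or above the 10% guideline.
        "above_ideal": shares["free"] >= IDEAL_FREE_SHARE,
        # Free exceeds both Computing and Communication.
        "free_dominates": shares["free"] > max(shares["computing"],
                                               shares["communication"]),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measured data.
    print(free_time_symptoms(computing_us=200.0, communication_us=100.0, free_us=700.0))
```

A result with both flags set matches the "Excessive Free time" symptom. The other two symptoms (near-vertical HostToDevice lines and frequent HostToDevice copying) still need to be confirmed visually in the timeline.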
Communication Issues
In large cluster scenarios, the full cluster profiling data may be too large to analyze directly. You are advised to run the cluster_analyze tool on the full cluster profiling data and import the resulting cluster_analysis_output directory into MindStudio Insight. For details about the tool, see Quick Analysis for Model Tuning (msprof-analyze CLI). Then examine the output to identify any cards that are noticeably faster or slower than the rest, as well as potential communication or transmission issues, and select the profiling data of a few cards for single-card analysis.
- Figure 5 demonstrates the Communication Duration Analysis function on the Communication page of MindStudio Insight. Each color block represents a collective communication operator, and its length represents the execution time of that operator. When the execution time of the same collective communication operator varies significantly between cards, the card with the shortest execution time is the slow card: the other cards spend most of their operator time waiting for it, so their durations are longer.
- In the timeline, the inter-card wait is clearly visible. As highlighted in the red box in Figure 6, fast card rank 6 has completed its computation and is waiting for slow card rank 5 to finish.
- As shown in Figure 7, the duration of the hcom_allGather communication operator of the fast card rank 6 is longer than that of the slow card rank 5, and the duration is mainly caused by synchronization wait.
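The slow-card rule described above (for the same collective communication operator, the card with the shortest duration is the slow one) can be sketched as follows. This is an illustrative helper, not part of msprof-analyze or MindStudio Insight: the function name, the input dictionary, and the `min_spread_ratio` threshold are all assumptions, and the durations are assumed to have been read from the Communication Duration Analysis view.

```python
def find_slow_rank(durations_us, min_spread_ratio=0.2):
    """Return the rank that looks like the slow card, or None.

    durations_us maps rank id -> duration (in microseconds) of the SAME
    collective communication operator on that rank.
    min_spread_ratio is a hypothetical threshold for "varies significantly".
    """
    if len(durations_us) < 2:
        raise ValueError("need durations from at least two ranks")

    # The slow card spends the least time inside the operator,
    # because the other cards are the ones waiting for it.
    slow_rank = min(durations_us, key=durations_us.get)
    shortest = durations_us[slow_rank]
    longest = max(durations_us.values())
    if longest <= 0:
        return None

    # Only report a slow card when the gap between ranks is large enough.
    spread = (longest - shortest) / longest
    return slow_rank if spread >= min_spread_ratio else None


if __name__ == "__main__":
    # Illustrative numbers only: rank 5 has the shortest duration.
    print(find_slow_rank({5: 120.0, 6: 900.0, 7: 880.0}))
```

The threshold exists only to avoid flagging a rank when all durations are nearly equal; pick a value that suits the job, and confirm any candidate in the timeline before drawing conclusions.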
Computing Issues
Computing issues are operator performance issues, a key challenge in deep learning models. Specifically, some basic compute units execute inefficiently, which slows down the entire model and wastes resources. Such issues need to be solved with dedicated analysis tools and code tuning techniques. For example, when evaluating the performance of a fused operator, you can compare metrics such as computing time and memory usage under different configurations. For details about how to locate and resolve these issues, see Operator Performance Tuning Solutions.
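As an illustration of comparing computing time under different configurations, the sketch below times several candidate callables on the host and reports the median of repeated runs. It is a generic harness under stated assumptions: the helper names are hypothetical, the candidates are plain Python callables, and host-side timing is assumed to be meaningful. For operators that execute asynchronously on a device, the host clock alone does not capture device execution time, so the profiling tools remain the reference for real measurements.

```python
import statistics
import time


def median_runtime_s(fn, repeats=5, warmup=1):
    """Median wall-clock time of fn() over `repeats` runs, after warm-up."""
    for _ in range(warmup):
        fn()  # warm-up runs are not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare_configs(candidates, repeats=5):
    """Map each configuration name to its median runtime in seconds."""
    return {name: median_runtime_s(fn, repeats=repeats)
            for name, fn in candidates.items()}


if __name__ == "__main__":
    # Stand-in workloads; replace with the configurations under test.
    results = compare_configs({
        "config_a": lambda: sum(range(10_000)),
        "config_b": lambda: sum(range(100_000)),
    })
    for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name}: {seconds * 1e6:.1f} us")
```

The median is used rather than the mean so that a single outlier run does not dominate the comparison.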


