Overview
This section describes how to identify the root causes of performance issues and resolve them, moving from general problems to specific ones. In high-performance computing and large-scale distributed training, performance bottlenecks are complex and change dynamically, so analyzing a single dimension is rarely enough to locate an issue quickly. This section presents technical approaches for seven common types of performance issues, ranging from macro-level trend identification to detailed micro-level troubleshooting.
| Issue Type | Description |
|---|---|
| Communication issues | During distributed training, overall cluster performance deteriorates because of inefficient inter-node data transmission or slow and fast cards. Use the function described in Cluster Performance Analysis of MindStudio Insight to make a preliminary analysis of the profiling results and determine whether the issue is caused by slow and fast cards or by inefficient communication transmission. For slow and fast cards, compare the timelines of the two cards to determine the root cause. For inefficient communication transmission, tune according to the specific cause, such as small communication packets, communication retransmission, or unaligned bytes. |
| Operator performance issues | During training, some operators execute inefficiently and consume a large amount of resources, becoming the overall performance bottleneck. Use Advisor and MindStudio Insight to automatically identify time-consuming or inefficient operators. Based on analysis of the call stack and code logic, the tools propose an improvement path: operator replacement, fusion, or tuning. |
| Delivery exceptions | During model training, overall training efficiency decreases because of operator delivery delays or unbalanced task allocation across cards. Check the system environment, task allocation, and core binding policy, and use the CPU running status and stack information to locate the performance deterioration caused by delivery delay. |
| Cluster performance issues | In a large-scale cluster, overall training performance often deteriorates because of the large number of nodes and the complexity of communication. Compared with a small cluster, the performance files generated are large, which makes it difficult to locate the root cause of an issue. Use the cluster analyse tool of msprof-analyze to identify abnormal nodes efficiently. The tool helps reduce a large-scale cluster to multiple small-scale clusters or multi-card systems for further analysis and resolution. |
| The | |
| MindIE inference scenarios | Serving inference performance is determined by the scheduling mechanism and the model inference capability. Serving parameters must be adjusted to suit different test scenarios (such as concurrency, sequence length, latency, and throughput) while multiple performance metrics are monitored. This section explains how to optimize performance and identify issues. First, evaluate the inference upper limit through a pure model test to locate bottlenecks. Next, adjust the serving scheduling configuration based on the test results and tuning objectives. Example cases demonstrate how to use tools such as pre-check and msServiceProfiler to identify performance issues. Tuning suggestions for DeepSeek models are also provided. |
| Version upgrade | Quickly check for recent changes, such as cluster replanning or version updates. |
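
The preliminary triage described for communication issues (slow and fast cards versus inefficient communication transmission) can be illustrated with a minimal sketch. It is not the logic used by MindStudio Insight. The `RankStep` record, the `triage` function, and both thresholds are hypothetical, and the per-rank timings are assumed to have been extracted from the profiling results beforehand.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RankStep:
    """Hypothetical per-card summary of one training step."""
    rank: int
    compute_ms: float  # time spent computing
    comm_ms: float     # time spent transmitting data, excluding waiting
    wait_ms: float     # time spent waiting for other cards


def triage(ranks, compute_spread=0.10, comm_share=0.30):
    """Return (diagnosis, suspect_ranks).

    Both thresholds are illustrative assumptions, not documented defaults.
    """
    # Step 1: a card whose compute time is well above the median is a
    # slow card; the other cards wait for it during communication.
    mid = median(r.compute_ms for r in ranks)
    slow = [r.rank for r in ranks if r.compute_ms > mid * (1 + compute_spread)]
    if slow:
        return "slow and fast cards", slow

    # Step 2: compute time is balanced, so check whether data
    # transmission takes a large share of the step on any card.
    heavy = []
    for r in ranks:
        total = r.compute_ms + r.comm_ms + r.wait_ms
        if total > 0 and r.comm_ms / total > comm_share:
            heavy.append(r.rank)
    if heavy:
        return "inefficient communication", heavy

    return "no communication issue detected", []


sample = [
    RankStep(0, 100.0, 20.0, 30.0),
    RankStep(1, 101.0, 20.0, 29.0),
    RankStep(2, 99.0, 20.0, 31.0),
    RankStep(3, 130.0, 20.0, 0.0),
]
print(triage(sample))  # ('slow and fast cards', [3])
```

In this sample, card 3 computes for 130 ms against a median of 100.5 ms, so it is flagged as the slow card. Its timeline would then be compared with that of a fast card to find the root cause.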