Overview

This section describes how to identify the root causes of performance issues and how to resolve them, moving from general problems to specific ones. In high-performance computing and large-scale distributed training, performance bottlenecks are complex and change over time, so analyzing a single dimension is rarely enough to locate an issue quickly. The section covers seven common types of performance issues (see Table 1), ranging from macro-level trend identification to detailed micro-level troubleshooting.

Table 1 Performance issues

Issue Type

Description

Communication issues

During distributed training, overall cluster performance deteriorates because of inefficient inter-node data transmission or an imbalance between fast and slow cards. In this case, use the Cluster Performance Analysis function of MindStudio Insight to make a preliminary analysis of the profiling results and determine which of the two is the cause. For fast and slow cards, compare the timeline of a fast card with that of a slow card to find the root cause. For inefficient communication, tune according to the specific cause (such as small communication packets, communication retransmission, or unaligned bytes).

Operator performance issues

During training, some operators execute inefficiently and consume a large share of resources, becoming the overall performance bottleneck. In this case, use Advisor and MindStudio Insight to automatically identify time-consuming or inefficient operators. Based on call stack and code logic analysis, the tools suggest improvements such as operator replacement, fusion, or tuning.

Delivery exceptions

During model training, overall training efficiency drops because operator delivery is delayed or tasks are allocated unevenly across cards. In this case, check the system environment, task allocation, and core binding policy, and use the CPU running status and stack information to locate the performance deterioration caused by delivery delay.

Cluster performance issues

In a large-scale cluster, overall training performance often deteriorates because of the large number of nodes and the complexity of communication. Compared with a small cluster, the performance data files generated are much larger, which makes the root cause of an issue harder to locate. Use the cluster analyse tool of msprof-analyze to identify abnormal nodes efficiently. The tool helps reduce a large-scale cluster to several small-scale clusters or multi-card systems for further analysis and resolution.

Atlas 200I/500 A2 inference product scenarios

On Atlas 200I/500 A2 inference products, inference performance is limited by a data transfer bottleneck: MTE2/MTE3 instructions consume a large portion of execution time, resulting in high model inference latency and low throughput. In this case, several methods can be applied, including model compression and quantization, ONNX simplification, AOE tuning, msit debug surgeon automatic tuning, and CANN version upgrading. These approaches reduce redundant computation, optimize memory layout, and improve operator execution efficiency, which eases data transfer pressure and improves overall inference performance.

MindIE inference scenarios

Serving inference performance is determined by the scheduling mechanism and the model's inference capability. Serving parameters must be adjusted to suit each test scenario (for example, its concurrency, sequence length, latency, and throughput requirements) while multiple performance metrics are monitored. This section explains how to optimize performance and identify issues. First, run a pure model test to evaluate the upper limit of inference performance and locate bottlenecks. Next, adjust the serving scheduling configuration based on the test results and tuning objectives. Example cases show how to use tools such as pre-check and msServiceProfiler to identify performance issues. Tuning suggestions for DeepSeek models are also provided.

Version upgrade

When performance changes after a recent change to the environment, such as cluster replanning or a version update, use this method to quickly check whether that change is the cause.
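The first triage step described under Communication issues (deciding between fast and slow cards and inefficient communication) can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the logic used by MindStudio Insight or msprof-analyze. The input format (per-rank compute and communication time in milliseconds), the function name triage_communication, and the 10% tolerance are all assumptions made for this example. The idea is that in synchronous training a slow card spends noticeably longer computing than its peers, which then wait for it inside communication; if compute time is balanced across cards, the communication path itself is the more likely suspect.

```python
from statistics import median


def triage_communication(per_rank, rel_tol=0.10):
    """Rough first-pass triage of a communication issue.

    per_rank: dict mapping rank id -> (compute_ms, comm_ms).
              Hypothetical format; real profiling exports differ.
    rel_tol:  how far above the median compute time a rank must be
              to count as a slow card (assumed threshold, tune as needed).

    Returns ("slow_cards", [ranks]) when some ranks compute noticeably
    longer than the median, otherwise ("check_communication", []).
    """
    if not per_rank:
        raise ValueError("per_rank must contain at least one rank")

    # Median is used instead of mean so one straggler does not shift the baseline.
    compute_median = median(compute_ms for compute_ms, _ in per_rank.values())
    threshold = compute_median * (1 + rel_tol)

    slow_ranks = sorted(
        rank
        for rank, (compute_ms, _) in per_rank.items()
        if compute_ms > threshold
    )
    if slow_ranks:
        return "slow_cards", slow_ranks
    # Compute time is balanced, so look at packet size, retransmission, alignment.
    return "check_communication", []


if __name__ == "__main__":
    # Made-up numbers: rank 2 computes longer while the others wait in communication.
    sample = {0: (100.0, 40.0), 1: (101.0, 39.0), 2: (130.0, 10.0), 3: (99.0, 41.0)}
    print(triage_communication(sample))
```

A result of "slow_cards" points to comparing the timelines of the listed ranks against a normal rank; a result of "check_communication" points to the transmission-level causes listed in the table.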