Cluster Performance Deterioration Analysis

For the overall approach to analyzing cluster performance problems, see Troubleshooting Principles.

Performance problems in long-running, stable training usually appear as performance fluctuations during an otherwise normal training process. To solve them, first review recent changes and check for hardware issues, then use the profiling tool to pinpoint the details.

To locate problems more efficiently, you are advised to use a two-step approach, rough locating followed by fine locating, as shown in Figure 1.

Figure 1 Locating process

The two-step approach of rough and fine locating is designed to streamline the collection and analysis of profiling data in large-scale cluster training. Rough locating examines recent cluster changes (such as configuration changes and component upgrades) and hardware metrics (such as CPU and memory usage and network throughput) to quickly narrow down the suspicious time periods and modules. If rough locating detects no exceptions, proceed to fine locating: use the profiling tool to analyze computing, communication, and delivery in depth, and determine the root cause from the perspectives of thread stacks, I/O latency, and lock contention. The approach relies on correlating change information with metrics, and uses visualizations such as profiling results, flame graphs, and traces to locate exceptions and analyze root causes, providing a layered diagnosis path for cluster performance optimization.
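The correlation of change information with metrics can be sketched in a few lines of Python. This is a minimal illustration, not part of any tool: the function names, the 1.2x slowdown threshold, the one-hour look-back window, and the sample data are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import median


def find_slow_periods(samples, baseline_count=5, threshold=1.2):
    """Return (start, end) periods where step time exceeds threshold * baseline.

    samples: list of (timestamp, step_time_seconds), sorted by timestamp.
    The baseline is the median of the first `baseline_count` samples.
    """
    baseline = median(t for _, t in samples[:baseline_count])
    periods, start, last = [], None, None
    for ts, step_time in samples:
        if step_time > threshold * baseline:
            if start is None:
                start = ts          # a slow period begins
            last = ts
        elif start is not None:
            periods.append((start, last))  # the slow period has ended
            start = None
    if start is not None:
        periods.append((start, last))
    return periods


def changes_near(periods, changes, window=timedelta(hours=1)):
    """For each slow period, list the changes logged during it or within `window` before it."""
    return {
        (start, end): [name for when, name in changes if start - window <= when <= end]
        for start, end in periods
    }


# Hypothetical data: step times sampled every 10 minutes, plus a change log.
t0 = datetime(2024, 1, 1, 0, 0)
step_times = [1.0, 1.0, 1.1, 1.0, 1.0, 1.0, 1.5, 1.6, 1.0]
samples = [(t0 + timedelta(minutes=10 * i), s) for i, s in enumerate(step_times)]
changes = [
    (t0 + timedelta(minutes=50), "driver upgrade on node-7"),
    (t0 - timedelta(hours=5), "storage quota change"),
]

periods = find_slow_periods(samples)
suspects = changes_near(periods, changes)
```

With this data, the only slow period starts 60 minutes after `t0`, and the only change that falls inside the look-back window is the hypothetical driver upgrade, which makes it the first candidate to examine.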

In addition, to make verification easier, you are advised to use N-partitioning and single-node service modeling tests to reproduce the problem on a smaller scale.
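One way to picture N-partitioning is as a repeated split-and-retest loop. The sketch below assumes a hypothetical callback `reproduces(group)` that runs the workload on a group of nodes and reports whether the performance problem appears; how that callback is implemented depends entirely on the environment and is not shown.

```python
def n_partition(nodes, n):
    """Split `nodes` into at most n contiguous, near-equal, non-empty groups."""
    n = min(n, len(nodes))
    size, extra = divmod(len(nodes), n)
    groups, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        groups.append(nodes[start:end])
        start = end
    return groups


def narrow_down(nodes, reproduces, n=2, min_size=1):
    """Repeatedly partition the nodes and keep a group that still reproduces the problem.

    reproduces(group) -> bool is a hypothetical callback supplied by the caller.
    Returns the smallest reproducing group found.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    current = list(nodes)
    while len(current) > min_size:
        for group in n_partition(current, n):
            if reproduces(group):
                current = group
                break
        else:
            # No subgroup reproduces the problem on its own, so it depends on
            # this scale or on a combination of nodes. Stop here.
            break
    return current


# Hypothetical example: 16 nodes, and the problem follows node-11.
nodes = [f"node-{i}" for i in range(16)]
smallest = narrow_down(nodes, lambda group: "node-11" in group, n=4)
```

If the loop stops at a group larger than one node, the problem is scale-dependent or involves several nodes, and the remaining group is the smallest scale at which to continue the analysis.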

Rough locating
Rough locating is usually performed when no profiling data is available. It relies mainly on the analysis of, and experience gained from, past issues. The most significant problems fall into the following four categories:
- Resource accumulation: Allocated resources are not released in a timely manner or are requested abnormally, so resources accumulate and performance degrades.
- Resource preemption: Other processes run on the host, or other machines perform high-intensity read/write operations on shared storage without proper resource isolation.
- Communication retransmission: Packet loss and other issues trigger communication retransmissions, which degrade performance.
- Environment change: Recent changes include but are not limited to upgrades, use of a new cluster, and the starting or stopping of services.
You can use the following rough locating methods before performing profiling analysis.
- Version check: Collect the current versions of at least the HDK, OS, CANN, and framework. Check the compatibility list to identify compatibility issues, and verify whether these versions have known performance-related problems.
- Change check: Confirm whether any changes have been made recently, including but not limited to version upgrades and cluster re-partitioning.
- Environment check: Check for hardware alarms and verify that key KPIs such as storage (I/O) and network (packet loss) are normal.
- Configuration check: Check whether the typical check items specified in the training pre-check tool guide and inference pre-check tool guide are correctly configured.
The check items for rough locating are derived from historical problem locating experience and apply to most common performance fluctuation problems. For problems that cannot be identified through rough locating, perform fine locating.
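As an illustration of the environment check for communication retransmission, the sketch below estimates the TCP retransmission ratio on a Linux host from two snapshots of `/proc/net/snmp`. It is an assumption-laden example: it covers only host TCP traffic (not RDMA or other interconnect traffic, which expose their own counters), the function names are made up for this example, and what ratio counts as abnormal has to be judged against the cluster's own baseline.

```python
import time


def parse_tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines (names, then values) of /proc/net/snmp into a dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    if len(rows) != 2:
        raise ValueError("expected one header line and one value line for Tcp")
    names, values = rows
    return dict(zip(names, map(int, values)))


def retrans_ratio(before, after):
    """Fraction of TCP segments sent between two snapshots that were retransmissions."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0


def measure(interval_s=10.0, path="/proc/net/snmp"):
    """Take two snapshots `interval_s` seconds apart and return the ratio (Linux only)."""
    with open(path) as f:
        before = parse_tcp_counters(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_tcp_counters(f.read())
    return retrans_ratio(before, after)
```

Comparing the ratio measured during a slow period with the ratio measured during normal training is more informative than any single absolute value.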

In addition, you can use cluster N-partitioning and single-node service modeling tests to reproduce the problem at the smallest possible scale. If the problem is reproduced on a single node, check whether process preemption occurs, whether CPU usage changes, and whether any alarm or error exists in the messages or dmesg log.
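The log check on a single node can be partly automated. The following sketch scans log lines for a few suspicious patterns; the pattern list is only an illustrative assumption and should be adapted to the alarms and errors that matter in the environment being checked.

```python
import re

# Illustrative patterns only; extend or replace them for the actual environment.
SUSPECT_PATTERNS = [
    r"\berror\b",
    r"\bfail(?:ed|ure)?\b",
    r"out of memory",
    r"blocked for more than \d+ seconds",
    r"call trace",
]


def scan_log(lines, patterns=SUSPECT_PATTERNS):
    """Return (line_number, line) for every line that matches any pattern, ignoring case."""
    regex = re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]


# Hypothetical log excerpt used to demonstrate the function.
sample_log = [
    "[   10.0] usb 1-1: new high-speed USB device",
    "[   20.0] Out of memory: Killed process 4242 (python)",
    "[   30.0] INFO: task python:4242 blocked for more than 120 seconds.",
    "[   40.0] eth0: link up",
]
hits = scan_log(sample_log)
```

On a real node, the lines could come from the output of the `dmesg` command or from the messages log file, whose location depends on the OS distribution; reading either may require elevated privileges.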

Fine locating
For the fine locating procedure, see Detailed Troubleshooting, and use the model tuning tools to further locate and analyze exceptions.

Cross-check the conclusions of rough locating and fine locating to confirm the root cause and determine the solution.
