Quick Troubleshooting
Quick checks cover two scenarios: the unpacking scenario and the long-term stability scenario. The details are as follows:
- Unpacking scenario: Performance issues in this scenario typically occur the first time a model is loaded. Define the tuning objective first. If performance benchmarks from competing products are available, use tools such as Profiling to analyze the differences in detail. Optimize the parallel strategy and determine the optimal configuration before starting a task. If the issue persists, continue with the checks for the long-term stability scenario.
- Long-term stability scenario: Performance issues in this scenario usually arise when a system that has been running normally for a period suddenly degrades in performance or develops other problems. Perform the following checks:
  - Change check: Check whether any changes have been made recently, including but not limited to cluster replanning and version changes. If the performance issue appeared after such a change, roll the change back if possible. If the change is confirmed as the cause, focus on how the version update or the associated operations (such as a restart) affect the cluster. For more details, see Methodology for Locating Performance Deterioration During Version Upgrade.
  - Hardware check: When performance fluctuates, check whether hardware faults occurred at the corresponding time, for example, hardware alarms for NPU frequency reduction or network packet loss. This check is an initial step and focuses primarily on hardware alarms. If no hardware alarms or key events such as packet loss are found, refer to Detailed Troubleshooting.
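The change check and hardware check both come down to matching the time of a performance drop against recorded events. The following is a minimal Python sketch of that matching, not a tool provided by the product. The `Event` record, the `"change"` and `"hardware"` labels, the `triage` function, and the default time windows are all hypothetical and would need to be adapted to however your cluster records changes and alarms.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    """A recorded cluster event (hypothetical structure for illustration)."""
    time: datetime
    kind: str    # "change" or "hardware" (assumed labels)
    detail: str  # for example "version upgrade" or "NPU frequency reduction"


def triage(
    degraded_at: datetime,
    events: list[Event],
    lookback: timedelta = timedelta(days=1),          # assumed window for recent changes
    alarm_window: timedelta = timedelta(minutes=10),  # assumed window around the drop
) -> tuple[str, list[Event]]:
    """Suggest which quick check applies to a performance drop at `degraded_at`."""
    # Change check: changes made shortly before the drop.
    changes = [
        e for e in events
        if e.kind == "change" and timedelta(0) <= degraded_at - e.time <= lookback
    ]
    if changes:
        return "change check", changes

    # Hardware check: alarms raised close to the time of the drop.
    alarms = [
        e for e in events
        if e.kind == "hardware" and abs(degraded_at - e.time) <= alarm_window
    ]
    if alarms:
        return "hardware check", alarms

    # Nothing matched: continue with Detailed Troubleshooting.
    return "detailed troubleshooting", []
```

Calling `triage` with the time of the drop and the list of recorded events returns the suggested check together with the matching events. An empty match list means that neither quick check found a candidate cause.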