Hands-on Skills for Problem Locating

In large-scale scenarios where a training task must finish within a specified period, first define an optimization objective, then weigh the cost of locating the problem against its benefit: compare the time lost to experiments and data collection in the production cluster with the time that the optimization is expected to save.
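The comparison above can be sketched as simple arithmetic. This is a minimal illustration, not part of any tool: the function name, its parameters, and the numbers in the usage note are all assumptions made for the example.

```python
def locating_net_gain(remaining_steps, step_time_s, step_time_reduction, locating_cost_s):
    """Estimate whether problem locating is worth its cost.

    remaining_steps     -- training steps still to run (hypothetical input)
    step_time_s         -- current average time per step, in seconds
    step_time_reduction -- expected fractional reduction in step time (0.25 = 25%)
    locating_cost_s     -- time lost to experiments and collection, in seconds
    Returns (net_gain_s, worthwhile).
    """
    if not 0.0 <= step_time_reduction < 1.0:
        raise ValueError("step_time_reduction must be in [0, 1)")
    # Time the optimization is expected to save over the rest of the job.
    saved_s = remaining_steps * step_time_s * step_time_reduction
    # Benefit minus the time spent locating the problem.
    net_gain_s = saved_s - locating_cost_s
    return net_gain_s, net_gain_s > 0


if __name__ == "__main__":
    net, worthwhile = locating_net_gain(
        remaining_steps=100_000,
        step_time_s=2,
        step_time_reduction=0.25,
        locating_cost_s=4 * 3600,
    )
    print(f"net gain: {net:.0f} s, worthwhile: {worthwhile}")
```

With these made-up numbers, the expected saving is 50,000 s against a locating cost of 14,400 s, so the investigation pays off. If the net gain is negative, it may be better to let the task finish and investigate afterwards.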
To locate a problem that occurs in a large-scale cluster, try to reproduce it in a smaller cluster or even on a single node. This makes experiments easier and reduces the impact on production tasks. Methods include, but are not limited to, N-partitioning, single-node tests, and pre-checks.
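N-partitioning can be sketched as a repeated split-and-test search. The sketch below assumes that the problem can be attributed to individual nodes and that it still reproduces on any subset containing such a node. The `reproduces` callback is hypothetical and stands in for whatever test job you run on a group of nodes.

```python
from typing import Callable, List, Sequence


def split_into_groups(nodes: Sequence[str], n: int) -> List[List[str]]:
    """Split nodes into at most n contiguous, non-empty, near-equal groups."""
    n = min(n, len(nodes))
    base, extra = divmod(len(nodes), n)
    groups, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        groups.append(list(nodes[start:start + size]))
        start += size
    return groups


def locate_suspect_nodes(
    nodes: Sequence[str],
    reproduces: Callable[[List[str]], bool],
    n: int = 2,
) -> List[str]:
    """Narrow a node list down to the single nodes on which the problem reproduces."""
    if n < 2:
        raise ValueError("n must be at least 2")
    suspects = []
    pending = [list(nodes)]
    while pending:
        group = pending.pop()
        # Drop groups that are empty or on which the problem does not reproduce.
        if not group or not reproduces(group):
            continue
        if len(group) == 1:
            suspects.append(group[0])
        else:
            # Split the failing group again and test each part separately.
            pending.extend(split_into_groups(group, n))
    return sorted(suspects)


if __name__ == "__main__":
    all_nodes = [f"node{i:02d}" for i in range(16)]
    slow = {"node05", "node11"}  # made-up ground truth for the demo
    print(locate_suspect_nodes(all_nodes, lambda g: any(x in slow for x in g), n=4))
```

A larger `n` needs fewer rounds but runs more test jobs per round. Problems that appear only when specific nodes interact will not be isolated by this simple search.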
In practice, the initial collection uses level L1 with call stack collection disabled. In large-scale cluster scenarios, writing data directly to the shared storage can produce an excessive total volume of collected data, and if resources are not properly isolated, other jobs in the cluster may be affected. You are therefore advised to write the profiling data to local storage on each node, gather it using scripts, and transfer it to the shared storage in batches.
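The batched transfer could look like the following minimal sketch. It uses only the Python standard library. The function name, the directory layout (one uniquely named result directory per rank or host), the batch size, and the pause between batches are assumptions for illustration.

```python
import shutil
import time
from pathlib import Path


def transfer_in_batches(local_dir, shared_dir, batch_size=8, pause_s=0.0):
    """Move profiling results from local disk to shared storage in batches.

    Entries in local_dir are assumed to have unique names across nodes
    (for example, a name that includes the host or rank).
    Returns the list of destination paths that were moved.
    """
    local_dir, shared_dir = Path(local_dir), Path(shared_dir)
    shared_dir.mkdir(parents=True, exist_ok=True)
    entries = sorted(local_dir.iterdir())
    moved = []
    for start in range(0, len(entries), batch_size):
        for entry in entries[start:start + batch_size]:
            target = shared_dir / entry.name
            if target.exists():
                # Skip instead of overwriting, so the script can be rerun safely.
                continue
            shutil.move(str(entry), str(target))
            moved.append(target)
        # Pause between batches to limit pressure on the shared storage.
        if pause_s > 0 and start + batch_size < len(entries):
            time.sleep(pause_s)
    return moved
```

Running such a script after collection finishes, rather than during training, keeps shared-storage traffic away from the training job itself.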
If possible, enable dynamic profile data collection during model training, so that collection can be started while the training job is running.
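The idea of dynamic collection can be illustrated with a generic sketch in which the training loop polls a small control file on each step. This is not the profiler's own mechanism. The class, the JSON field `active_steps`, and the file name are all hypothetical.

```python
import json
from pathlib import Path


class DynamicProfileSwitch:
    """Toggle collection mid-training by polling a JSON control file (hypothetical)."""

    def __init__(self, control_file):
        self.control_file = Path(control_file)
        self._mtime = None
        self.start_step = None
        self.active_steps = 0

    def poll(self, current_step):
        """Pick up a new request if the control file was created or changed."""
        try:
            mtime = self.control_file.stat().st_mtime_ns
        except FileNotFoundError:
            return
        if mtime == self._mtime:
            return  # No new request since the last poll.
        self._mtime = mtime
        request = json.loads(self.control_file.read_text())
        # Start on the next step and stay active for the requested number of steps.
        self.start_step = current_step + 1
        self.active_steps = int(request.get("active_steps", 0))

    def should_profile(self, step):
        """Return True while step falls inside the requested collection window."""
        if self.start_step is None:
            return False
        return self.start_step <= step < self.start_step + self.active_steps


if __name__ == "__main__":
    switch = DynamicProfileSwitch("profile_request.json")  # hypothetical file name
    for step in range(100):
        switch.poll(step)
        if switch.should_profile(step):
            pass  # Start or continue collection for this step here.
```

In a real job you would replace the placeholder in the loop with calls to the profiler that you actually use.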
