Performance Analysis in Cluster Training Scenarios

Scenario

A cluster consists of multiple nodes, which are managed in a unified manner on the management page. Each node runs an independent system. In cluster scenarios, the tool collects profile data on each node, generates a PROF_XXX directory on each node, and then pre-parses and summarizes all PROF_XXX directories to OBS. You need to manually copy all PROF_XXX directories summarized by OBS to an environment where cluster data can be displayed and analyzed.
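
The manual copy step can be scripted once the summarized data has been downloaded from OBS to a local directory. The following is a minimal sketch, not part of the tool: the function name gather_prof_dirs and the per-node directory layout are assumptions for illustration, and only the PROF_ prefix comes from the description above. It copies every top-level PROF_ directory found under a download directory into a single analysis directory and refuses to overwrite a directory that already exists.

```python
import shutil
from pathlib import Path


def gather_prof_dirs(source_root, dest_root):
    """Copy every PROF_* directory under source_root into dest_root.

    Hypothetical helper: source_root is assumed to be a local download
    of the summarized data, and dest_root is assumed to lie outside it.
    """
    source_root = Path(source_root)
    dest_root = Path(dest_root)
    dest_root.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() materializes the search before any copying starts.
    for prof_dir in sorted(source_root.rglob("PROF_*")):
        if not prof_dir.is_dir():
            continue
        # Skip PROF_* directories nested inside another PROF_* directory;
        # they are copied together with their parent.
        ancestors = prof_dir.relative_to(source_root).parents
        if any(p.name.startswith("PROF_") for p in ancestors):
            continue
        target = dest_root / prof_dir.name
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        shutil.copytree(prof_dir, target)
        copied.append(target)
    return copied
```

For example, gather_prof_dirs("obs_download", "cluster_data") would leave all PROF_XXX directories side by side under cluster_data, which can then be used as the directory to analyze. Both directory names are placeholders.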

Currently, the following tool supports cluster data display and analysis: MindStudio Insight.

Profile Data Collection Process

The following figure shows the overall process of profile data collection.

Figure 1 Profile data collection process

Environment Setup

In cluster scenarios, you need to set up the environment yourself:
1. Install the required CANN software package on each node involved. For details, see the CANN Software Installation Guide.
2. Install MindStudio Insight. For details, see the MindStudio Insight User Guide.

Restrictions

In cluster scenarios, profile data can be collected from a maximum of 128 nodes. If each node is configured with eight devices, this corresponds to a maximum of 1024 devices.
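
The device count follows directly from the node limit (128 nodes × 8 devices = 1024 devices). As a rough sketch, a planned cluster size can be checked before collection starts. The names MAX_NODES and total_devices are hypothetical and not part of any tool; only the 128-node limit is taken from the restriction above.

```python
MAX_NODES = 128  # node limit stated in the restriction above


def total_devices(num_nodes, devices_per_node):
    """Return the device count for a cluster, enforcing the node limit."""
    if num_nodes < 1 or devices_per_node < 1:
        raise ValueError("node and device counts must be positive")
    if num_nodes > MAX_NODES:
        raise ValueError(
            f"{num_nodes} nodes exceeds the limit of {MAX_NODES} nodes"
        )
    return num_nodes * devices_per_node


# Largest configuration described above: 128 nodes with 8 devices each.
print(total_devices(128, 8))  # 1024
```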

Profile Data Collection

After the environment is set up, you can collect profile data in the cluster scenario as follows:

Use Profile Data Collection with MindSpore Framework APIs to collect profile data.
Use the Ascend PyTorch Profiler API to collect PyTorch profile data:
1. Set up a distributed training environment and prepare the distributed training script used after migration. For details, see "Porting Adaptation" in the PyTorch Training Model Porting and Tuning Guide.
2. Modify the training script as described in Profiling Quick Start (PyTorch Training/Online Inference), and then start distributed training to collect data.

Data Display

Profile data collected in cluster scenarios is displayed in the MindStudio Insight GUI. For details, see the MindStudio Insight User Guide.
