Cluster Iteration Analysis

Cluster Iteration Analysis summarizes the iteration performance analysis data in the training cluster scenario, including the information on the summary page and detailed data of each iteration.

MindStudio does not support data collection in the cluster scenario. You can use Merge Reports to import the parent directory of PROF_XXX to display the collected profile data.

Summary Page

When you access Cluster Analysis for the first time, the summary page is displayed, on which a maximum of 10 groups of data can be included in a bar chart.

The summary page is divided into areas 1 to 4. For details about the fields, see Table 1, Table 2, Table 3, and Table 4.

Figure 1 Iteration ID
Figure 2 Rank ID
  • When Type is set to Iteration ID, Step Trace (iteration trace data) and Collective Communication (collective communication data) are displayed. When Type is set to Rank ID, only Step Trace is displayed.
  • If you click a bar in the Step Trace area on the summary page, the detailed iteration data page for the corresponding iteration ID or rank ID is displayed.
  • The horizontal and vertical coordinates of the bar chart are described as follows:
    • When Type is set to Iteration ID, the horizontal axis lists the iteration traces of all cluster nodes, sorted by total duration in descending order from left to right by default (collective communication data is sorted by communication time in descending order). If you click a column name in the table in area 2 or 4, the bar chart is re-sorted by the values in that column. The vertical axis shows durations.
    • When Type is set to Rank ID, the horizontal axis lists all iteration traces of the current cluster node, sorted by total duration in descending order from left to right by default. If you click a column name in the table in area 2, the bar chart is re-sorted by the values in that column. The vertical axis shows iteration durations.
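The default ordering described above, combined with the Top N setting in Table 1, can be illustrated with a short sketch. This is not MindStudio code: the StepTrace record, its field names, and the top_n_by_total_time helper are hypothetical and exist only to show the descending sort and the 1 to 200 limit.

```python
from dataclasses import dataclass


@dataclass
class StepTrace:
    # Hypothetical record; the field names are assumptions, not a MindStudio schema.
    rank_id: int
    iteration_id: int
    total_time_us: float


def top_n_by_total_time(rows, n=10):
    """Return the n rows with the longest total duration, longest first."""
    # The documented Top N range is 1 to 200, with a default of 10.
    if not 1 <= n <= 200:
        raise ValueError("n must be between 1 and 200")
    return sorted(rows, key=lambda r: r.total_time_us, reverse=True)[:n]


rows = [
    StepTrace(rank_id=0, iteration_id=1, total_time_us=1200.0),
    StepTrace(rank_id=1, iteration_id=1, total_time_us=1850.0),
    StepTrace(rank_id=2, iteration_id=1, total_time_us=990.0),
]

slowest = top_n_by_total_time(rows, n=2)
print([r.rank_id for r in slowest])
```

With this ordering, the leftmost bar is the rank (or iteration) with the longest total duration, which is usually the first place to look for a straggler.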
Table 1 Fields in area 1

Field | Description
Type | Data display mode:
  • Iteration ID: When you set Type to Iteration ID and click Apply, the bar chart in the lower part displays the iteration data of all cluster nodes in the current iteration. See Figure 1.
  • Rank ID: When you set Type to Rank ID and click Apply, the bar chart in the lower part displays all iteration data of the current node. See Figure 2.
Iteration ID | Iteration ID, used for querying the iteration data of all devices in a specified iteration.
Rank ID | Rank ID, used for querying all iteration data of a specified node.
Model ID | Model ID, used for querying the iteration data of a specified model in a specified iteration or on a specified node.
Apply | Data export button. After you select an iteration ID/rank ID and a model ID and click this button, the Cluster Iteration Analysis report of the corresponding node is exported.
Step Trace | Iteration trace data.
Bar Chart | Displays the iteration duration data as a bar chart. If this option is selected, FP to BP time, Iteration Refresh, and Iteration Interval are displayed side by side.
Stack Chart | Displays the iteration duration data as a stack chart. If this option is selected, FP to BP time, Iteration Refresh, and Iteration Interval are stacked in a single bar.
Top | Top N value. Only the N data records with the longest iteration durations are displayed. The value ranges from 1 to 200. The default value is 10.

Table 2 Fields in area 2

Field | Description
Iteration ID | Iteration ID.
Rank ID | Rank ID.
FP to BP time(us) | FP/BP elapsed time (= BP End - FP Start). The unit is μs.
Iteration Refresh(us) | Iteration refresh hangover time (= Iteration End - BP End). The unit is μs.
Iteration Interval(us) | Iteration interval. The unit is μs.
Total Time(us) | Total iteration duration. The unit is μs.
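As a worked example of the two formulas in Table 2, the sketch below derives the fields from raw timestamps. The function and parameter names are hypothetical. Treating Total Time as the sum of the three components is an assumption, suggested by the stack chart stacking exactly these three values, and is not confirmed by this document.

```python
def step_trace_fields(fp_start_us, bp_end_us, iteration_end_us, iteration_interval_us):
    """Derive the Table 2 duration fields (in microseconds) from timestamps."""
    fp_to_bp = bp_end_us - fp_start_us    # FP to BP time = BP End - FP Start
    refresh = iteration_end_us - bp_end_us  # Iteration Refresh = Iteration End - BP End
    # Assumption: total duration is the sum of the three components.
    total = iteration_interval_us + fp_to_bp + refresh
    return {
        "fp_to_bp_time_us": fp_to_bp,
        "iteration_refresh_us": refresh,
        "iteration_interval_us": iteration_interval_us,
        "total_time_us": total,
    }


fields = step_trace_fields(
    fp_start_us=100.0,
    bp_end_us=850.0,
    iteration_end_us=900.0,
    iteration_interval_us=40.0,
)
print(fields)
```

In this example FP to BP time is 750 μs, Iteration Refresh is 50 μs, and, under the stated assumption, Total Time is 840 μs.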

Table 3 Fields in area 3

Field | Description
Collective Communication | Collective communication data.
Top | Top N value. Only the N data records with the longest collective communication durations are displayed. The value ranges from 1 to 200. The default value is 10.

Table 4 Fields in area 4

Field | Description
Rank ID | Rank ID.
Stage Time(us) | Stage time. The unit is μs.
Communication Time(us) | Pure communication time. The unit is μs.
Computation Time(us) | Computation time. The unit is μs.

Detailed Iteration Data Page

When you click a bar in the Step Trace area on the summary page, the detailed profile data for the specified iteration ID/rank ID is displayed, including area 1 (Timeline), area 2 (Operator Statistics), and area 3 (Computing Workload). See Figure 3.
Figure 3 Page for detailed iteration data

Area 1:

For details about timeline data, see Timeline View.

Area 2:

Operator Statistics: operator statistics.

The pie chart on the left is associated with the data in the table on the right. When you click a column header, the pie chart displays the proportion of each data item based on the actual data in the column. For details about the fields, see Table 5.
Table 5 Fields in Operator Statistics

Field | Description
Model Name | Model name. It may be left empty if no related data is collected.
OP Type | Operator type.
Core Type | Core type.
Count | Number of calls to an operator.
Total Time(us) | Time taken by the calls to an operator (μs).
Min Time(us) | Minimum time required for calling an operator (μs).
Avg Time(us) | Average time required for calling an operator (μs).
Max Time(us) | Maximum time required for calling an operator (μs).
Total Time Ratio(%) | Percentage of duration of the operator calls in the model.
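The relationship between the columns in Table 5 and the pie chart can be sketched as follows. The dictionary keys and the with_ratios helper are hypothetical. The sketch also assumes that Avg Time equals Total Time divided by Count and that Total Time Ratio is each row's share of the summed Total Time, neither of which is stated explicitly in this document.

```python
def with_ratios(rows):
    """Add average time and share-of-total columns to operator statistics rows."""
    grand_total = sum(r["total_time_us"] for r in rows)
    result = []
    for r in rows:
        ratio = 100.0 * r["total_time_us"] / grand_total if grand_total else 0.0
        result.append({
            **r,
            # Assumption: average time is total time divided by call count.
            "avg_time_us": r["total_time_us"] / r["count"],
            # Assumption: ratio is this row's share of the summed total time.
            "total_time_ratio_pct": ratio,
        })
    return result


rows = [
    {"op_type": "MatMul", "count": 4, "total_time_us": 600.0},
    {"op_type": "Add", "count": 10, "total_time_us": 300.0},
    {"op_type": "Cast", "count": 2, "total_time_us": 100.0},
]

stats = with_ratios(rows)
for s in stats:
    print(s["op_type"], s["avg_time_us"], s["total_time_ratio_pct"])
```

Clicking a different column header corresponds to running the same share calculation over that column instead of Total Time.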

Area 3:

Computing Workload: operator computing workload.

The pie chart is not associated with the table on the right. It is drawn based on the proportion of each operator type in the OP Type column in the table. This pie chart is displayed only when Profiling collects data in task-based mode. The fields displayed are related to the AI Core collection type. For details about the fields, see AI Core Metrics View.
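One possible reading of "the proportion of each operator type in the OP Type column" is the share of table rows that belong to each operator type. The sketch below illustrates that reading only. The op_type_shares helper and the sample values are hypothetical, and the tool may weight the slices differently.

```python
from collections import Counter


def op_type_shares(op_types):
    """Return the percentage of rows that belong to each operator type."""
    counts = Counter(op_types)
    total = sum(counts.values())
    return {op: 100.0 * n / total for op, n in counts.items()}


# Hypothetical OP Type column, one entry per table row.
op_type_column = ["MatMul", "MatMul", "Add", "Cast"]

shares = op_type_shares(op_type_column)
print(shares)
```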