GUI Description

Function

The Summary tab page provides the communication group identification, division, and time breakdown and analysis functions. Communication groups can be automatically identified or configured by users. Users can compare the duration of stages, computation, and communication in a communication group to analyze whether the division in the same communication group is even and whether there are slow cards and slow links, helping developers quickly identify problems.

GUI Display

The Summary tab page consists of Base Info (area 1), Parallel Strategy Analysis (area 2), and MoE Expert Load Balancing Analysis (area 3), as shown in Figure 1.

Figure 1 Summary page

Area 1: Cluster and Base Info. When cluster data is imported, you can select a cluster from the drop-down list. The basic information includes the device count, step count, report size, and profiling session duration.

Area 2: the Parallel Strategy Analysis area, including the parallel strategy overview, Computation/Communication Overview, Computing Detail (Rank ID), Communication Detail (Rank ID), and pipeline parallelism chart.

The parallel strategy overview includes parallel strategy settings, parallel strategy graphs, and expert suggestions, as shown in Figure 2. Table 1 describes the parameters for parallel strategies. After the parallel strategy is set, you can select DP + PP, DP + PP + CP, DP + PP + TP, or DP + PP + CP + TP to display the parallel strategy graph. You can select a target index in the graph to view its details. You can also right-click the target index and choose Copy Attribute from the shortcut menu to copy the details and paste them to the local PC for analysis.

If the communication time can be correctly split by communication group, the advice is displayed.

Figure 2 Parallel strategy overview

**Table 1** Parallel strategy parameters
Field	Description
Algorithm	Megatron-LM (tp-cp-ep-dp-pp): This arrangement is based on Megatron-Core. The priority order is TP > CP > DP > PP. EP spans DP and does not participate in or affect the priority (DP must be exactly divisible by EP). PP comes after DP and is mostly used for cross-node communication, which requires low bandwidth. Megatron-LM (tp-cp-pp-ep-dp): This arrangement is also based on Megatron-Core and is rarely used. The priority order is TP > CP > PP > DP. It is used in scenarios where PP requires relatively high bandwidth. MindSpeed (tp-cp-ep-dp-pp): This arrangement is based on MindSpeed-Core. The priority order is TP > CP > DP > PP. EP spans CP and DP and does not participate in or affect the priority (CP × DP must be exactly divisible by EP). MindIE-LLM (tp-dp-ep-pp-moetp): This arrangement is based on MindIE-LLM (DeepSeek V3 uses a similar arrangement). For non-MoE layers, the priority order is TP > DP > PP. For MoE layers, the priority order is MOE_TP > EP > PP. When MOE_TP = 1, an MoE EP solution that spans the same stage of PP is formed. vLLM (tp-pp-dp-ep): This arrangement is based on vLLM. The priority order is TP > PP > DP. EP spans TP and DP and does not participate in or affect the priority (TP × DP must be exactly divisible by EP). An MoE EP solution is formed (that is, EP must be exactly divisible by TP). When the vLLM (tp-pp-dp-ep) algorithm is selected, PP size × TP size × DP size must be greater than or equal to the number of imported cards.
PP Size	Pipeline parallel size. You can set it to a value ranging from 1 to 10000. Pipeline Parallel distributes different layers of a model to different cards for execution. When one card executes the current batch of data, another card can process the next batch of data.
TP Size	Tensor parallel size. You can set it to a value ranging from 1 to 10000. Tensor Parallel is a technique that divides model parameters into multiple parts and distributes them to different cards for computation.
CP Size	Context parallel size. You can set it to a value ranging from 1 to 10000. Context Parallel divides training samples into different batches based on the sequence length and dimension and allocates the batches to different cards for computation. Context Parallel (CP) splits the network input and all activations, which is an improved version of Sequence Parallelism (SP).
DP Size	Data parallel size. You can set it to a value ranging from 1 to 10000. Data Parallel divides the training dataset into multiple batches and allocates the batches to different cards for calculation.
EP Size	Expert parallel size. You can set it to a value ranging from 1 to 10000. Expert Parallel is a parallel method designed for Mixture of Experts (MoE) models. Experts are allocated to different computing cards. Each card processes some training samples, and each card can contain one or more experts. When the Megatron-LM algorithm is selected, the EP size must be less than or equal to the DP size, and the DP size must be exactly divisible by the EP size. When the MindSpeed algorithm is selected, DP × CP must be exactly divisible by EP.
MoE-TP Size	TP size of the MoE layer in the inference parallel strategy, which is different from the TP of the non-MoE layer. This parameter is available only when the MindIE-LLM (tp-dp-ep-pp-moetp) algorithm is used. You can set it to a value ranging from 1 to 10000. The values must meet both of the following requirements: (1) PP Size × TP Size × DP Size = Number of imported cards; (2) TP Size × DP Size = MoE-TP Size × EP Size.
Performance Metric	You can display the parallel strategy graph by performance metric. The available performance metrics vary according to the selected domain dimension. If None is selected, no performance metric is displayed, and the card information on the parallel strategy graph is in the default state. If you select another metric, Visible Range (μs) and the color bar of the metric are displayed next to the check box. The filter range spans the minimum and maximum values of the metric. Each card on the parallel strategy graph is filled with a color according to its value, so you can view the performance of each card.
Target Index	Enter the target index in the selected dimension to accurately locate the required number.
Advice	The built-in expert analysis function of MindStudio Insight analyzes data, provides suggestions, and lists the top 3 groups and slow cards, making it easier for developers to spot performance issues.
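
One way to read the arrangement names in the Algorithm row is as a nesting order over rank IDs. The following is a minimal sketch, not the tool's implementation. It assumes that the highest-priority dimension varies fastest as the rank ID increases, and it leaves EP out because EP does not participate in the priority. The function name `rank_coords` and the sizes used are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def rank_coords(rank: int, order: Sequence[Tuple[str, int]]) -> Dict[str, int]:
    """Decompose a rank ID into one index per parallel dimension.

    `order` lists (name, size) pairs from highest to lowest priority.
    Assumption: the highest-priority dimension varies fastest with rank ID.
    """
    world = 1
    for _, size in order:
        world *= size
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} is out of range for {world} cards")

    coords = {}
    for name, size in order:
        coords[name] = rank % size  # index inside this dimension
        rank //= size               # move on to the next, slower dimension
    return coords


# Megatron-LM (tp-cp-ep-dp-pp): TP > CP > DP > PP, EP left out of the ordering.
megatron = [("tp", 2), ("cp", 1), ("dp", 2), ("pp", 2)]

for r in range(8):
    print(r, rank_coords(r, megatron))
```

Under this assumption, ranks 0 and 1 share a TP group, and ranks 0 to 3 form the first pipeline stage. Swapping the last two entries of `megatron` models the tp-cp-pp-ep-dp arrangement.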

To ensure that the communication time can be correctly split by communication group (TP-communication time, PP-communication time, MP-communication time, DP-communication time, and CP-communication time in the performance metrics), ensure that the parallel strategy parameter value is the same as the parallel parameter configuration during actual model training or inference. You can confirm the parallel parameters with the model developers.
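
Before entering values in the GUI, it can help to check them against the constraints listed in Table 1. The sketch below encodes only the numeric constraints stated in the table (value range, EP divisibility, and the card-count relations). The class name `ParallelConfig`, the function `check_config`, and the short algorithm keys are hypothetical and are not part of MindStudio Insight.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParallelConfig:
    # Hypothetical short keys: "megatron", "mindspeed", "mindie", "vllm".
    algorithm: str
    pp: int = 1
    tp: int = 1
    cp: int = 1
    dp: int = 1
    ep: int = 1
    moe_tp: int = 1


def check_config(cfg: ParallelConfig, num_cards: int) -> List[str]:
    """Return a list of violated constraints (empty if none are violated)."""
    errors = []
    for name in ("pp", "tp", "cp", "dp", "ep", "moe_tp"):
        value = getattr(cfg, name)
        if not 1 <= value <= 10000:
            errors.append(f"{name}={value} is outside the range 1 to 10000")
    if errors:
        return errors

    if cfg.algorithm == "megatron":
        if cfg.ep > cfg.dp or cfg.dp % cfg.ep != 0:
            errors.append("EP must be <= DP and DP must be divisible by EP")
    elif cfg.algorithm == "mindspeed":
        if (cfg.cp * cfg.dp) % cfg.ep != 0:
            errors.append("CP x DP must be divisible by EP")
    elif cfg.algorithm == "mindie":
        if cfg.pp * cfg.tp * cfg.dp != num_cards:
            errors.append("PP x TP x DP must equal the number of cards")
        if cfg.tp * cfg.dp != cfg.moe_tp * cfg.ep:
            errors.append("TP x DP must equal MoE-TP x EP")
    elif cfg.algorithm == "vllm":
        if (cfg.tp * cfg.dp) % cfg.ep != 0:
            errors.append("TP x DP must be divisible by EP")
        if cfg.pp * cfg.tp * cfg.dp < num_cards:
            errors.append("PP x TP x DP must be >= the number of cards")
    else:
        errors.append(f"unknown algorithm key: {cfg.algorithm}")
    return errors


print(check_config(ParallelConfig("megatron", pp=2, tp=2, dp=4, ep=2), 16))
print(check_config(ParallelConfig("mindie", tp=2, dp=4, ep=4, moe_tp=1), 8))
```

The second call reports a violation because TP × DP is 8 while MoE-TP × EP is 4. A check like this does not replace confirming the actual values with the model developers.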

Computation/Communication Overview: displays the step duration data of the computing or communication operator in a bar chart, and the duration percentage data of the computing or communication operator in a curve. Advice provides suggestions after the computation time, communication time (not overlapped), and free time analysis of each card in the computing and communication group, allowing developers to quickly analyze the computing and communication time, as shown in Figure 3. Table 2 describes the parameters.

The curve and advice are displayed in the Computation/Communication Overview area when DP + PP + CP + TP or DP + PP + TP is selected.

Figure 3 Computation/Communication Overview

**Table 2** Computation/Communication Overview parameters
Field	Description
Step	Step ID. You can select a specific step or all steps from the drop-down list.
Rank Group	Node ID. You can select one, multiple, or all nodes from the drop-down list.
Order By	Set Order By to different dimensions. DP + PP + CP + TP and DP + PP + TP Dimension: Order By can be set to the following values. Rank ID: Rank ID. Computing(Not Overlapped): Computing(Not Overlapped) = Computing - Computing_Communication Overlapped. Computing_Communication Overlapped: communication duration overlapped by computing. Communication(Not Overlapped): Communication(Not Overlapped) = Communication - Computing_Communication Overlapped. Free: time when the device is performing neither communication nor computation. The free time here does not include the preparing time. Preparing: time from the start of a step to the running of the first computing or communication operator. During this time, operations such as data loading and copying are performed. In the Overlap Analysis unit of Timeline, this time is considered as Free. Computing Ratio: ratio of the computing time to the total time. Total time = Computing(Not Overlapped) + Computing_Communication Overlapped + Communication(Not Overlapped) + Free + Preparing. Communication Ratio: ratio of the communication time to the total time. DP + PP + CP Dimension: Order By can be set to Rank ID, Max Computing, Max Communication, Max Free, and Max Total Time (Computing + Communication (Not Overlapped) + Free). The maximum value of each parameter is the maximum value in each TP communication group. DP + PP Dimension: Order By can be set to Rank ID, Max Computing, Max Communication, Max Free, Max Total Time (Computing + Communication (Not Overlapped) + Free), and Max Communication(Not Overlapped). The maximum value of each parameter is the maximum value in the communication group of each DP + PP + CP Dimension.
Top	You can set Top to display the top N records of Order By.
Time(μs)	The vertical coordinate on the left indicates the duration, in microseconds. The calculation formula is as follows: Total = Computing(Not Overlapped) + Computing_Communication Overlapped + Communication(Not Overlapped) + Free + Preparing. Preparing indicates the data preprocessing time.
Ratio	The vertical coordinate on the right indicates the duration percentage, including the following information: Computing Ratio: Computing Ratio = Total Computing/Time. Communication Ratio: Communication Ratio = Communication(Not Overlapped)/Time.
Advice	The analysis and suggestions of slow cards based on the communication time under each parallel dimension help you to locate slow cards.
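
The formulas in Table 2 can be reproduced outside the tool to cross-check a chart. The following sketch assumes that per-rank totals are already available as plain numbers. The class `RankTimes`, the helper `top_n`, and the sample values are hypothetical. Descending order for Top is an assumption, and "Total Computing" is interpreted here as the computing time including the overlapped part.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankTimes:
    rank_id: int
    computing: float      # total computing time, in microseconds
    communication: float  # total communication time, in microseconds
    overlapped: float     # Computing_Communication Overlapped
    free: float
    preparing: float

    @property
    def computing_not_overlapped(self) -> float:
        return self.computing - self.overlapped

    @property
    def communication_not_overlapped(self) -> float:
        return self.communication - self.overlapped

    @property
    def total(self) -> float:
        return (self.computing_not_overlapped + self.overlapped
                + self.communication_not_overlapped + self.free + self.preparing)

    @property
    def computing_ratio(self) -> float:
        return self.computing / self.total

    @property
    def communication_ratio(self) -> float:
        return self.communication_not_overlapped / self.total


def top_n(ranks: List[RankTimes], order_by: str, n: int) -> List[RankTimes]:
    """Sort ranks by one attribute, largest first, and keep the first n."""
    return sorted(ranks, key=lambda r: getattr(r, order_by), reverse=True)[:n]


ranks = [
    RankTimes(0, computing=600, communication=300, overlapped=100, free=100, preparing=100),
    RankTimes(1, computing=600, communication=500, overlapped=100, free=50, preparing=100),
]

for r in top_n(ranks, "communication_not_overlapped", 1):
    print(r.rank_id, r.total, round(r.communication_ratio, 3))
```

For rank 0 in the sample, the total is 1000 μs, the computing ratio is 0.6, and the communication ratio is 0.2.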

Computing Detail (Rank ID): This area is displayed when DP + PP + CP + TP or DP + PP + TP is selected. After you click the bar chart of a node in the Computation/Communication Overview area, the total duration and usage of the accelerator core of the node are displayed. After you click Details, the details about the computing operator are displayed, as shown in Figure 4. Table 3 describes the fields. You can click the icon in the upper right corner of the table to copy the content displayed in the table and paste the content to an Excel file for analysis.

Figure 4 Computing operator details

**Table 3** Computing Detail fields
Field	Description
Accelerator Core	AI accelerator core type, including AI Core and AI CPU.
Accelerator Core Durations(μs)	Total duration of the accelerator core.
Name	Operator name.
Type	Operator type.
Start Time(ms)	Operator execution start time.
Duration(μs)	Execution duration of the current operator.
Wait Time(μs)	Waiting time for executing the operator.
Block Dim	Number of running splits, which corresponds to the number of cores during task running.
Input Shapes	Operator input shape.
Input Data Types	Input data type of the operator.
Input Formats	Input format of the operator.
Output Shapes	Operator output shape.
Output Data Types	Output data type of the operator.
Output Formats	Output format of the operator.
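
Content copied from the details table can also be analyzed with a script instead of Excel. The sketch below assumes that the copied content is tab-separated text whose first row holds the field names from Table 3. The actual clipboard format is not specified here, so check it before relying on this. The function `duration_by_type` and the sample operator names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict


def duration_by_type(pasted: str) -> Dict[str, float]:
    """Sum Duration(μs) per operator Type, largest total first.

    Assumption: `pasted` is tab-separated text with a header row.
    """
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(pasted), delimiter="\t")
    for row in reader:
        totals[row["Type"]] += float(row["Duration(μs)"])
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical sample rows standing in for pasted table content.
sample = (
    "Name\tType\tDuration(μs)\n"
    "matmul_1\tMatMul\t120.5\n"
    "add_1\tAdd\t30.0\n"
    "matmul_2\tMatMul\t79.5\n"
)

print(duration_by_type(sample))
```

Grouping by Type is only one option. The same pattern works for any other column in the table, such as Accelerator Core.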

Communication Detail (Rank ID): This area is displayed when DP + PP + CP + TP or DP + PP + TP is selected. After you click the bar chart of a node in the Computation/Communication Overview area, the total duration of the communication operator of the node (including the communication (not overlapped) duration and the communication (overlapped) duration) is displayed. After you click Details, the details about the communication operator are displayed, as shown in Figure 5. Table 4 describes the fields. You can click the icon in the upper right corner of the table to copy the content displayed in the table and paste the content to an Excel file for analysis.

If a DB scenario file is imported, the Communication Detail (Rank ID) area is not displayed.

Figure 5 Communication operator details

**Table 4** Communication Detail fields
Field	Description
Accelerator Core	AI accelerator core type, including AI Core and AI CPU.
Communication(Not Overlapped) Durations(μs)	Non-overlapped communication duration, which refers to the pure communication duration.
Communication(Overlapped) Durations(μs)	Overlapped communication duration.
Name	Communication operator name.
Type	Communication operator type.
Start Time(ms)	Start time of communication operator execution.
Duration(μs)	Execution duration of the current communication operator.
Wait Time(μs)	Waiting time for executing the communication operator.

When DP + PP + CP + TP or DP + PP + TP is selected, click a single card icon in the parallel strategy graph, and a flow is displayed. Click the pipeline parallel flow, and the pipeline parallelism chart is displayed, as shown in Figure 6.

On the pipeline parallelism chart, you can drag either side of the slider below the graph to zoom in or zoom out. You can also move the slider leftward or rightward using the mouse, or press Shift + left or right arrow key to move the parallel graph leftward or rightward.

Figure 6 Pipeline Parallelism Chart

Area 3: MoE expert load balancing analysis, which displays the expert activation heatmap and expert load balancing heatmap.

You can select Profiling or Dump for Data Version in the parameter configuration area. The two data types are statistical information of MoE models in different dimensions.

If Profiling is selected, the expert distribution heatmap is displayed. The heatmap is based on Profiling and collects statistics on the time consumed by the GroupedMatmul operator in each MoE layer. Since the GroupedMatmul operator is the core of MoE model computation, its performance directly affects how quickly experts respond.

If Dump is selected, the MoE model expert load balancing heatmap is displayed. The heatmap is based on Dump and collects statistics on the number of tokens processed by each expert in each MoE layer. You can select Dump unbalanced or Dump balanced and click the icon to import the corresponding file. The MoE model expert load balancing heatmap is then displayed, as shown in Figure 7. Table 5 describes the parameters. For details about how to collect data files of Dump unbalanced and Dump balanced, see "Features" > "Load Balancing" in MindIE LLM Development Guide.

Figure 7 MoE expert load balancing analysis

**Table 5** Parameters for MoE expert load balancing analysis
Field	Description
Model Layer Num	You can set it to a value ranging from 1 to 500.
Dense Layer List	Select one or more layers.
Expert Num	You can set it to a value ranging from 1 to 500.
Model Stage	Two phases of model inference: Prefill and Decode
Data Version	The value can be Profiling, Dump unbalanced, or Dump balanced.
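
To make the idea of expert load balance concrete, the sketch below computes a simple imbalance ratio per MoE layer from token counts. The metric (largest per-expert token count divided by the layer mean) is an illustrative assumption, not the statistic that the heatmap uses. The dump file format is not described here, so an in-memory list stands in for it. The function name `imbalance_per_layer` is hypothetical.

```python
from typing import List


def imbalance_per_layer(token_counts: List[List[int]]) -> List[float]:
    """Return max/mean token count for each layer.

    `token_counts[layer][expert]` is the number of tokens that the expert
    processed in that layer. A value of 1.0 means perfectly even load.
    """
    ratios = []
    for counts in token_counts:
        mean = sum(counts) / len(counts)
        ratios.append(max(counts) / mean if mean > 0 else 0.0)
    return ratios


# Hypothetical counts: 2 MoE layers, 4 experts each.
counts = [
    [100, 100, 100, 100],  # evenly loaded layer
    [250, 50, 50, 50],     # one expert receives most of the tokens
]

print(imbalance_per_layer(counts))
```

For the sample data, the first layer yields 1.0 and the second yields 2.5. Comparing such ratios between a Dump unbalanced file and a Dump balanced file is one way to quantify what the two heatmaps show visually.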
