GUI Description

Function

The Summary tab page identifies and divides communication groups, and breaks down and analyzes their time consumption. Communication groups can be identified automatically or configured by users. By comparing the stage, computation, and communication durations within a communication group, you can check whether the workload in the group is evenly divided and whether slow cards or slow links exist, which helps developers quickly locate problems.

GUI Display

The Summary tab page consists of Base Info (area 1), Parallel Strategy Analysis (area 2), and MoE Expert Load Balancing Analysis (area 3), as shown in Figure 1.

Figure 1 Summary page
  • Area 1: Cluster and Base Info. When cluster data is imported, you can select a cluster from the drop-down list. The basic information includes the device count, step count, report size, and profiling session duration.
  • Area 2: the Parallel Strategy Analysis area, including the parallel strategy overview, Computation/Communication Overview, Computing Detail (Rank ID), Communication Detail (Rank ID), and pipeline parallelism chart.
    • The parallel strategy overview includes parallel strategy settings, parallel strategy graphs, and expert suggestions, as shown in Figure 2. Table 1 describes the parameters for parallel strategies. After the parallel strategy is set, you can select DP + PP, DP + PP + CP, DP + PP + TP, or DP + PP + CP + TP to display the parallel strategy graph. You can select a target index in the graph to view its details. You can also right-click the target index and choose Copy Attribute from the shortcut menu to copy the details and paste them to the local PC for analysis.
      If the communication time can be correctly split by communication group, the advice is displayed.
      Figure 2 Parallel strategy overview
      Table 1 Parallel strategy parameters

      Field

      Description

      Algorithm

      • Megatron-LM (tp-cp-ep-dp-pp): This arrangement is based on Megatron-Core. The priority order is TP > CP > DP > PP. EP spans DP and does not participate in or affect the priority (requiring that DP be exactly divisible by EP). PP comes after DP and is mostly used for cross-node communication, which requires low bandwidth.
      • Megatron-LM (tp-cp-pp-ep-dp): This arrangement is also based on Megatron-Core and is rarely used. The priority order is TP > CP > PP > DP. It is used in scenarios where PP requires relatively high bandwidth.
      • MindSpeed (tp-cp-ep-dp-pp): This arrangement is based on MindSpeed-Core. The priority order is TP > CP > DP > PP. The EP spans across CP and DP and does not participate in or affect the priority (requiring that CP × DP must be exactly divisible by EP).
      • MindIE-LLM (tp-dp-ep-pp-moetp): This arrangement is based on MindIE-LLM (DeepSeek V3 uses a similar arrangement). For non-MoE layers, the priority order is TP > DP > PP. For MoE layers, the priority order is MOE_TP > EP > PP. When MOE_TP=1, an MoE EP solution that spans the same stage of PP is formed.
      • vLLM (tp-pp-dp-ep): This arrangement is based on vLLM. The priority order is TP > PP > DP. EP spans TP and DP and does not participate in or affect the priority (requiring that TP × DP be exactly divisible by EP). An MoE EP solution is formed (that is, EP must be exactly divisible by TP). When the vLLM (tp-pp-dp-ep) algorithm is selected, PP size × TP size × DP size ≥ Number of imported cards.

      PP Size

      Pipeline parallel size. You can set it to a value ranging from 1 to 10000.

      Pipeline Parallel distributes different layers of a model to different cards for execution. When one card executes the current batch of data, another card can process the next batch of data.

      TP Size

      Tensor parallel size. You can set it to a value ranging from 1 to 10000.

      Tensor Parallel is a technique that divides model parameters into multiple parts and distributes them to different cards for computation.

      CP Size

      Context parallel size. You can set it to a value ranging from 1 to 10000.

      Context Parallel divides training samples along the sequence length dimension and allocates the resulting parts to different cards for computation. Context Parallel (CP) splits the network input and all activations, and is an improved version of Sequence Parallelism (SP).

      DP Size

      Data parallel size. You can set it to a value ranging from 1 to 10000.

      Data Parallel divides the training dataset into multiple batches and allocates the batches to different cards for calculation.

      EP Size

      Expert parallel size. You can set it to a value ranging from 1 to 10000.

      Expert Parallel is a parallel method designed for Mixture of Experts (MoE) models. Experts are allocated to different computing cards. Each card processes some training samples, and each card can contain one or more experts.

      • When the Megatron-LM algorithm is selected, the EP size must be less than or equal to the DP size, and the DP size must be exactly divisible by the EP size.
      • When the MindSpeed algorithm is selected, CP × DP must be exactly divisible by EP.

      MoE-TP Size

      This parameter is available only when the MindIE-LLM (tp-dp-ep-pp-moetp) algorithm is used.

      TP size of the MoE layer in the inference parallel strategy, which is different from the TP of the non-MoE layer. You can set it to a value ranging from 1 to 10000. The values must meet the following requirements:

      • PP Size × TP Size × DP Size = Number of imported cards
      • TP Size × DP Size = MoE-TP Size × EP Size

      Performance Metric

      You can color the parallel strategy graph by a performance metric. The available performance metrics vary according to the selected dimension.

      • If None is selected, no performance metric is displayed. That is, the card information on the parallel strategy graph is in the default state.
      • If you select another metric, the Visible Range (μs) and color bar of the metric are displayed next to the check box. The filter range spans the minimum and maximum values of the metric. Each card on the parallel strategy graph is filled with a color that corresponds to its value, so you can view the performance of each card.

      Target Index

      Enter the target index in the selected dimension to accurately locate the required number.

      Advice

      The built-in expert analysis function of MindStudio Insight analyzes data, provides suggestions, and lists the top 3 groups and slow cards, making it easier for developers to spot performance issues.

      To ensure that the communication time can be correctly split by communication group (TP-communication time, PP-communication time, MP-communication time, DP-communication time, and CP-communication time in the performance metrics), ensure that the parallel strategy parameter value is the same as the parallel parameter configuration during actual model training or inference. You can confirm the parallel parameters with the model developers.
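The size constraints listed in Table 1 can be checked before the values are entered in the GUI. The following Python sketch is illustrative only and is not part of MindStudio Insight: the function names check_strategy and rank_to_coords, the algorithm keys ("megatron", "mindspeed", "mindie", "vllm"), and the dictionary layout are all assumptions. It encodes only the divisibility and card-count rules stated in Table 1 (the parenthetical EP/TP condition for vLLM is not encoded). It also assumes that a priority order such as TP > CP > DP > PP means the TP index varies fastest across consecutive rank IDs.

```python
# Hypothetical helpers; the rules mirror Table 1, nothing here is a MindStudio Insight API.
SIZE_KEYS = ("pp", "tp", "cp", "dp", "ep", "moe_tp")


def check_strategy(algorithm, sizes, num_cards):
    """Return the Table 1 constraints violated by `sizes` (empty list if none)."""
    s = {key: sizes.get(key, 1) for key in SIZE_KEYS}
    problems = [f"{key} size must be within 1..10000"
                for key, value in s.items() if not 1 <= value <= 10000]
    if problems:
        return problems  # skip divisibility checks on out-of-range values
    if algorithm == "megatron":
        if s["ep"] > s["dp"] or s["dp"] % s["ep"] != 0:
            problems.append("EP must be <= DP and DP must be divisible by EP")
    elif algorithm == "mindspeed":
        if (s["cp"] * s["dp"]) % s["ep"] != 0:
            problems.append("CP x DP must be divisible by EP")
    elif algorithm == "mindie":
        if s["pp"] * s["tp"] * s["dp"] != num_cards:
            problems.append("PP x TP x DP must equal the number of imported cards")
        if s["tp"] * s["dp"] != s["moe_tp"] * s["ep"]:
            problems.append("TP x DP must equal MoE-TP x EP")
    elif algorithm == "vllm":
        if (s["tp"] * s["dp"]) % s["ep"] != 0:
            problems.append("TP x DP must be divisible by EP")
        if s["pp"] * s["tp"] * s["dp"] < num_cards:
            problems.append("PP x TP x DP must be >= the number of imported cards")
    else:
        raise ValueError(f"unknown algorithm key: {algorithm}")
    return problems


def rank_to_coords(rank, order, sizes):
    """Split a rank ID into per-dimension indices; `order` lists the
    fastest-varying (highest-priority) dimension first (an assumption)."""
    coords = {}
    for dim in order:
        coords[dim] = rank % sizes[dim]
        rank //= sizes[dim]
    return coords
```

Under this assumed ordering, with TP = 2, CP = 1, DP = 2, and PP = 2, rank 5 maps to TP index 1, CP index 0, DP index 0, and PP index 1, so ranks 4 and 5 would form one TP group.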

    • Computation/Communication Overview: displays the step duration data of the computing or communication operator in a bar chart, and the duration percentage data of the computing or communication operator in a curve. Advice provides suggestions after the computation time, communication time (not overlapped), and free time analysis of each card in the computing and communication group, allowing developers to quickly analyze the computing and communication time, as shown in Figure 3. Table 2 describes the parameters.
      The curve and advice are displayed in the Computation/Communication Overview area when DP + PP + CP + TP or DP + PP + TP is selected.
      Figure 3 Computation/Communication Overview
      Table 2 Computation/Communication Overview parameters

      Field

      Description

      Step

      Step ID. You can select a specific step or all steps from the drop-down list.

      Rank Group

      Node ID. You can select one, multiple, or all nodes from the drop-down list.

      Order By

      Sorting field. The available options depend on the selected dimension.

      • DP + PP + CP + TP and DP + PP + TP
        • Rank ID: Rank ID.
        • Computing(Not Overlapped): Computing(Not Overlapped) = Computing - Computing_Communication Overlapped.
        • Computing_Communication Overlapped: communication duration overlapped by computing.
        • Communication(Not Overlapped): Communication(Not Overlapped) = Communication - Computing_Communication Overlapped.
        • Free: time when the device is neither communicating nor computing. The free time here does not include the preparing time.
        • Preparing: time from the start of a step to the running of the first computing or communication operator. During this time, operations such as data loading and copying are performed. In the Overlap Analysis unit of Timeline, the time is considered as Free.
        • Computing Ratio: ratio of the computing time to the total time. Total time = Computing(Not Overlapped) + Computing_Communication Overlapped + Communication(Not Overlapped) + Free + Preparing.
        • Communication Ratio: ratio of the communication time to the total time.
      • DP + PP + CP Dimension: Order By can be set to Rank ID, Max Computing, Max Communication, Max Free, and Max Total Time (Computing + Communication (Not Overlapped) + Free). The maximum value of each parameter is the maximum value in each TP communication group.
      • DP + PP Dimension: Order By can be set to Rank ID, Max Computing, Max Communication, Max Free, Max Total Time (Computing + Communication (Not Overlapped) + Free), and Max Communication(Not Overlapped). The maximum value of each parameter is the maximum value in the communication group of each DP + PP + CP Dimension.

      Top

      You can set Top to display the top N records of Order By.

      Time(μs)

      The vertical coordinate on the left indicates the duration, in microseconds. The calculation formula is as follows:

      Total = Computing(Not Overlapped) + Computing_Communication Overlapped + Communication(Not Overlapped) + Free + Preparing. Preparing indicates the data preprocessing time.

      Ratio

      The vertical coordinate on the right indicates the duration percentage, including the following information:

      • Computing Ratio: Computing Ratio = Total Computing/Time.
      • Communication Ratio: Communication Ratio = Communication(Not Overlapped)/Time.

      Advice

      Analysis of and suggestions about slow cards, based on the communication time in each parallel dimension, which help you locate slow cards.
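The Time and Ratio formulas in Table 2 can be reproduced on per-card numbers outside the tool. The following Python sketch is a minimal illustration, not the tool's implementation: the RankTime record, its field names, and the top_n helper are hypothetical. It reads "Total Computing" as Computing(Not Overlapped) + Computing_Communication Overlapped, and it assumes that Top lists the largest values first.

```python
from dataclasses import dataclass


@dataclass
class RankTime:
    """Per-card time breakdown in microseconds (hypothetical record layout)."""
    rank_id: int
    computing_not_overlapped: float
    overlapped: float  # Computing_Communication Overlapped
    communication_not_overlapped: float
    free: float
    preparing: float

    @property
    def total(self):
        # Total = Computing(Not Overlapped) + Computing_Communication Overlapped
        #         + Communication(Not Overlapped) + Free + Preparing
        return (self.computing_not_overlapped + self.overlapped
                + self.communication_not_overlapped + self.free + self.preparing)

    @property
    def computing_ratio(self):
        # Assumed reading of "Total Computing / Time".
        return (self.computing_not_overlapped + self.overlapped) / self.total

    @property
    def communication_ratio(self):
        return self.communication_not_overlapped / self.total


def top_n(records, order_by, n):
    """Return the n records with the largest `order_by` attribute (assumed descending)."""
    return sorted(records, key=lambda r: getattr(r, order_by), reverse=True)[:n]
```

For example, a card with 600 μs of non-overlapped computing, 100 μs overlapped, 200 μs of non-overlapped communication, 50 μs free, and 50 μs preparing has a total of 1000 μs, a computing ratio of 0.7, and a communication ratio of 0.2.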

    • Computing Detail (Rank ID): This area is displayed when DP + PP + CP + TP or DP + PP + TP is selected. After you click the bar chart of a node in the Computation/Communication Overview area, the total duration and usage of the accelerator core of the node are displayed. After you click Details, the details about the computing operator are displayed, as shown in Figure 4. Table 3 describes the fields. You can click the icon in the upper right corner of the table to copy the content displayed in the table and paste the content to an Excel file for analysis.
      Figure 4 Computing operator details
      Table 3 Computing Detail fields

      Field                           Description
      ------------------------------  ----------------------------------------
      Accelerator Core                AI accelerator core type, including AI Core and AI CPU.
      Accelerator Core Durations(μs)  Total duration of the accelerator core.
      Name                            Operator name.
      Type                            Operator type.
      Start Time(ms)                  Operator execution start time.
      Duration(μs)                    Execution duration of the current operator.
      Wait Time(μs)                   Waiting time for executing the operator.
      Block Dim                       Number of running splits, which corresponds to the number of cores during task running.
      Input Shapes                    Operator input shape.
      Input Data Types                Input data type of the operator.
      Input Formats                   Input format of the operator.
      Output Shapes                   Operator output shape.
      Output Data Types               Output data type of the operator.
      Output Formats                  Output format of the operator.

    • Communication Detail (Rank ID): This area is displayed when DP + PP + CP + TP or DP + PP + TP is selected. After you click the bar chart of a node in the Computation/Communication Overview area, the total duration of the communication operator of the node (including the communication (not overlapped) duration and the communication (overlapped) duration) is displayed. After you click Details, the details about the communication operator are displayed, as shown in Figure 5. Table 4 describes the fields. You can click the icon in the upper right corner of the table to copy the content displayed in the table and paste the content to an Excel file for analysis.

      If a DB scenario file is imported, the Communication Detail (Rank ID) area is not displayed.

      Figure 5 Communication operator details
      Table 4 Communication Detail fields

      Field                                        Description
      -------------------------------------------  ----------------------------------------
      Accelerator Core                             AI accelerator core type, including AI Core and AI CPU.
      Communication(Not Overlapped) Durations(μs)  Non-overlapped communication duration, which refers to the pure communication duration.
      Communication(Overlapped) Durations(μs)      Overlapped communication duration.
      Name                                         Communication operator name.
      Type                                         Communication operator type.
      Start Time(ms)                               Start time of communication operator execution.
      Duration(μs)                                 Execution duration of the current communication operator.
      Wait Time(μs)                                Waiting time for executing the communication operator.

    • When DP + PP + CP + TP or DP + PP + TP is selected, click a single card icon in the parallel strategy graph, and a flow is displayed. Click the pipeline parallel flow, and the pipeline parallelism chart is displayed, as shown in Figure 6.
      On the pipeline parallelism chart, you can drag either side of the slider below the graph to zoom in or zoom out. You can also move the slider leftward or rightward using the mouse, or press Shift + left or right arrow key to move the parallel graph leftward or rightward.
      Figure 6 Pipeline Parallelism Chart
  • Area 3: MoE expert load balancing analysis, which displays the expert activation heatmap and expert load balancing heatmap.

    You can set Data Version to Profiling or Dump in the parameter configuration area. The two data types provide statistics on the MoE model in different dimensions.

    If Profiling is selected, the expert distribution heatmap is displayed. The heatmap is based on Profiling and collects statistics on the time consumed by the GroupedMatmul operator in each MoE layer. Since the GroupedMatmul operator is the core of MoE model computation, its performance directly affects how quickly experts respond.

    If Dump is selected, the MoE model expert load balancing heatmap is displayed. The heatmap is based on Dump and collects statistics on the number of tokens processed by each expert in each MoE layer. You can select Dump unbalanced or Dump balanced and import the corresponding file. The MoE model expert load balancing heatmap is then displayed, as shown in Figure 7. Table 5 describes the parameters. For details about how to collect data files of Dump unbalanced and Dump balanced, see "Features" > "Load Balancing" in MindIE LLM Development Guide.
    Figure 7 MoE expert load balancing analysis
    Table 5 Parameters for MoE expert load balancing analysis

    Field             Description
    ----------------  ----------------------------------------
    Model Layer Num   You can set it to a value ranging from 1 to 500.
    Dense Layer List  Select one or more layers.
    Expert Num        You can set it to a value ranging from 1 to 500.
    Model Stage       Two phases of model inference: Prefill and Decode.
    Data Version      The value can be Profiling, Dump unbalanced, or Dump balanced.
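The Dump heatmap is built from the number of tokens that each expert processes in each MoE layer. As a rough illustration of how such counts can expose imbalance, the following Python sketch computes, for each layer, the ratio of the busiest expert's token count to the mean token count. The input layout (one list of per-expert token counts per MoE layer), the function name imbalance_per_layer, and the max-to-mean metric are assumptions for illustration; this document does not specify the Dump file format or the metric behind the heatmap.

```python
def imbalance_per_layer(token_counts):
    """Return max/mean token count per MoE layer.

    `token_counts` is a hypothetical layout: token_counts[layer][expert] is the
    number of tokens routed to that expert. A ratio of 1.0 means the layer is
    perfectly balanced; larger values mean one expert is overloaded.
    """
    ratios = []
    for counts in token_counts:
        mean = sum(counts) / len(counts)
        # A layer that received no tokens is reported as 0.0 rather than dividing by zero.
        ratios.append(max(counts) / mean if mean > 0 else 0.0)
    return ratios
```

With four experts per layer, a layer that routes 10 tokens to every expert yields 1.0, and a layer that routes all 40 tokens to a single expert yields 4.0.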