GUI Description

Function

The Communication tab page displays network link performance across the cluster and the communication performance of all nodes. By analyzing how long cluster communication overlaps with computation, you can identify slow hosts or slow nodes in cluster training.

GUI Display

The Communication tab page displays cluster communication performance from two perspectives: network-wide links and individual nodes. The data is presented in two parts: Communication Matrix and Communication Duration Analysis.

Communication Matrix

The Communication Matrix option displays information about the communication operators in a specified step and communication group, including the bandwidth, communication duration, transit size, and link mode, as shown in Figure 1. Table 1 describes the parameters.
Figure 1 Communication Matrix
Table 1 Communication Matrix fields

Field

Description

Cluster

Cluster name. You can select a cluster from the drop-down list when importing cluster data.

Step

Step ID. You can select a step from the drop-down list.

Communication Group

Communication group. You can select one, multiple, or all nodes from the drop-down list. The nodes are displayed on the vertical coordinate.

Operator Name

Communication operator name. You can select Total Op info or a type of operator from the drop-down list.

Communication operator data is grouped by type, for example "allreduce-total", for easier viewing. After you select a top, middle, or bottom type, click a cell in the communication matrix heatmap to see the specific communication operator name.

  • Total Op info: data of all communication operators in the selected Step and communication group.
  • total: average bandwidth of the communication operators (total transmission volume of a type of communication operators/total transmission time). You are advised to view this type first.
  • top: communication operators with the highest bandwidth. Top N indicates the N highest bandwidths.
  • middle: communication operators with the median bandwidth.
  • bottom: communication operators with the lowest bandwidth. Bottom N indicates the N lowest bandwidths.

Matrix Model

Communication matrix heatmap.

Communication Matrix Type

Communication matrix type.

  • Bandwidth (GB/s): link bandwidth, in GB/s.
  • Transit Size(MB): amount of data transferred, in MB.
  • Transport Type: link type.
  • Transit Time(ms): communication duration, in milliseconds.

Show Inner Communication

Whether to display communication data within a card. This option is not selected by default.

Visible Range

Visible range of data.

By default, all data is displayed. You can manually set the data display range.

Src Rank Id

Source Rank ID. The horizontal coordinate is the ID of the source card in the link information.

Dst Rank Id

Destination Rank ID. The vertical coordinate is the ID of the destination card in the link information.
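
As an illustration of the Operator Name groupings in Table 1, the following minimal Python sketch computes the total (average) bandwidth of one operator type and picks the top, bottom, and median entries. The operator names, sizes, and times are hypothetical sample data, not tool output, and the sketch assumes decimal units so that MB divided by ms gives GB/s. It is not the tool's implementation.

```python
from statistics import median

# Hypothetical records: (operator name, transit size in MB, transit time in ms).
ops = [
    ("allreduce_0", 80.0, 8.0),
    ("allreduce_1", 60.0, 4.0),
    ("allreduce_2", 20.0, 4.0),
    ("allreduce_3", 48.0, 4.0),
]


def bandwidth_gb_s(size_mb, time_ms):
    # MB/ms equals GB/s when 1 GB = 1000 MB (assumption about the units).
    return size_mb / time_ms


def summarize(ops, n=1):
    total_size = sum(size for _, size, _ in ops)
    total_time = sum(time for _, _, time in ops)
    # Rank operators from highest to lowest bandwidth.
    ranked = sorted(ops, key=lambda op: bandwidth_gb_s(op[1], op[2]), reverse=True)
    return {
        # total: total transmission volume / total transmission time.
        "total": total_size / total_time,
        "top": [name for name, _, _ in ranked[:n]],
        "bottom": [name for name, _, _ in ranked[-n:]],
        "median_bandwidth": median(bandwidth_gb_s(s, t) for _, s, t in ops),
    }


summary = summarize(ops, n=1)
print(summary)
```

With this sample data, the total bandwidth is 10.4 GB/s, the top operator is allreduce_1, the bottom operator is allreduce_2, and the median bandwidth is 11.0 GB/s. Note that the total is a volume-weighted average, so it differs from a plain mean of the per-operator bandwidths.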

Communication Duration Analysis

The Communication Duration Analysis option displays the communication performance of a node, including the communication operator thumbnail, communication duration, data analysis, and advice, as shown in Figure 2. Table 2 describes the parameters.

Figure 2 Communication duration analysis
Table 2 Communication Duration Analysis fields

Field

Description

Cluster

Cluster name. You can select a cluster from the drop-down list when importing cluster data.

Step

(Mandatory) Step ID. You can select a step from the drop-down list.

Communication Group

(Mandatory) Communication group. You can select or search for one, multiple, or all nodes from the drop-down list. The nodes are displayed on the vertical coordinate.

Operator Name

(Mandatory) Communication operator name. You can select Total Op Info or a specific operator from the drop-down list. Total Op Info indicates the sum of all communication operator data in the selected Step and communication group.

Communication Matrix

(Mandatory) Communication matrix. Either this parameter or Communication Duration Analysis must be set.

Communication Duration Analysis

(Mandatory) Communication duration analysis. Either this parameter or Communication Matrix must be set.

Communication

Execution sequence and time of communication operators. The slow card information is displayed below the thumbnail. For details, see Quickly Analyzing and Locating Abnormal Communication Operators.

Rank ID

Vertical coordinate in the communication operator thumbnail, which indicates Rank ID.

Time(ms)

Horizontal coordinate of the communication operator thumbnail, which indicates the operator running time, in milliseconds.

Visualized Communication Time

Visualized communication duration chart.

Time(ms)

Vertical coordinate on the left of the communication duration chart, which indicates the duration, in milliseconds.

Ratio

Vertical coordinate on the right of the communication duration chart, which indicates the duration percentage.

Data Analysis of Communication Time

Data analysis of operator communication duration. Move the pointer over the table and click to copy its content, then paste the content into an Excel file for analysis.

Rank ID

Rank ID.

Elapsed Time(ms)

Total time of all communication operator events.

Transit Time(ms)

Communication duration of the communication operator, calculated as the total duration of its communication on SDMA links (communication within a server) and RDMA links (communication between servers). An excessively long communication duration may indicate a faulty link.

Synchronization Time(ms)

Synchronization duration, that is, the time nodes spend synchronizing before the first communication between cards. Use this parameter to determine whether a long waiting time is caused by a slow node or a slow link.

Wait Time(ms)

Waiting duration. Two nodes synchronize before they communicate with each other, so that communication is established only after both nodes are synchronized.

Synchronization Time Ratio

Synchronization duration ratio.

Synchronization Time Ratio = Synchronization Time/(Synchronization Time + Transit Time). A larger synchronization duration ratio before communication indicates a lower communication efficiency, which may be caused by slow cards.

Wait Time Ratio

Wait time ratio of a communication operator.

Wait Time Ratio = Wait Time/(Wait Time + Transit Time). A larger wait time ratio indicates that the wait time of the node accounts for a larger proportion of the total communication duration, and the communication efficiency is lower.

Idle Time(ms)

Duration for communication operator delivery.

Idle Time = Elapsed Time - Transit Time - Wait Time

SDMA BW(GB)

SDMA bandwidth.

RDMA BW(GB)

RDMA bandwidth.

Bandwidth Analysis

Bandwidth analysis.

Click See more to view the bandwidth details of the specified operator on the corresponding node, as shown in Figure 3.

Communication Operators Details

Details about the communication operator. This parameter is displayed only when Operator Name is set to Total Op info.

Click See more to view the link details of the communication operator on the corresponding node, as shown in Figure 4.

Advice

Advice generated from analysis of the imported data. It covers bandwidth, byte alignment, communication retransmission, and communication packets, and provides suggestions, as shown in Figure 5.
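
The ratio and idle-time fields in Table 2 can be reproduced by hand from the four duration columns. The following is a minimal Python sketch, not the tool's implementation. The function name and the sample values are hypothetical, and the sketch assumes that Idle Time is obtained by subtracting Transit Time and Wait Time from Elapsed Time.

```python
def communication_metrics(elapsed_ms, transit_ms, sync_ms, wait_ms):
    """Derive the ratio and idle-time fields from the duration columns (all in ms)."""

    def ratio(part_ms, transit):
        # Ratio = part / (part + transit); return 0.0 when both are zero.
        denom = part_ms + transit
        return part_ms / denom if denom > 0 else 0.0

    return {
        "synchronization_time_ratio": ratio(sync_ms, transit_ms),
        "wait_time_ratio": ratio(wait_ms, transit_ms),
        # Assumed formula: Idle Time = Elapsed Time - Transit Time - Wait Time.
        "idle_time_ms": elapsed_ms - transit_ms - wait_ms,
    }


# Hypothetical values for one rank, in milliseconds.
metrics = communication_metrics(
    elapsed_ms=100.0, transit_ms=60.0, sync_ms=20.0, wait_ms=30.0
)
print(metrics)
```

For these sample values the synchronization time ratio is 0.25, the wait time ratio is about 0.33, and the idle time is 10 ms. Comparing the ratios across ranks is one way to spot a rank that spends noticeably more time waiting than transferring.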

Figure 3 Bandwidth analysis

The bandwidth analysis page displays the communication performance of network-wide links, including the communication duration, traffic, bandwidth, and link type. Table 3 describes the fields on the bandwidth analysis page.

Table 3 Bandwidth analysis fields

Field

Description

Packet Number

Number of communication packets.

Packet Size(MB)

Size of a communication packet.

Transport Type

Link mode.

SDMA

SDMA links (communication links between devices in a node), including HCCS, PCIe, and SIO links.

RDMA

RDMA links (inter-node device communication links).

Transit Size(MB)

Size of communication packets within one communication process.

Transit Time(ms)

Duration of one communication process.

Bandwidth(GB/s)

Bandwidth. The bandwidth equals the traffic divided by the communication duration.

Empirical bandwidth reference values: RDMA_Bandwidth = 12.5; HCCS_Bandwidth = 18; and PCIe_Bandwidth = 20.

Large Packet Ratio

Ratio of large communication packets, that is, the proportion of packets that are large enough for the communication link to reach the empirical bandwidth.
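
The bandwidth and large-packet fields in Table 3 can be illustrated with a short Python sketch. It is a hedged example under stated assumptions: the reference values are taken to be in GB/s (the unit of the Bandwidth field), MB divided by ms is treated as GB/s using decimal units, and the 0.8 tolerance and the packet-size threshold are hypothetical parameters that do not come from the tool.

```python
# Empirical reference values from Table 3, assumed here to be in GB/s.
REFERENCE_GB_S = {"RDMA": 12.5, "HCCS": 18.0, "PCIe": 20.0}


def link_bandwidth(transit_size_mb, transit_time_ms):
    # Bandwidth = traffic / communication duration.
    # MB/ms equals GB/s when 1 GB = 1000 MB (assumption).
    return transit_size_mb / transit_time_ms


def below_reference(link_type, bandwidth_gb_s, fraction=0.8):
    # Flag a link whose bandwidth falls under a hypothetical fraction
    # of the empirical reference value for its link type.
    return bandwidth_gb_s < fraction * REFERENCE_GB_S[link_type]


def large_packet_ratio(packet_sizes_mb, threshold_mb):
    # Share of packets at or above a hypothetical "large packet" threshold.
    if not packet_sizes_mb:
        return 0.0
    large = sum(1 for size in packet_sizes_mb if size >= threshold_mb)
    return large / len(packet_sizes_mb)


bw = link_bandwidth(transit_size_mb=200.0, transit_time_ms=25.0)
print(bw, below_reference("RDMA", bw))
print(large_packet_ratio([0.5, 2.0, 4.0, 0.1], threshold_mb=1.0))
```

In this sample, 200 MB transferred in 25 ms gives 8 GB/s, which the sketch flags as below the RDMA reference, and half of the packets count as large.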

Figure 4 Communication operator details

The communication operator details page displays communication performance by operator, including the communication duration, waiting duration, and synchronization duration of each communication operator. Table 4 describes the fields in the figure.

Table 4 Communication operator detail fields

Field

Description

Operator Name

Communication operator name.

Elapsed Time(ms)

Total duration of all events of the communication operators, in milliseconds.

Transit Time(ms)

Communication duration, in milliseconds. The communication duration is calculated based on the total duration of the communication operators of the SDMA and RDMA links.

Synchronization Time(ms)

Synchronization duration, in milliseconds. It is the waiting time before the first data transmission.

Wait Time(ms)

Waiting duration, in milliseconds. Synchronization is performed before two logical cards communicate with each other.

Synchronization Time Ratio

Synchronization duration ratio. The calculation formula is Synchronization Time/(Synchronization Time + Transit Time).

Wait Time Ratio

Waiting duration ratio. The calculation formula is Wait Time/(Wait Time + Transit Time).

Idle Time(ms)

Duration for communication operator delivery.

Idle Time = Elapsed Time - Transit Time - Wait Time

SDMA BW(GB)

SDMA bandwidth.

RDMA BW(GB)

RDMA bandwidth.

Operation

Click Show in Timeline to view the corresponding communication operator on the timeline page.

Click Show in Thumbnail to view the operator in the communication operator thumbnail.

The advice provides data analysis (bandwidth description, byte alignment analysis, communication retransmission analysis, and communication packet analysis) together with suggestions. Based on the advice, you can further locate the slow card and the specific operator in the parallel strategy analysis on the overview page.

  • Bandwidth description: displays the maximum, minimum, and average values of the SDMA and RDMA bandwidths and the differences between the maximum and minimum values in the overview, SDMA, and RDMA dimensions, helping developers quickly identify exceptions.
  • Byte, retransmission, and packet analysis: collects statistics on the byte alignment, communication retransmission, communication packet, and communication bandwidth preemption data of communication operators, and provides suggestions for developers.
Figure 5 Advice
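
The bandwidth description in the advice (maximum, minimum, average, and the difference between maximum and minimum) can be approximated from per-rank bandwidth values. The following Python sketch is illustrative only: the per-rank numbers and the `describe` helper are hypothetical and are not produced by the tool.

```python
from statistics import mean

# Hypothetical per-rank bandwidths, in the units shown by the tool.
ranks = {
    0: {"sdma": 18.0, "rdma": 12.0},
    1: {"sdma": 16.0, "rdma": 11.0},
    2: {"sdma": 17.0, "rdma": 12.5},
    3: {"sdma": 9.0, "rdma": 12.5},
}


def describe(values):
    # Summarize one bandwidth dimension across all ranks.
    values = list(values)
    return {
        "max": max(values),
        "min": min(values),
        "avg": mean(values),
        "diff": max(values) - min(values),
    }


sdma = describe(r["sdma"] for r in ranks.values())
rdma = describe(r["rdma"] for r in ranks.values())

# A large max-min difference suggests an outlier; find the rank behind it.
slowest_sdma_rank = min(ranks, key=lambda rank_id: ranks[rank_id]["sdma"])
print(sdma, rdma, slowest_sdma_rank)
```

In this sample the SDMA difference (9.0) is much larger than the RDMA difference (1.5), and rank 3 has the lowest SDMA bandwidth, so it would be the first candidate to inspect.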