GUI Description
Function
The Communication tab page displays the network link performance across the cluster and the communication performance of all nodes. By analyzing how much cluster communication overlaps with computation, you can identify slow hosts or slow nodes in cluster training.
GUI Display
The Communication tab page displays cluster communication performance from two dimensions: network-wide link display and node-based display. The data is displayed in two parts: Communication Matrix and Communication Duration Analysis.
Communication Matrix

| Field | Description |
|---|---|
| Cluster | Cluster name. You can select a cluster from the drop-down list when importing cluster data. |
| Step | Step ID. You can select a step from the drop-down list. |
| Communication Group | Communication group. You can select one, multiple, or all nodes from the drop-down list. The nodes are displayed on the vertical coordinate. |
| Operator Name | Communication operator name. You can select Total Op info or a type of operator from the drop-down list. Communication operator data is grouped by type, such as "allreduce-total", for easier viewing. When you select the top, bottom, or middle type, move the pointer to the communication matrix heatmap and click a cell to see the specific communication operator name. |
| Matrix Model | Communication matrix heatmap. |
| Communication Matrix Type | Communication matrix type. |
| Show Inner Communication | Whether to show communication data within a card. This option is not selected by default. |
| Visible Range | Visible range of data. By default, all data is displayed. You can manually set the data display range. |
| Src Rank Id | Source Rank ID. The horizontal coordinate is the ID of the source card in the link information. |
| Dst Rank Id | Destination Rank ID. The vertical coordinate is the ID of the destination card in the link information. |

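
The layout of the heatmap can be illustrated with a minimal sketch, which is not the tool's implementation. It assumes that link records are available as hypothetical `(src_rank, dst_rank, value)` tuples, and that "inner" communication means a record whose source and destination are the same card. The function name `build_comm_matrix` and the `show_inner` flag are illustrative only.

```python
def build_comm_matrix(links, show_inner=False):
    """Aggregate (src_rank, dst_rank, value) records into a dst-by-src grid.

    Rows follow the vertical axis (Dst Rank Id) and columns follow the
    horizontal axis (Src Rank Id), matching the heatmap description.
    """
    ranks = sorted({rank for src, dst, _ in links for rank in (src, dst)})
    matrix = {dst: {src: 0.0 for src in ranks} for dst in ranks}
    for src, dst, value in links:
        # Assumption: "inner" communication means src and dst are the same card.
        if src == dst and not show_inner:
            continue
        matrix[dst][src] += value
    return ranks, matrix


# Hypothetical link records: (source rank, destination rank, value).
links = [(0, 1, 2.0), (1, 0, 3.0), (0, 0, 5.0)]
ranks, matrix = build_comm_matrix(links)
for dst in ranks:
    print(dst, [matrix[dst][src] for src in ranks])
```

With `show_inner=False` the diagonal cells stay empty, which mirrors the default state in which Show Inner Communication is not selected.
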
Communication Duration Analysis
The Communication Duration Analysis option displays the communication performance of a node, including the communication operator thumbnail, communication duration, data analysis, and advice, as shown in Figure 2. Table 2 describes the parameters.

| Field | Description |
|---|---|
| Cluster | Cluster name. You can select a cluster from the drop-down list when importing cluster data. |
| Step | (Mandatory) Step ID. You can select a step from the drop-down list. |
| Communication Group | (Mandatory) Communication group. You can select or search for one, multiple, or all nodes from the drop-down list. The nodes are displayed on the vertical coordinate. |
| Operator Name | (Mandatory) Communication operator name. You can select Total Op Info or a specific operator from the drop-down list. Total Op Info indicates the sum of all communication operator data in the selected Step and communication group. |
| Communication Matrix | (Mandatory) Communication matrix. Either this parameter or Communication Duration Analysis must be set. |
| Communication Duration Analysis | (Mandatory) Communication duration analysis. Either this parameter or Communication Matrix must be set. |
| Communication | Execution sequence and time of communication operators. The slow card information is displayed below the thumbnail. For details, see Quickly Analyzing and Locating Abnormal Communication Operators. |
| Rank ID | Vertical coordinate in the communication operator thumbnail, which indicates Rank ID. |
| Time(ms) | Horizontal coordinate of the communication operator thumbnail, which indicates the operator running time, in milliseconds. |
| Visualized Communication Time | Visualized communication duration chart. |
| Time(ms) | Vertical coordinate on the left of the communication duration chart, which indicates the duration, in milliseconds. |
| Ratio | Vertical coordinate on the right of the communication duration chart, which indicates the duration percentage. |
| Data Analysis of Communication Time | Operator communication duration data analysis. Move the cursor to the table and click |
| Rank ID | Rank ID. |
| Elapsed Time(ms) | Total time of all communication operator events. |
| Transit Time(ms) | Communication duration of the communication operator, that is, the total duration of communication operators on SDMA links (communication within a server) and RDMA links (communication between servers). If the communication duration is too long, a link may be faulty. |
| Synchronization Time(ms) | Synchronization duration, which is required for synchronization between nodes before the first communication between cards. This parameter is used to determine whether a long waiting time is caused by a slow node or a slow link. |
| Wait Time(ms) | Waiting duration. Synchronization is performed before two nodes communicate with each other to ensure that communication is established after the two nodes are synchronized. |
| Synchronization Time Ratio | Synchronization duration ratio. Synchronization Time Ratio = Synchronization Time/(Synchronization Time + Transit Time). A larger synchronization duration ratio before communication indicates lower communication efficiency, which may be caused by slow cards. |
| Wait Time Ratio | Wait time ratio of a communication operator. Wait Time Ratio = Wait Time/(Wait Time + Transit Time). A larger wait time ratio indicates that the wait time of the node accounts for a larger proportion of the total communication duration, and that the communication efficiency is lower. |
| Idle Time(ms) | Duration for communication operator delivery. Idle Time = Elapsed Time – Transit Time – Wait Time. |
| SDMA BW(GB) | SDMA bandwidth. |
| RDMA BW(GB) | RDMA bandwidth. |
| Bandwidth Analysis | Bandwidth analysis. Click See more to view the bandwidth details of the specified operator on the corresponding node, as shown in Figure 3. |
| Communication Operators Details | Details about the communication operator. This parameter is displayed only when Operator Name is set to Total Op info. Click See more to view the link details of the communication operator on the corresponding node, as shown in Figure 4. |
| Advice | Advice on the imported data upon analysis. It analyzes the bandwidth, byte alignment, communication retransmission, and communication packets, and provides suggestions, as shown in Figure 5. |

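
The derived fields in the table (Synchronization Time Ratio, Wait Time Ratio, and Idle Time) can be recomputed from the raw durations. The following is a minimal sketch of those three formulas as stated above. The function names are hypothetical, and returning 0.0 when the denominator is zero is an assumption made here, not documented tool behavior.

```python
def time_ratio(part_ms, transit_ms):
    """Return part / (part + transit), as used for both ratio fields."""
    total = part_ms + transit_ms
    # Assumption: report 0.0 when there is no time at all, to avoid dividing by zero.
    return part_ms / total if total > 0 else 0.0


def derive_fields(elapsed_ms, transit_ms, sync_ms, wait_ms):
    """Compute the derived columns for one rank from its raw durations."""
    return {
        "sync_time_ratio": time_ratio(sync_ms, transit_ms),
        "wait_time_ratio": time_ratio(wait_ms, transit_ms),
        # Idle Time = Elapsed Time - Transit Time - Wait Time
        "idle_ms": elapsed_ms - transit_ms - wait_ms,
    }


# Hypothetical values for one rank, in milliseconds.
fields = derive_fields(elapsed_ms=10.0, transit_ms=6.0, sync_ms=2.0, wait_ms=3.0)
print(fields)
```

Comparing `wait_time_ratio` across ranks in the same communication group is one way to see which ranks spend most of their communication time waiting rather than transferring data.
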
The bandwidth analysis page displays the communication performance of network-wide links, including the communication duration, traffic, bandwidth, and link type. Table 3 describes the fields on the bandwidth analysis page.

| Field | Description |
|---|---|
| Packet Number | Number of communication packets. |
| Packet Size(MB) | Size of a communication packet. |
| Transport Type | Link mode. |
| SDMA | SDMA links (communication links between devices in a node), including HCCS, PCIe, and SIO links. |
| RDMA | RDMA links (inter-node device communication links). |
| Transit Size(MB) | Size of communication packets within one communication process. |
| Transit Time(ms) | Duration of one communication process. |
| Bandwidth(GB/s) | Bandwidth. The bandwidth equals the traffic divided by the communication duration. Empirical bandwidth reference values: RDMA_Bandwidth = 12.5; HCCS_Bandwidth = 18; and PCIe_Bandwidth = 20. |
| Large Packet Ratio | Ratio of large communication packets. It is the ratio of packets whose sizes are big enough to enable the communication link to reach the empirical bandwidth. |

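
As a rough sketch of how the Bandwidth(GB/s) and Large Packet Ratio fields relate to the other columns, the example below divides Transit Size by Transit Time and compares the result with the empirical reference values from the table. Several points are assumptions: decimal units (1 GB = 1000 MB), the reference values being in GB/s, and the `threshold_mb` parameter, since the packet size that counts as "large" is not specified here. All function names are hypothetical.

```python
# Empirical reference values from the table; the GB/s unit is assumed.
EMPIRICAL_BW = {"RDMA": 12.5, "HCCS": 18.0, "PCIe": 20.0}


def bandwidth_gb_per_s(transit_size_mb, transit_time_ms):
    """Traffic divided by duration. With decimal units, MB/ms equals GB/s."""
    if transit_time_ms <= 0:
        return 0.0
    return transit_size_mb / transit_time_ms


def utilization(bandwidth, link_type):
    """Fraction of the empirical reference bandwidth that was reached."""
    return bandwidth / EMPIRICAL_BW[link_type]


def large_packet_ratio(packet_sizes_mb, threshold_mb):
    """Share of packets at or above a caller-supplied (hypothetical) threshold."""
    if not packet_sizes_mb:
        return 0.0
    large = sum(1 for size in packet_sizes_mb if size >= threshold_mb)
    return large / len(packet_sizes_mb)


bw = bandwidth_gb_per_s(transit_size_mb=25.0, transit_time_ms=2.0)
print(bw, utilization(bw, "RDMA"))
print(large_packet_ratio([0.5, 2.0, 4.0, 0.1], threshold_mb=1.0))
```
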
The Communication Operators Details page displays communication performance by operator, including the communication duration, waiting duration, and synchronization duration of each communication operator. Table 4 describes the fields in the figure.

| Field | Description |
|---|---|
| Operator Name | Communication operator name. |
| Elapsed Time(ms) | Total duration of all events of the communication operators, in milliseconds. |
| Transit Time(ms) | Communication duration, in milliseconds. The communication duration is calculated based on the total duration of the communication operators of the SDMA and RDMA links. |
| Synchronization Time(ms) | Synchronization duration, in milliseconds. It is the waiting time before the first data transmission. |
| Wait Time(ms) | Waiting duration, in milliseconds. Synchronization is performed before two logical cards communicate with each other. |
| Synchronization Time Ratio | Synchronization duration ratio. The calculation formula is Synchronization Time/(Synchronization Time + Transit Time). |
| Wait Time Ratio | Waiting duration ratio. The calculation formula is Wait Time/(Wait Time + Transit Time). |
| Idle Time(ms) | Duration for communication operator delivery. Idle Time = Elapsed Time – Transit Time – Wait Time. |
| SDMA BW(GB) | SDMA bandwidth. |
| RDMA BW(GB) | RDMA bandwidth. |
| Operation | Click Show in Timeline to view the corresponding communication operator on the timeline page. Click Show in Thumbnail to view the operator in the communication operator thumbnail. |

The advice presents the results of data analysis, including the bandwidth description, byte alignment analysis, communication retransmission analysis, and communication packet analysis, together with suggestions. You can further locate the slow card and specific operator based on the advice in the parallel strategy analysis on the overview page.
- Bandwidth description: displays the maximum, minimum, and average values of the SDMA and RDMA bandwidths, and the differences between the maximum and minimum values, in the overview, SDMA, and RDMA dimensions. This helps developers quickly identify exceptions.
- Byte, retransmission, and packet analysis: collects statistics on byte alignment data of communication operators, communication retransmission analysis data, communication packet data, and communication bandwidth preemption data, and provides suggestions for developers.
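
The statistics named in the bandwidth description (maximum, minimum, average, and the difference between maximum and minimum) can be sketched as follows. This is an illustration under stated assumptions, not the tool's analysis logic: the input is a hypothetical mapping from link type to per-rank bandwidth values, and `summarize_bandwidth` is an illustrative name.

```python
def summarize_bandwidth(values):
    """Return max, min, average, and max-min difference for one link type."""
    if not values:
        return None
    highest, lowest = max(values), min(values)
    return {
        "max": highest,
        "min": lowest,
        "avg": sum(values) / len(values),
        "diff": highest - lowest,
    }


# Hypothetical per-rank bandwidth samples, grouped by link type.
samples = {"SDMA": [16.0, 18.0, 17.0], "RDMA": [10.0, 12.0, 14.0]}
for link_type, values in samples.items():
    print(link_type, summarize_bandwidth(values))
```

A large `diff` relative to `avg` for one link type suggests that some ranks are reaching much lower bandwidth than others, which is the kind of exception the bandwidth description is meant to surface.
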





