Timeline

The Timeline page displays the runtime behavior of the host and the devices during training or inference: the API execution duration on the host side and the task execution duration on the device side. Figure 1 shows the common units on the GUI, and Table 1 describes them.

Figure 1 Common Timeline units and the GUI

**Table 1** Common Timeline units and GUI description
No.	Name	Description
1	Python lane (level-1 pipeline)	Displays the code at the Python layer. During collection, you can enable the with_stack option to view the code call stack.
2	CANN lane (level-2 pipeline)	Collects data such as ACL API execution, GE convergence, and Runtime. Python operators are delivered from the level-1 pipeline to the level-2 pipeline. Tasks are dequeued from the level-2 pipeline and then delivered to the NPU layer.
3	Ascend Hardware (NPU layer)	Also called the device side, it records the execution time sequence of tasks such as computing and communication on the NPU.
4	AI Core Freq	Shows the AI Core frequency over time. Use it to check for frequency reduction issues.
5	Communication	Formerly called HCCL lane, it records the communication events at the NPU layer, corresponding to the communication sub-lane of Ascend Hardware. The events are reported by components such as HCCL. This lane can be used to locate communication details.
6	Overlap Analysis	The computing and communication tasks of Ascend Hardware (NPU layer) are vertically projected onto this lane to display the computing, communication, and free time. Use it to quickly compare computing, communication, and free time across cards.
7	Stats System View	Statistics summary of a single card. You can select a card from the Rank ID drop-down list box on the left.

This section lists the lanes and functions most commonly used when locating problems on the timeline. You can expand each lane to view details, as shown in Figure 2. For details about the GUI, see "System Tuning" > "Timeline" in MindStudio Insight User Guide.

Figure 2 Expanding lanes to view details

Common Operations

To quickly view or learn about all shortcut key operations, click the question mark button in the upper right corner of the page and choose Keyboard shortcuts from the drop-down list.

Common timeline operations include jumping between communication and timeline tab pages, pinning and comparing, overlap analysis, flag marking of key areas, box selection and statistics, and delivery connection relationship viewing. For details, see "System Tuning" > "Timeline" in MindStudio Insight User Guide.

Locating the Difference Source of the Fast and Slow Cards

The Timeline page is used to further locate the difference source of the fast and slow cards. Ideally, the computing time of each card is similar. It is abnormal for one card to finish computing early and wait a long time for another card. If some cards have long-duration communication operators that spend most of their time waiting (for example, in a Notify Wait event), check for a fast and slow card issue.

Fast and slow cards are a phenomenon with multiple possible causes. You can determine the causes by comparing the differences between fast and slow cards on the Timeline page. Common causes of slow cards include load imbalance, slow computing, slow delivery, and slow data loading (for example, storage-related issues). The locating process is as follows:

1. In the communication operator thumbnail on the Communication page, find the communication operator whose duration differs most between cards and switch to the Timeline page.
2. Determine the source of the difference at the Ascend Hardware layer (NPU layer) through overlap analysis and pinning and comparing.
3. On the Timeline page, select async_npu to view its delivery connections. Follow the connections upstream from the NPU layer to identify the Python-layer code where the difference originates. After the Python-layer code is confirmed, contact the model development or O&M personnel to further locate the root cause.
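The comparison behind this process can be sketched in a few lines of Python. This is only an illustration, not part of MindStudio Insight: the per-rank numbers, the field names `compute_us` and `wait_us`, and the 10% tolerance are all assumptions made for the example. The idea is that a slow card shows a longer computing time than its peers, while the fast cards show long waits in communication operators.

```python
from statistics import median

def find_slow_ranks(per_rank, tolerance=0.10):
    """Return rank IDs whose computing time exceeds the median by more than `tolerance`.

    per_rank maps a rank ID to {"compute_us": ..., "wait_us": ...} for one step.
    Both field names and the tolerance are hypothetical, chosen for this sketch.
    """
    baseline = median(r["compute_us"] for r in per_rank.values())
    return sorted(
        rank for rank, r in per_rank.items()
        if (r["compute_us"] - baseline) / baseline > tolerance
    )

# Made-up per-rank totals for a single training step.
step = {
    0: {"compute_us": 1000.0, "wait_us": 210.0},
    1: {"compute_us": 1010.0, "wait_us": 200.0},
    2: {"compute_us": 1200.0, "wait_us": 10.0},   # computes longest, barely waits
    3: {"compute_us": 990.0, "wait_us": 220.0},
}

slow = find_slow_ranks(step)
for rank, r in sorted(step.items()):
    label = "slow" if rank in slow else "ok"
    print(f"rank {rank}: compute={r['compute_us']:.0f}us wait={r['wait_us']:.0f}us [{label}]")
```

In this made-up data, the rank that computes longest is also the one that waits least, which is the pattern to look for when pinning and comparing cards on the Timeline page.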

The Timeline page provides a wide range of functions. For detailed use cases, see Fast and Slow Card Locating Case on the Timeline.

Observing the Delivery Bottleneck

The Timeline page provides an efficient way to observe delivery problems. Ideally, the computing pipeline on the NPU side runs continuously and the NPU never waits for the CPU. If delivery is slow, however, the pipeline stalls and the computing power utilization of the AI Core decreases.

The ideal Free Time proportion is less than 10%.
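To make the Free Time proportion concrete, the following minimal sketch reproduces the idea of the Overlap Analysis lane: computing and communication intervals are projected onto one axis, and whatever part of the window is covered by neither counts as free. The interval values, the function names, and the window are assumptions for illustration; the tool's exact accounting may differ.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals and return them sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def clip(intervals, window):
    """Keep only the parts of the intervals that fall inside the window."""
    lo, hi = window
    return [(max(s, lo), min(e, hi)) for s, e in intervals if e > lo and s < hi]

def total(intervals):
    return sum(end - start for start, end in intervals)

def overlap_summary(compute, comm, window):
    """Project computing and communication onto one axis and measure free time."""
    compute = merge(clip(compute, window))
    busy = merge(compute + clip(comm, window))
    span = window[1] - window[0]
    computing = total(compute)
    comm_not_overlapped = total(busy) - computing
    free = span - total(busy)
    return {
        "computing": computing,
        "communication_not_overlapped": comm_not_overlapped,
        "free": free,
        "free_ratio": free / span,
    }

# Made-up task intervals in microseconds within a 0-100 us window.
summary = overlap_summary(
    compute=[(0, 30), (40, 60)],
    comm=[(25, 45), (70, 80)],
    window=(0, 100),
)
print(summary)
print("free time above 10%:", summary["free_ratio"] > 0.10)
```

Here the busy time is the union of all intervals, so communication that runs concurrently with computing does not reduce the free time twice.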

For details about the troubleshooting and tuning methods, see Task Dispatch Anomaly Analysis. The typical delivery bottlenecks on the Timeline are as follows:

The ratio of Free in Overlap Analysis is higher than that of Computing and Communication, as shown in Figure 3 and Figure 4.
Figure 3 Typical delivery bottleneck 1

Figure 4 Typical delivery bottleneck 2
The HostToDevice line is nearly vertical, as shown in Figure 5.
Figure 5 Typical delivery bottleneck 3
Frequent HostToDevice copies interrupt the asynchronous pipeline, causing a delivery bottleneck, as shown in Figure 6.
Figure 6 Typical delivery bottleneck 4
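The "nearly vertical HostToDevice line" symptom can also be expressed numerically. A nearly vertical line means a task starts on the device almost as soon as the host dispatches it, so the device queue is empty and the NPU is waiting on the host. The sketch below is a rough illustration under that reading; the field names `host_launch_us` and `device_start_us` and the 50 us threshold are hypothetical and do not come from the tool.

```python
from statistics import median

def dispatch_lags(tasks):
    """Time between host-side launch and device-side start, per task."""
    return [t["device_start_us"] - t["host_launch_us"] for t in tasks]

def looks_host_bound(tasks, lag_threshold_us=50.0):
    """Heuristic: a small median lag suggests the device is starved by the host.

    The threshold is an assumption for this sketch, not a documented value.
    """
    return median(dispatch_lags(tasks)) < lag_threshold_us

# Made-up timestamps: the device starts each task right after dispatch.
starved = [
    {"host_launch_us": 0.0, "device_start_us": 5.0},
    {"host_launch_us": 100.0, "device_start_us": 106.0},
    {"host_launch_us": 200.0, "device_start_us": 204.0},
]

# Made-up timestamps: the host runs well ahead and tasks queue up on the device.
queued = [
    {"host_launch_us": 0.0, "device_start_us": 500.0},
    {"host_launch_us": 10.0, "device_start_us": 800.0},
    {"host_launch_us": 20.0, "device_start_us": 1100.0},
]

print("starved trace host-bound:", looks_host_bound(starved))
print("queued trace host-bound:", looks_host_bound(queued))
```

A healthy asynchronous pipeline looks like the second trace: the host launches far ahead of execution, so the connection lines slant noticeably.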

Single-Card Statistics and Operator Searching

To find the position of an operator on the Timeline page, select System View in the data pane at the bottom, select Stats System View and the corresponding Rank SN, and click Kernel Details to view all operators. You can filter operators by name, type, accelerator core, and input/output shape, and sort them by duration. Select an operator and click Click in the Click to Timeline column to go to the corresponding Timeline view, as shown in Figure 7. This is faster than a global search.

Figure 7 Operator details
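The same filter-and-sort workflow can be applied offline to exported operator data. The sketch below assumes each operator has already been loaded as a dictionary; the keys `name`, `type`, `core`, and `duration_us`, as well as the sample rows, are hypothetical and may not match the column names of an actual export.

```python
def find_operators(kernels, name_contains=None, op_type=None, core=None, top=None):
    """Filter operators by name, type, and accelerator core, then sort by duration.

    Every filter is optional. Results are ordered from longest to shortest duration.
    """
    rows = [
        k for k in kernels
        if (name_contains is None or name_contains.lower() in k["name"].lower())
        and (op_type is None or k["type"] == op_type)
        and (core is None or k["core"] == core)
    ]
    rows.sort(key=lambda k: k["duration_us"], reverse=True)
    return rows if top is None else rows[:top]

# Made-up operator records for illustration.
kernels = [
    {"name": "MatMul_1", "type": "MatMulV2", "core": "AI_CORE", "duration_us": 120.5},
    {"name": "Add_3", "type": "Add", "core": "AI_VECTOR_CORE", "duration_us": 8.2},
    {"name": "MatMul_7", "type": "MatMulV2", "core": "AI_CORE", "duration_us": 340.0},
    {"name": "hcom_allReduce_2", "type": "hcom_allReduce", "core": "HCCL", "duration_us": 95.0},
]

for op in find_operators(kernels, name_contains="matmul"):
    print(f"{op['name']:<12} {op['type']:<10} {op['core']:<8} {op['duration_us']:>8.1f} us")
```

Sorting by duration first and then jumping to the timeline for the top few operators is usually the quickest way to see where a long-running operator sits relative to its neighbors.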
