Operator

The Operator page displays the duration statistics of computing and communication operators. The common functions are as follows:

Collect statistics by type to see each operator type's share of the total time, in particular whether low-efficiency operators such as conversion operators account for too large a share.
Collect statistics by accelerator core group to determine whether the time consumed by AI CPU operators or vector operators is abnormally high.
Collect statistics on computing operators by input shape to determine whether an operator performs worse for a specific shape.
You can switch to display the top 15 or all operators, as shown in Figure 1.
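
The three groupings above can be sketched offline as a simple aggregation. The snippet below is a minimal illustration, not the tool's implementation: the record layout, the sample operators, the core names, and the helper `time_share_by` are all hypothetical, and the durations are made-up values.

```python
from collections import defaultdict

# Hypothetical operator records, NOT the tool's export format:
# (name, type, accelerator core, input shape, duration in microseconds).
OPS = [
    ("MatMul_1", "MatMulV3", "AI_CORE", "4096,4096;4096,1024", 800.0),
    ("MatMul_2", "MatMulV3", "AI_CORE", "4096,4096;4096,1024", 700.0),
    ("Cast_1", "Cast", "AI_VECTOR_CORE", "4096,1024", 300.0),
    ("TransData_1", "TransData", "AI_VECTOR_CORE", "4096,1024", 150.0),
    ("TopK_1", "TopK", "AI_CPU", "1024", 50.0),
]

TYPE, CORE, SHAPE, DURATION = 1, 2, 3, 4  # tuple positions used as grouping keys


def time_share_by(records, key_index, top_n=None):
    """Sum durations per group; return (group, total_us, share) sorted by total time."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key_index]] += rec[DURATION]
    grand_total = sum(totals.values())
    rows = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    if top_n is not None:
        rows = rows[:top_n]  # mimics a "top N" view; None keeps all groups
    return [
        (group, total, total / grand_total if grand_total else 0.0)
        for group, total in rows
    ]


# Share of time per operator type, per core group, and per input shape.
for label, index in (("type", TYPE), ("core", CORE), ("shape", SHAPE)):
    for group, total, share in time_share_by(OPS, index, top_n=15):
        print(f"{label:5s} {group:24s} {total:8.1f} us {share:6.1%}")
```

Sorting by total time and then truncating mirrors the top 15 versus all switch: a large share for a conversion type such as `Cast` or `TransData`, or for the `AI_CPU` group, is the kind of signal the page is meant to surface.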

For details about the operator locating and tuning methods, see Operator Performance Tuning Solutions.

Figure 1 Operator page

The Operator page also supports the comparison between two cards. For details, see "Instructions" in MindStudio Insight User Guide.

Typical Case: Using the Operator Comparison to Quickly Locate the Computing Performance Deterioration

Background: The same model is deployed on two different devices, but the computing time on one device is about 80 ms per step longer than on the other. The root cause needs to be identified.

Move the performance data of the two cards to the same parent directory and use MindStudio Insight to open the parent directory.
Set data comparison between two cards by referring to "Instructions" in MindStudio Insight User Guide.
The single-step computing time of the slow card is about 80 ms longer than that of the fast card. After the comparison mode is enabled, sort the operators by Total Time in Operator Details, as shown in Figure 2. In the figure, the number of MatMul operators is the same on both cards (difference = 0), while their total duration differs by approximately 74 ms. This indicates that the MatMul operators are the primary source of the time difference.
Figure 2 Operator comparison between cards
Compare operators of the same shape. As shown in Figure 3, operators of the same type (MatMulV3) degrade by different amounts depending on their shape. For every shape, however, MatMul on the slow card is consistently slower than on the fast card.
Figure 3 Shape-based comparison
The analysis indicates that differences in on-chip memory between the devices are the root cause. Because the MatMul operator is memory-intensive, different levels of compute and communication bandwidth preemption on the two devices result in the performance gap.
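
The comparison logic in this case (same operator count, different total time, first by type and then by shape) can be sketched as follows. This is only an illustration under stated assumptions: the record fields, the helpers `summarize` and `compare`, and all durations are hypothetical and synthetic, not data from the case or from MindStudio Insight.

```python
from collections import defaultdict

# Synthetic per-operator records for two cards (hypothetical fields and values).
FAST = [
    {"type": "MatMulV3", "shape": "4096x4096", "duration_ms": 1.0},
    {"type": "MatMulV3", "shape": "4096x1024", "duration_ms": 2.0},
    {"type": "Add", "shape": "4096", "duration_ms": 0.5},
]
SLOW = [
    {"type": "MatMulV3", "shape": "4096x4096", "duration_ms": 1.5},
    {"type": "MatMulV3", "shape": "4096x1024", "duration_ms": 2.25},
    {"type": "Add", "shape": "4096", "duration_ms": 0.5},
]


def summarize(records, key):
    """Return {group: (count, total_ms)}; key(rec) selects the grouping fields."""
    acc = defaultdict(lambda: [0, 0.0])
    for rec in records:
        entry = acc[key(rec)]
        entry[0] += 1
        entry[1] += rec["duration_ms"]
    return {group: (count, total) for group, (count, total) in acc.items()}


def compare(fast, slow, key):
    """Per-group (group, count diff, total-time diff in ms), largest time diff first."""
    f, s = summarize(fast, key), summarize(slow, key)
    rows = []
    for group in set(f) | set(s):
        fast_count, fast_total = f.get(group, (0, 0.0))
        slow_count, slow_total = s.get(group, (0, 0.0))
        rows.append((group, slow_count - fast_count, slow_total - fast_total))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Type-level view: a count diff of 0 with a large time diff points at per-operator slowdown.
for group, count_diff, time_diff in compare(FAST, SLOW, lambda r: r["type"]):
    print(group, count_diff, time_diff)

# Shape-level view: check whether the slowdown holds for every shape of the same type.
for group, count_diff, time_diff in compare(FAST, SLOW, lambda r: (r["type"], r["shape"])):
    print(group, count_diff, time_diff)
```

A count difference of zero rules out extra operator launches as the cause, so a remaining time difference has to come from each operator running slower, which is why the case then drills down by shape.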
