Analyzing Profile Data

Theoretical Parameters

Theoretical performance is the upper bound that actual operator performance should approach. Hardware specifications vary by platform, so the theoretical values vary as well. Knowing them helps you understand the potential of the hardware and set performance optimization objectives.

Theoretical time required for data transfer pipelines (such as MTE1, MTE2, and MTE3) = Amount of transferred data (unit: byte)/Theoretical bandwidth. For example, the GM peak bandwidth of an AI Processor is about 1.8 TB/s. To transfer a 4096 x 4096 matrix of the float type, the theoretical transfer time is sizeof(float) x 4096 x 4096/1.8 TB/s = 37.28 µs, where 1 TB = 10¹² bytes. Note the following:
- If multiple transfer instructions run at the same time, they may share the bandwidth, so no single transfer reaches a rate close to the theoretical bandwidth. For example, when MTE2 and MTE3 perform GM read and write operations at the same time, the time consumed by the transfer pipeline is calculated as follows: (Transfer amount of MTE2 + Transfer amount of MTE3)/GM bandwidth.
- The bandwidth utilization (effective bandwidth/theoretical bandwidth) varies with the size of the data block to be transferred. If only a small amount of data is transferred each time, the actual performance cannot reach the theoretical bandwidth.
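
The transfer-time formula can be sketched in a few lines of Python. This is only an illustration: the helper name transfer_time_us is made up for this example, and the MTE3 write amount in the shared-bandwidth case is an assumed value, not a measurement.

```python
TB = 10**12  # the text defines 1 TB as 10^12 bytes

def transfer_time_us(num_bytes, bandwidth_tb_per_s):
    """Theoretical transfer time in microseconds at the given bandwidth."""
    return num_bytes / (bandwidth_tb_per_s * TB) * 1e6

FLOAT_BYTES = 4  # sizeof(float)
matrix_bytes = FLOAT_BYTES * 4096 * 4096

# Single transfer of a 4096 x 4096 float matrix at 1.8 TB/s.
single_us = transfer_time_us(matrix_bytes, 1.8)
print(f"single transfer: {single_us:.2f} us")

# Shared bandwidth: MTE2 reads and MTE3 writes GM at the same time.
# Assumption for illustration: MTE3 writes back the same number of bytes.
mte2_bytes = matrix_bytes
mte3_bytes = matrix_bytes
shared_us = transfer_time_us(mte2_bytes + mte3_bytes, 1.8)
print(f"MTE2 + MTE3 sharing GM: {shared_us:.2f} us")
```

Under these assumptions the shared case takes twice as long as the single transfer, because both pipelines draw on the same GM bandwidth.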

Theoretical time required for computing pipelines (such as Cube, Vector, and Scalar) = Computation amount (unit: element)/Theoretical computing power. For example, if the theoretical peak computing power of an AI Processor for float vectors is 11.06 TOPS, the theoretical time required for a single-instruction computation on 32K float elements is 32K/11.06 TOPS ≈ 0.003 µs (computed based on 1K = 1000).
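
The same arithmetic for a computing pipeline, as a minimal sketch. The function name compute_time_us is hypothetical; the 11.06 TOPS figure and the 1K = 1000 convention come from the example above.

```python
def compute_time_us(num_elements, peak_tops):
    """Theoretical compute time in microseconds; 1 TOPS = 10^12 operations per second."""
    return num_elements / (peak_tops * 1e12) * 1e6

elements = 32 * 1000  # 32K elements, with 1K = 1000 as in the text
vector_us = compute_time_us(elements, 11.06)
print(f"vector compute: {vector_us:.4f} us")
```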

How to Find Bottlenecks

After you obtain profile data, the time-consuming points that deviate most from the theoretical values are treated as "bottlenecks". The following describes how to find bottlenecks and choose optimization directions based on profile data.

Method 1: Analyze the pipeline through on-board profiling.
View the op_summary_*.csv file parsed by on-board profiling. Note: The asterisk (*) indicates the timestamp.

Figure 1 Example 1 of op_summary_*.csv

Ideally, the utilization of each pipeline is 100%. A value below 100% means there may be room for improvement. The preceding figure shows the data obtained from an AI Processor. For the Cube operator MatMulV2, the Cube pipeline utilization (aic_mac_ratio) is about 80%, implying that the computing power is not fully utilized. The MTE2 utilization (aic_mte2_ratio) is about 95%, which makes MTE2 the longest pipeline.
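
One way to pick out the longest pipeline from a parsed row is to compare all utilization columns. The sketch below uses an inline sample instead of a real op_summary_*.csv file. It assumes that utilization columns end in _ratio, that values are stored as fractions, and that the operator column is named Op Name; check these against your own file.

```python
import csv
import io

# Inline stand-in for one row of op_summary_*.csv (values from the example above).
SAMPLE = """Op Name,aic_mac_ratio,aic_mte2_ratio
MatMulV2,0.80,0.95
"""

def longest_pipeline(row):
    """Return (column, utilization) for the pipeline with the highest utilization."""
    ratios = {k: float(v) for k, v in row.items() if k.endswith("_ratio")}
    column = max(ratios, key=ratios.get)
    return column, ratios[column]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
column, ratio = longest_pipeline(rows[0])
print(f"{rows[0]['Op Name']}: longest pipeline is {column} at {ratio:.0%}")
```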

Compare the actual duration of the longest pipeline with its theoretical value. The shapes of the input left and right matrices are (2048, 12288) and (12288, 6144) respectively, in bfloat16. The shape of the bias input is 6144 in float. From these, the total amount of data to be transferred can be computed. According to Theoretical Parameters, the theoretical time for the data transfer is as follows: (sizeof(bfloat16) x (2048 x 12288 + 12288 x 6144) + sizeof(float) x 6144)/1.8 TB/s ≈ 111.9 µs, where 1 TB = 10¹² bytes. The theoretical value differs greatly from the actual profile data aic_mte2_time. The analysis finds that the total size of the input data exceeds the L1 space size (512 KB), so the input matrix data may be transferred repeatedly during the MatMul compute. To determine whether the number of repeated transfers is reasonable, and to reduce it where it is not, proceed with pipeline optimization and tiling optimization. For details, see Method 3: Analyze the simulation pipeline chart.
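
The MatMul estimate can be reproduced as follows. This is a sketch of the arithmetic only: it counts each input exactly once, and it assumes that 512 KB means 512 x 1024 bytes.

```python
BF16_BYTES = 2   # sizeof(bfloat16)
FLOAT_BYTES = 4  # sizeof(float)
GM_BANDWIDTH = 1.8e12  # bytes per second, with 1 TB = 10^12 bytes

m, k, n = 2048, 12288, 6144  # left matrix is (m, k), right matrix is (k, n)

# Each input transferred exactly once: both matrices plus the bias vector.
total_bytes = BF16_BYTES * (m * k + k * n) + FLOAT_BYTES * n
theoretical_us = total_bytes / GM_BANDWIDTH * 1e6

# Assumption: the 512 KB L1 size is 512 * 1024 bytes.
L1_BYTES = 512 * 1024
exceeds_l1 = total_bytes > L1_BYTES

print(f"bytes to transfer: {total_bytes}")
print(f"theoretical MTE2 time: {theoretical_us:.2f} us")
print(f"inputs exceed L1: {exceeds_l1}")
```

Because the inputs are far larger than L1, this figure is a lower bound. Any repeated transfer of matrix tiles adds to it.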

Figure 2 Example 2 of op_summary_*.csv

In the preceding figure, the shape of the operator input is (8192, 8192), in float. From this, the total amount of data to be transferred can be computed. According to Theoretical Parameters, the theoretical time required for the data transfer is as follows: sizeof(float) x (8192 x 8192)/0.8 TB/s ≈ 335.5 µs, where 1 TB = 10¹² bytes and the theoretical bandwidth varies depending on the AI Processor. The theoretical value is consistent with the actual profile data aiv_mte2_time, so the operator is close to being MTE2-bound. In this example, the total execution duration is 350 µs, which matches the actual duration of MTE2, indicating that the operator has been properly tuned. If there is a large gap between the MTE2 duration and the total execution time, the next step is pipeline optimization and tiling optimization, to hide the other pipelines behind the MTE2 pipeline. See Method 3: Analyze the simulation pipeline chart.
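
The comparison between the theoretical and measured values can be expressed as a ratio. In this sketch the 350 µs figure from the example is used as the measured MTE2 duration; in practice you would read it from aiv_mte2_time.

```python
FLOAT_BYTES = 4
BANDWIDTH = 0.8e12  # bytes per second for this AI Processor, 1 TB = 10^12 bytes

theoretical_us = FLOAT_BYTES * 8192 * 8192 / BANDWIDTH * 1e6

# Measured MTE2 duration, taken from the example (about 350 us).
measured_mte2_us = 350.0

# Fraction of the theoretical bandwidth that the transfer actually achieved.
achieved = theoretical_us / measured_mte2_us

print(f"theoretical: {theoretical_us:.1f} us")
print(f"achieved bandwidth fraction: {achieved:.1%}")
```

A fraction close to 1 suggests that the transfer is already near the bandwidth limit. What counts as close enough is a judgment call that this sketch does not fix.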

Method 2: Analyze tiling through on-board profiling
View the op_summary_*.csv file parsed by on-board profiling.

Figure 3 Example of op_summary_*.csv

The preceding figure shows the data obtained from an AI Processor. On this hardware platform, the AI Processor has 48 vector cores. The Mul operator is a pure vector operator. However, in some scenarios, not all vector cores are used (Block Dim < 48), which wastes computing power. In this case, the next direction is tiling optimization.
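
A quick way to spot this in the parsed data is to filter rows by Block Dim. The sketch below uses an inline sample with made-up operator names and values. The column names Op Name and Block Dim are assumptions about the CSV header, and 48 is the vector core count from the example.

```python
import csv
import io

VECTOR_CORES = 48  # vector core count of the AI Processor in the example

# Hypothetical rows standing in for op_summary_*.csv.
SAMPLE = """Op Name,Block Dim
Mul_0,48
Mul_1,8
"""

def underused_ops(rows, cores):
    """Return the rows whose Block Dim is below the number of available cores."""
    return [row for row in rows if int(row["Block Dim"]) < cores]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = underused_ops(rows, VECTOR_CORES)
for row in flagged:
    print(f"{row['Op Name']}: Block Dim {row['Block Dim']} of {VECTOR_CORES} cores")
```

Operators flagged this way are candidates for tiling optimization.
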

Method 3: Analyze the simulation pipeline chart.

Figure 4 Example of the simulation pipeline chart

The preceding figure shows the data obtained from an AI Processor. The pipelines related to vector cores (such as MTE2 and MTE3 of vec0 as well as MTE2 and MTE3 of vec1) are regularly interrupted. Analyze the operator logic to check whether the interruptions are caused by factors such as data dependency. The next direction is therefore pipeline optimization as the primary method, with tiling optimization and memory optimization as secondary methods, to further improve vector pipeline utilization.

Method 4: Check the header overhead through on-board profiling.
The header overhead is the latency incurred before the operator starts computing. It includes the latency caused by core startup, TLB misses during core address fetching, access to the same address (additional latency caused by conflicts when multiple cores access the same memory address at the same time due to hardware restrictions), and variable resource initialization. Take Atlas A2 training products/Atlas A2 inference products as an example. The full-core header overhead is about 20–21 µs. For operators whose latency is at the microsecond level, which are common in the inference domain, the header overhead is worth optimizing.

By analyzing the on-board profiling data (the TaskDuration data when the kernel is empty), you can see the startup overhead of each core. You can then experiment with different numbers of cores and operator kernel types to find the optimal configuration. For details about the optimization direction, see Header and Tailer Overhead Optimization.
