Analyzing Profile Data

Theoretical Parameters

Theoretical performance is the ideal that actual operator performance strives toward. Because hardware specifications vary by platform, theoretical parameters help you understand the potential of the hardware and set performance optimization objectives and expectations.

  • Theoretical time required for data transfer pipelines (such as MTE1, MTE2, and MTE3) = Amount of transferred data (unit: byte)/Theoretical bandwidth. For example, the GM peak bandwidth of an AI Processor is about 1.8 TB/s. If you want to transfer a 4096 x 4096 matrix of the float type, the theoretical transfer time is as follows: sizeof(float) x 4096 x 4096/1.8 TB/s ≈ 37.28 µs, where 1 TB = 10^12 bytes.
    • If multiple transfer commands run at the same time, the bandwidth may be shared, and data cannot be transferred at a rate close to the theoretical bandwidth. For example, when MTE2 and MTE3 perform GM read and write operations at the same time, the time consumed by the transfer pipelines is calculated as follows: (Transfer amount of MTE2 + Transfer amount of MTE3)/GM bandwidth.
    • The bandwidth utilization (effective bandwidth/theoretical bandwidth) varies according to the size of a data block to be transferred. If a small amount of data is transferred each time, the actual performance cannot reach the theoretical bandwidth.
  • Theoretical time required for computing pipelines (such as Cube, Vector, and Scalar) = Computation amount (unit: element)/Theoretical computing power. For example, the theoretical peak computing power of an AI Processor's Vector unit when processing float data is 11.06 TOPS. If you want to perform a single-instruction computation on 32K elements, the theoretical computation time is as follows: 32 x 1000/11.06 TOPS ≈ 0.003 µs, where 1K = 1000.
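The two formulas above can be captured in a small calculation sketch. The bandwidth and computing-power constants below are the example values quoted above, not universal figures; substitute the specifications of your own AI Processor.

```python
# Theoretical pipeline times from the formulas above. The constants are
# the example values quoted in the text; replace them with the
# specifications of your own AI Processor.

GM_BANDWIDTH = 1.8e12    # bytes/s (example GM peak bandwidth, 1 TB = 10^12 bytes)
VECTOR_POWER = 11.06e12  # elements/s (example Vector peak for float data)

def transfer_time_us(num_bytes, bandwidth=GM_BANDWIDTH):
    """Theoretical MTE pipeline time = transferred bytes / bandwidth."""
    return num_bytes / bandwidth * 1e6

def compute_time_us(num_elements, power=VECTOR_POWER):
    """Theoretical compute pipeline time = elements / computing power."""
    return num_elements / power * 1e6

# 4096 x 4096 float matrix over GM:
print(round(transfer_time_us(4 * 4096 * 4096), 2))   # -> 37.28
# 32K elements (1K = 1000) on the Vector unit:
print(round(compute_time_us(32 * 1000), 3))          # -> 0.003
```

Comparing these theoretical values against the measured pipeline times in the profile data is the basis of the bottleneck analysis that follows.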

How to Find Bottlenecks

After obtaining profile data, treat the time-consuming stages whose measured time deviates greatly from the theoretical value as "bottlenecks". The following describes how to find bottlenecks and choose optimization directions based on profile data.

  • Method 1: Analyze the pipeline through profiling on the board.

    View the op_summary_{}.csv file parsed by profiling on the board.

    Figure 1 Example 1 of op_summary_{}.csv

    The ideal utilization of each pipeline is 100%; anything less means there may be room for improvement. The preceding figure shows the data obtained from an AI Processor. For the Cube operator MatMulV2, the Cube pipeline utilization aic_mac_ratio is about 80%, implying that the computing power is not fully utilized. The MTE2 utilization aic_mte2_ratio is about 95%, so MTE2 is the longest pipeline.

    Compare the duration of the longest pipeline with its theoretical value. The shapes of the left and right input matrices are (2048, 12288) and (12288, 6144) respectively, in bfloat16, and the shape of the bias input is 6144 in float. Therefore, the total amount of data to be transferred can be calculated. According to Theoretical Parameters, the theoretical time for the data transfer is as follows: (sizeof(bfloat16) x (2048 x 12288 + 12288 x 6144) + sizeof(float) x 6144)/1.8 TB/s ≈ 111.8 µs, where 1 TB = 10^12 bytes. The theoretical value differs greatly from the actual profile data aic_mte2_time. Analysis finds that the total size of the input data exceeds the L1 space size (512 KB), so the input matrix data may be transferred repeatedly during the MatMul computation. To check whether the number of repeated transfers is reasonable, pipeline optimization and tiling optimization are required. For details, see Method 3: Analyze the simulation pipeline chart.
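    As a cross-check, the MTE2 estimate for this example can be reproduced with a short calculation. The shapes and the 1.8 TB/s bandwidth are taken directly from the example above.

```python
# Reproduce the theoretical MTE2 time for the MatMulV2 example:
# inputs (2048, 12288) and (12288, 6144) in bfloat16, bias 6144 in float.

SIZEOF_BF16, SIZEOF_FLOAT = 2, 4   # bytes per element
GM_BANDWIDTH = 1.8e12              # bytes/s (1 TB = 10^12 bytes)

left = 2048 * 12288    # left matrix elements (bfloat16)
right = 12288 * 6144   # right matrix elements (bfloat16)
bias = 6144            # bias elements (float)

total_bytes = SIZEOF_BF16 * (left + right) + SIZEOF_FLOAT * bias
theoretical_us = total_bytes / GM_BANDWIDTH * 1e6
# About 111.9 us here; the text rounds the estimate to roughly 111.8 us.
print(f"theoretical MTE2 time: {theoretical_us:.1f} us")
```

    If the measured aic_mte2_time is much larger than this estimate, the same data is probably being transferred more than once.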

    Figure 2 Example 2 of op_summary_{}.csv

    In the preceding figure, the shape of the operator input is (8192, 8192), in float. Therefore, the total amount of data to be transferred can be calculated. According to Theoretical Parameters, the theoretical time required for the data transfer is as follows: sizeof(float) x (8192 x 8192)/0.8 TB/s ≈ 335.5 µs, where 1 TB = 10^12 bytes and the theoretical bandwidth varies depending on the AI Processor. The theoretical value is consistent with the actual profile data aic_mte2_time, so it can be determined that the data transfer of the operator almost reaches the MTE2 bound. In this example, the total execution duration is 350 µs, close to the actual MTE2 duration, indicating that the operator has been properly tuned. If there is a large gap between the MTE2 duration and the total execution time, the next step is pipeline optimization and tiling optimization, to hide the other pipelines behind the MTE2 pipeline. See Method 3: Analyze the simulation pipeline chart.
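    A minimal sketch of this MTE2-bound check, using a hypothetical helper mte2_bound; the 0.9 threshold is an assumption for illustration, not a value from the profiler.

```python
# Hypothetical helper for the MTE2-bound check described above. An
# operator is close to the MTE2 bound when the theoretical transfer time
# matches the measured aic_mte2_time, and well pipelined when
# aic_mte2_time in turn covers most of the total duration. The 0.9
# threshold is an illustrative assumption.

def mte2_bound(theoretical_us, aic_mte2_time_us, total_us, threshold=0.9):
    transfer_bound = theoretical_us / aic_mte2_time_us >= threshold
    well_pipelined = aic_mte2_time_us / total_us >= threshold
    return transfer_bound, well_pipelined

# Figures from the 8192 x 8192 float example (0.8 TB/s bandwidth):
theoretical = 4 * 8192 * 8192 / 0.8e12 * 1e6   # ~335.5 us
print(mte2_bound(theoretical, 335.5, 350.0))   # -> (True, True)
```

    A result of (True, False) would point to pipeline and tiling optimization to hide the other pipelines behind MTE2.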

  • Method 2: Analyze tiling through profiling on the board

    View the op_summary_{}.csv file parsed by profiling on the board.

    Figure 3 Example of op_summary_{}.csv

    The preceding figure shows the data obtained from an AI Processor. On this hardware platform, the AI Processor has 48 vector cores. The Mul operator is a simple vector operator. However, in some scenarios, not all vector cores are used (Block Dim < 48), wasting computing power. In this case, the next direction is tiling optimization.
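    The Block Dim check can be sketched as follows; wasted_cores is a hypothetical helper, and 48 matches the vector-core count quoted above for this platform.

```python
# Flag wasted vector cores from the Block Dim field, as in Method 2.
# NUM_VECTOR_CORES = 48 matches the AI Processor described in the text;
# adjust it for your own platform.

NUM_VECTOR_CORES = 48

def wasted_cores(block_dim, num_cores=NUM_VECTOR_CORES):
    """Return how many vector cores sit idle for a given Block Dim."""
    return max(num_cores - block_dim, 0)

print(wasted_cores(40))   # -> 8 idle cores: consider tiling optimization
print(wasted_cores(48))   # -> 0: all vector cores in use
```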

  • Method 3: Analyze the simulation pipeline chart.
    Figure 4 Example of the simulation pipeline chart

    The preceding figure shows the data obtained from an AI Processor. It can be seen that the pipelines related to the vector cores (such as MTE2/MTE3 of vec0 and MTE2/MTE3 of vec1) are regularly interrupted. Analyze the operator logic to check whether the interruption is caused by factors such as data dependency. If so, the next direction is pipeline optimization as the primary method, with tiling optimization and memory optimization as secondary methods, to further improve vector pipeline utilization.

  • Method 4: View the header overhead (extra overhead before the operator starts formal computation) through the simulation pipeline chart.
    Figure 5 Example of the simulation pipeline chart

    The preceding figure shows the data obtained from an AI Processor. Before the operator starts formal computation, time is occupied by the Scalar and MTE3 pipelines, resulting in a fixed header overhead. The next direction is transfer optimization and instruction optimization, to reduce the transfer time and the Scalar computation time.
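    As a rough way to quantify this, the header overhead can be expressed as a fraction of total kernel time. The function and the timestamps below are hypothetical illustrations read off a pipeline chart, not profiler fields.

```python
# Quantify the fixed header overhead seen in Method 4 as a fraction of
# total kernel time. All values are hypothetical timestamps (in
# microseconds) read off a simulation pipeline chart.

def header_overhead_ratio(first_compute_start_us, kernel_start_us, total_us):
    """Share of the run spent before formal computation begins."""
    return (first_compute_start_us - kernel_start_us) / total_us

# e.g. formal computation starts 5 us into a 50 us kernel:
print(header_overhead_ratio(5.0, 0.0, 50.0))   # -> 0.1
```

    A large, shape-independent ratio suggests fixed setup cost that transfer and instruction optimization can shrink.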