Analysis Result Display

Dimension 1: Analyzing Vector Operation

Analyze the simulation dump file for the following:

  1. If the vector execution efficiency of the operator is low, find the inefficient vector instructions.
  2. Check whether the data movement granularity of MTE2 is too small, which leaves too little input data for each vector operation.
  3. Check whether multi-core resources are enabled.
  4. In multi-core scenarios, check whether the vector operation workload is evenly allocated among cores.

The analysis result is as follows:

Figure 1 Analysis result display (Vector Bound)

You can right-click any row in the Vector Bound area to jump to the operator position in TIK Code or CCE Code. See Figure 2. If the operator is not a TIK operator, TIK Code redirection is unavailable. CCE Code redirection depends on the .cce and .asm files. If either file is missing from the project directory (out > bin > kernel_meta by default), code redirection becomes unavailable. The .asm file is automatically generated after the operator UT is performed. For details about how to generate the .cce file, see Setting Code Redirection.

Figure 2 Redirection options

Table 1 Fields under Vector Bound

| Field | Description |
| --- | --- |
| Vector Bound | Vector operation analysis. |
| Instruction | Instruction name. |
| First PC | Instruction address. You can search for the corresponding repeated vector instruction in the simulation dump file based on the value. |
| Execution Times | Number of times that a vector instruction is repeatedly executed. |
| Repeat | Corresponds to the repeat_times parameter, which indicates the number of iterations of a vector instruction. The value range is (0, 255]. |
| Mask | Element involved in vector computation. The value range is [1, 128], and the unit is bit. If a bit is set to 0, the corresponding element of the vector is masked in the computation. If a bit is set to 1, the corresponding element of the vector participates in the computation. |

For details about the tuning analysis parameters of this operator, see "API Reference > TBE TIK APIs > Vector Computation > Single Source (Gather Mode)" in the TBE & AI CPU Operator Developer Guide.

Tuning Suggestions:

  1. Try to optimize vector instructions.

    Optimize vector instructions.

  2. Try to optimize out->ub data move instructions.

    Optimize OUT-to-UB data movement instructions.

  3. Try to use multi-core resources.

    Enable multi-core resources.

  4. Try to reallocate vector calculations to these cores: 0 1.

    Re-allocate vector operation data to these cores: 0 1
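The multi-core suggestions above come down to giving each core a nearly equal share of the vector workload. The following is a minimal sketch of such a split in plain Python. The function name `split_evenly` is hypothetical and not part of any TIK API, and the sketch ignores any alignment requirement the hardware may impose on per-core offsets.

```python
def split_evenly(total_elems: int, core_num: int) -> list:
    """Return one (offset, length) pair per core.

    Lengths differ by at most one element, so no core receives
    noticeably more vector work than the others.
    """
    if core_num <= 0:
        raise ValueError("core_num must be positive")
    base, extra = divmod(total_elems, core_num)
    ranges = []
    offset = 0
    for core in range(core_num):
        # The first `extra` cores take one additional element each.
        length = base + (1 if core < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


# Hypothetical workload: 1000 elements over 3 cores.
print(split_evenly(1000, 3))  # [(0, 334), (334, 333), (667, 333)]
```

In a real operator the per-core length would typically be rounded to whatever block size the data movement instructions require, which this sketch leaves out.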

Instruction optimization: modify the mask and repeat_times parameters to replace the vector instruction that is repeatedly executed.

Based on the value of the First PC field, you can find the corresponding repeated vector instruction in the simulation dump file. As shown in the preceding figure, the value of Execution Times is 64, the value of Repeat is 1, and that of Mask is 64. The instruction is executed 10 times. Only one iterative computation is performed for each execution, and only one element in the vector is calculated each time.

To improve the execution efficiency of the instruction, change the value of repeat_times in the operator code to 10, and then change the mask value based on the number of vector bits and actual requirements. After the modification, the instruction needs to be executed only once to complete 10 iterative computations, and each computation covers all elements required in a specific scenario.
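The effect of mask and repeat_times on how often an instruction is issued can be estimated with a simple model. This sketch assumes that each iteration processes `mask` contiguous elements and that one instruction issue performs `repeat_times` iterations. The function name and the workload of 1280 elements are hypothetical; only the value ranges come from the Repeat and Mask fields.

```python
MAX_REPEAT = 255  # upper bound of repeat_times (Repeat field)
MAX_MASK = 128    # upper bound of mask (Mask field)


def instruction_issues(total_elems: int, mask: int, repeat_times: int) -> int:
    """Estimate how many times a vector instruction must be issued."""
    if not 1 <= mask <= MAX_MASK:
        raise ValueError("mask must be in [1, 128]")
    if not 1 <= repeat_times <= MAX_REPEAT:
        raise ValueError("repeat_times must be in (0, 255]")
    # Ceiling division with integers: iterations needed to cover all elements.
    iterations = -(-total_elems // mask)
    # Ceiling division again: issues needed to cover all iterations.
    return -(-iterations // repeat_times)


# Hypothetical workload of 1280 elements.
print(instruction_issues(1280, mask=128, repeat_times=1))   # 10
print(instruction_issues(1280, mask=128, repeat_times=10))  # 1
```

Under this model, raising repeat_times from 1 to 10 folds ten separate issues into one, which is the kind of change the Execution Times field helps to spot.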

Dimension 2: Analyzing Pipeline Interruption

Based on the simulation dump file, analyze the pipeline that accounts for the largest proportion from the following three dimensions:

  1. Nonconsecutive pipeline caused by other pipelines.
  2. Nonconsecutive pipeline caused by adding instructions to the queue.
  3. Pipeline interruption caused by the pipe_barrier(PIPE_ALL) command.

Based on the analysis result, sort the cycles that are affected by the preceding three dimensions. The result is as follows:

Figure 3 Analysis result display (Pipeline Interruptions)

Table 2 Fields under Pipeline Interruptions

| Field | Description |
| --- | --- |
| Pipeline Interruption | Pipeline interruption analysis. |
| Interruption Factor | Factor that causes the pipeline interruption. |
| Affected Pipeline | Pipeline affected by the interruption. |
| Interruption Cycles | Number of cycles during which the pipeline is interrupted. |
| Percentage to Total | Ratio of interrupted cycles to the total number of cycles, as a percentage. |
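The Percentage to Total column and the sort order can be reproduced from per-factor cycle counts. The sketch below assumes the counts have already been extracted from the dump file. The factor labels and numbers are made up for illustration and are not the tool's actual output.

```python
def rank_interruptions(cycles_by_factor: dict, total_cycles: int) -> list:
    """Return (factor, cycles, percent) rows sorted by cycles, largest first."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    rows = [
        (factor, cycles, 100.0 * cycles / total_cycles)
        for factor, cycles in cycles_by_factor.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical cycle counts for the three interruption dimensions.
sample = {
    "other pipelines": 300,
    "instruction queueing": 120,
    "pipe_barrier(PIPE_ALL)": 450,
}
for factor, cycles, percent in rank_interruptions(sample, total_cycles=1000):
    print(f"{factor}: {cycles} cycles ({percent:.1f}%)")
```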

Tuning Suggestions:

  1. Try to use the double buffer for UB.

    Use the ping-pong policy.

  2. Reduce strong data dependencies between pipelines.

    Optimize improper pipeline dependencies.

  3. Eliminate improper instruction synchronization between pipelines.

    Eliminate improper instruction synchronization between pipelines.

  4. Delete redundant pipe_barrier(PIPE_ALL).

    Delete the redundant pipe_barrier(PIPE_ALL) command.
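To see why the ping-pong (double buffer) policy helps, compare an idealized cycle count with and without it. The sketch assumes every tile takes the same number of cycles to load and to compute, that loading one tile can fully overlap computing the previous one, and that write-back and synchronization cost nothing. All names and numbers are hypothetical.

```python
def serial_cycles(tiles: int, load: int, compute: int) -> int:
    """Single buffer: each tile is loaded, then computed, with no overlap."""
    return tiles * (load + compute)


def ping_pong_cycles(tiles: int, load: int, compute: int) -> int:
    """Two buffers: loading tile i+1 overlaps computing tile i."""
    if tiles == 0:
        return 0
    # The first load and the last compute cannot be hidden; in between,
    # the slower of the two stages sets the pace.
    return load + (tiles - 1) * max(load, compute) + compute


# Hypothetical operator: 8 tiles, 100 cycles to load, 120 cycles to compute.
print(serial_cycles(8, 100, 120))     # 1760
print(ping_pong_cycles(8, 100, 120))  # 1060
```

Real gains are smaller than this model suggests, because the UB capacity available per tile is halved and synchronization between pipelines is not free.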

Dimension 3: Analyzing Scalar Operation

Based on the simulation dump file, collect statistics on the number of times that each scalar instruction is executed and its total execution cycles. Then, sort the instructions by total execution cycles and select the top 5 scalar instructions. The analysis result is as follows:

Figure 4 Analysis result display (Scalar Bound)

Table 3 Fields under Scalar Bound

| Field | Description |
| --- | --- |
| Scalar Bound | Scalar operation analysis. |
| Instruction | Instruction name. |
| Execution Times | Number of times that a scalar instruction is repeatedly executed. |
| Execution Cycles | Total execution cycles of a scalar instruction. |
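The top 5 selection amounts to grouping records by instruction name and sorting by total cycles. The sketch below assumes the dump has already been parsed into (instruction name, cycles) pairs. The parsing step is omitted because the dump format is not described here, and the instruction names are placeholders.

```python
from collections import defaultdict


def top_scalar_instructions(records, top_n: int = 5) -> list:
    """Return (instruction, execution_times, total_cycles) rows.

    Rows are sorted by total cycles, largest first, and cut to top_n.
    """
    stats = defaultdict(lambda: [0, 0])  # name -> [times, total cycles]
    for name, cycles in records:
        stats[name][0] += 1
        stats[name][1] += cycles
    ranked = sorted(stats.items(), key=lambda item: item[1][1], reverse=True)
    return [(name, times, total) for name, (times, total) in ranked[:top_n]]


# Placeholder records: (instruction name, cycles for one execution).
records = [("instr_a", 3), ("instr_b", 10), ("instr_a", 4), ("instr_c", 1)]
print(top_scalar_instructions(records))
# [('instr_b', 1, 10), ('instr_a', 2, 7), ('instr_c', 1, 1)]
```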

Tuning Suggestions:

  1. Try to adjust tiling policy.

    Adjust the tiling policy.

  2. Try to optimize the implementation solution.

    Optimize the implementation solution.

  3. Try to replace instructions with poor performance.

    Replace instructions with poor performance.

Dimension 4: Analyzing Memory Bound

Analyze the simulation dump file to find out performance bottlenecks of memory transfer.

  1. Small-packet transfer: The threshold of the OUT->UB/UB->OUT/L1->UB channel is not reached during an operator operation.

    The threshold of the OUT->UB/UB->OUT/L1->UB channel is derived as follows: A maximum vector operation can calculate the sum or product of two FP16 vectors of 128 elements each, that is, 128 x 2 x 2 B = 512 B.

  2. Redundant transfer: Redundancy = Transfer amount/Calculation amount. If the redundancy is greater than 1.2, redundant transfer exists.

    The transfer amount is that from OUT to UB, and the calculation amount is the vector calculation volume.

  3. Bandwidth preemption: Check whether the execution times of OUT->UB/OUT->L1 transfer instructions overlap to identify possible MTE2 bandwidth preemption.
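The three checks above can be expressed as simple predicates. The sketch below uses the 512 B threshold and the 1.2 redundancy limit stated in this section. The function names, the half-open (start_cycle, end_cycle) interval representation, and the sample values are assumptions made for illustration.

```python
SMALL_PACKET_THRESHOLD_BYTES = 128 * 2 * 2  # 512 B, as derived above
REDUNDANCY_LIMIT = 1.2


def is_small_packet(transfer_bytes: int) -> bool:
    """True if a single transfer does not reach the channel threshold."""
    return transfer_bytes < SMALL_PACKET_THRESHOLD_BYTES


def has_redundant_transfer(transfer_amount: float, calc_amount: float) -> bool:
    """True if Transfer amount / Calculation amount exceeds 1.2."""
    return transfer_amount / calc_amount > REDUNDANCY_LIMIT


def overlapping_transfers(intervals: list) -> list:
    """Return index pairs of transfers whose [start, end) cycles overlap."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            # Later transfers start even later, so stop at the first miss.
            if intervals[j][0] >= intervals[i][1]:
                break
            pairs.append((i, j))
    return pairs


print(is_small_packet(256))                # True
print(has_redundant_transfer(1500, 1000))  # True
print(overlapping_transfers([(0, 100), (50, 150), (200, 250)]))  # [(0, 1)]
```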

The analysis result is as follows:

Figure 5 Analysis result display (Memory Bound)

Tuning Suggestions:

  1. Small packets are transferred in such channels as out->ub. Combine the transfer instructions for optimization.

    Small-packet transfer exists in the OUT->UB channel. Combine the transfer instructions for tuning.

  2. Redundant transfer exists. Optimize the data transfer policy.

    Redundant transfer exists. Optimize the data transfer policy.

  3. Bandwidth preemption exists. Adjust the transfer instruction sequence.

    Bandwidth preemption occurs. Adjust the transfer instruction time sequence.