Analysis Result Display
Dimension 1: Analyzing Vector Operation
Analyze the simulation dump file for the following:
- If the vector execution efficiency of the operator is low, identify the inefficient vector instructions.
- Check whether the data movement granularity of MTE2 is too small, which leaves too little input data for each vector operation.
- Check whether multi-core resources are enabled.
- In multi-core scenarios, check whether the vector workload is evenly distributed among the cores.
The analysis result is as follows:
You can right-click any row in the Vector Bound area to jump to the operator position in TIK Code or CCE Code. If the operator is not a TIK operator, TIK Code redirection is unavailable. CCE Code redirection depends on the .cce and .asm files. If either file is missing from the project directory (out > bin > kernel_meta by default), code redirection is unavailable. See Figure 2. The .asm file is generated automatically after the operator UT is run. For details about how to generate the .cce file, see Setting Code Redirection.
| Field | Description |
|---|---|
| Vector Bound | Vector operation analysis. |
| Instruction | Instruction name. |
| First PC | Instruction address. You can search for the corresponding repeated vector instruction in the simulation dump file based on the value. |
| Execution Times | Number of times that a vector instruction is repeatedly executed. |
| Repeat | Corresponds to the repeat_times parameter, which indicates the number of iterations of a vector instruction. The value range is (0, 255]. |
| Mask | Element involved in vector computation. The value range is [1, 128], and the unit is bit. If a bit is set to 0, the corresponding element of the vector is masked in the computation. If a bit is set to 1, the corresponding element of the vector participates in the computation. |
For details about the tuning analysis parameters of this operator, see "API Reference > TBE TIK APIs > Vector Computation > Single Source (Gather Mode)" in the Operator Development Guide.
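The bitwise meaning of the Mask field can be illustrated with a few lines of plain Python. This is only a sketch: the helper name `active_elements` is hypothetical, and it assumes that bit i of the mask corresponds to element i of a 128-element vector, which the table does not state explicitly.

```python
def active_elements(mask_bits, width=128):
    """Return the indices of vector elements whose mask bit is 1.

    Assumption (not from the source): bit i of mask_bits maps to element i.
    """
    if not 0 <= mask_bits < (1 << width):
        raise ValueError("mask does not fit in the vector width")
    return [i for i in range(width) if (mask_bits >> i) & 1]


# Hypothetical mask: only the four lowest elements take part in the computation.
print(active_elements(0b1111))
```

Elements whose bit is 0 are skipped, so a sparse mask leaves most of the vector unit idle for that instruction.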
Tuning Suggestions:
Instruction optimization: adjust the mask and repeat_times parameters so that a single vector instruction replaces one that is executed repeatedly.
Based on the value of the First PC field, you can find the corresponding repeated vector instruction in the simulation dump file. As shown in the preceding figure, the value of Execution Times is 64, the value of Repeat is 1, and that of Mask is 64. The instruction is executed 10 times. Each execution performs only one iteration, and only one element in the vector is calculated each time. To improve the execution efficiency of the instruction, change the value of repeat_times in the operator code to 10, and then set the mask value based on the number of vector bits and the actual requirements. After the modification, the instruction needs to be executed only once to complete 10 iterations, and each iteration covers all elements required in the specific scenario.
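The effect of raising repeat_times can be estimated with a simple count of instruction issues. The sketch below is a simplified model, not TIK code: it assumes each iteration processes `mask` contiguous elements, and the function name `issue_count` and the 640-element workload are hypothetical.

```python
import math

MAX_MASK = 128    # upper bound of the Mask range stated for this view
MAX_REPEAT = 255  # upper bound of the Repeat range stated for this view


def issue_count(total_elements, mask, repeat_times):
    """Estimate how many times one vector instruction must be issued.

    Simplified model (assumption): one issue handles mask * repeat_times elements.
    """
    if not 1 <= mask <= MAX_MASK:
        raise ValueError("mask must be in [1, 128]")
    if not 1 <= repeat_times <= MAX_REPEAT:
        raise ValueError("repeat_times must be in (0, 255]")
    return math.ceil(total_elements / (mask * repeat_times))


# Hypothetical workload of 640 elements.
before = issue_count(640, mask=64, repeat_times=1)
after = issue_count(640, mask=64, repeat_times=10)
print(before, after)
```

Under this model the same work drops from ten issues to one, which is the kind of reduction the Execution Times column is meant to expose.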
Dimension 2: Analyzing Pipeline Interruption
Based on the simulation dump file, analyze the pipeline that accounts for the largest proportion from the following three dimensions:
- Nonconsecutive pipeline caused by other pipelines.
- Nonconsecutive pipeline caused by adding instructions to the queue.
- Pipeline interruption caused by the pipe_barrier(PIPE_ALL) command.
Based on the analysis result, sort the cycles that are affected by the preceding three dimensions. The result is as follows:
| Field | Description |
|---|---|
| Pipeline Interruption | Analyzes pipeline interruptions. |
| Interruption Factor | Pipeline interruption factor. |
| Affected Pipeline | Affected pipelines. |
| Interruption Cycles | Cycles in which pipelines are interrupted. |
| Percentage to Total | Percentage of the number of interrupted cycles to the total number of cycles. |
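The ranking and the Percentage to Total column can be reproduced from raw cycle counts. The following is a minimal sketch under stated assumptions: the `Interruption` record, the function `rank_interruptions`, and all sample values are hypothetical and do not reflect the tool's internal format.

```python
from dataclasses import dataclass


@dataclass
class Interruption:
    factor: str    # Interruption Factor
    pipeline: str  # Affected Pipeline
    cycles: int    # Interruption Cycles


def rank_interruptions(records, total_cycles):
    """Sort records by interrupted cycles and attach the share of total cycles."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    ranked = sorted(records, key=lambda r: r.cycles, reverse=True)
    return [(r.factor, r.pipeline, r.cycles, 100.0 * r.cycles / total_cycles)
            for r in ranked]


# Hypothetical records for a run of 1000 cycles in total.
records = [
    Interruption("queue insertion", "MTE2", 100),
    Interruption("pipe_barrier(PIPE_ALL)", "VECTOR", 250),
    Interruption("other pipelines", "VECTOR", 50),
]
for row in rank_interruptions(records, total_cycles=1000):
    print(row)
```

Sorting in descending order puts the factor that costs the most cycles first, so it is the natural place to start tuning.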
Tuning Suggestions:
Dimension 3: Analyzing Scalar Operation
Based on the simulation dump file, collect statistics on the number of times that scalar instructions are executed and the total execution cycles. Then, sort the instructions by the total execution cycle and select the top 5 scalar instructions. The analysis result is as follows:
| Field | Description |
|---|---|
| Scalar Bound | Scalar operation analysis. |
| Instruction | Instruction name. |
| Execution Times | Number of times that a scalar instruction is executed. |
| Execution Cycles | Total number of cycles spent executing the scalar instruction. |
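The statistics behind this view (execution count, total cycles, top 5 by total cycles) amount to a small aggregation. The sketch below assumes the dump has already been parsed into `(instruction_name, cycles)` pairs; that input shape, the function name, and the sample trace are hypothetical.

```python
from collections import defaultdict


def top_scalar_instructions(trace, top_n=5):
    """Aggregate (name, cycles) pairs and return the top_n by total cycles.

    Each returned row is (instruction, execution_times, execution_cycles).
    """
    stats = defaultdict(lambda: [0, 0])  # name -> [times, total cycles]
    for name, cycles in trace:
        stats[name][0] += 1
        stats[name][1] += cycles
    rows = [(name, times, total) for name, (times, total) in stats.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top_n]


# Hypothetical parsed trace of scalar instructions.
trace = [("ld", 3), ("add", 1), ("ld", 4), ("mov", 2)]
for row in top_scalar_instructions(trace):
    print(row)
```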
Tuning Suggestions:
Dimension 4: Analyzing Memory Bound
Analyze the simulation dump file to find out performance bottlenecks of memory transfer.
- Small-packet transfer: The threshold of the OUT->UB/UB->OUT/L1->UB channel is not reached during an operator operation.
The threshold of the OUT->UB/UB->OUT/L1->UB channel is derived as follows: the largest vector operation adds or multiplies two 128-element FP16 vectors (2 bytes per element), that is, 128 x 2 x 2 B = 512 B.
- Redundant transfer: Redundancy = Transfer amount/Calculation amount. If the redundancy is greater than 1.2, redundant transfer exists.
The transfer amount is that from OUT to UB, and the calculation amount is the vector calculation volume.
- Bandwidth preemption: Analyze whether the execution times of OUT->UB/OUT->L1 transfer instructions overlap, and identify possible MTE2 bandwidth preemption.
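The three checks above can be expressed as simple predicates. This is a sketch under stated assumptions: all function names are hypothetical, transfer and calculation amounts are assumed to be in the same unit (bytes), and transfer instructions are assumed to be available as half-open `(start_cycle, end_cycle)` intervals.

```python
SMALL_PACKET_THRESHOLD_BYTES = 128 * 2 * 2  # 512 B, as derived in the text
REDUNDANCY_LIMIT = 1.2


def is_small_packet(transfer_bytes):
    """A transfer below the channel threshold counts as a small packet."""
    return transfer_bytes < SMALL_PACKET_THRESHOLD_BYTES


def redundancy(transferred_bytes, computed_bytes):
    """Redundancy = transfer amount / calculation amount."""
    if computed_bytes <= 0:
        raise ValueError("computed_bytes must be positive")
    return transferred_bytes / computed_bytes


def has_redundant_transfer(transferred_bytes, computed_bytes):
    return redundancy(transferred_bytes, computed_bytes) > REDUNDANCY_LIMIT


def overlapping_transfers(intervals):
    """Return pairs of (start, end) intervals whose execution times overlap."""
    ordered = sorted(intervals)
    overlaps = []
    for i, (start_a, end_a) in enumerate(ordered):
        for start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later interval can overlap
            overlaps.append(((start_a, end_a), (start_b, end_b)))
    return overlaps


# Hypothetical values for illustration only.
print(is_small_packet(256))
print(has_redundant_transfer(1536, 1024))
print(overlapping_transfers([(0, 10), (5, 15), (20, 30)]))
```

Overlapping intervals only indicate that preemption is possible; whether bandwidth is actually contended depends on the sizes of the transfers involved.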
The analysis result is as follows:
Tuning Suggestions:
