Roofline-Model-Based Operator Bottleneck Identification and Optimization Suggestions

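After the analysis runs, this feature reports its results through Workload Analysis, which compares the position of each operator's working point with the roof. The output has two parts: an Op list of every operator working in the given region, and optimization suggestions from the expert system. For each operator, the Op list gives the operator name, the operator's AI Core time as a percentage of total AI Core time (the larger the share, the more the operator is worth optimizing), the channel where the bottleneck mainly occurs, and the percentage of the current roof that the operator reaches (the larger the value, the closer the operator is to the hardware limit).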
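The comparison between a working point and the roof can be sketched with the generic Roofline formula, in which the roof at a given arithmetic intensity is the smaller of the peak compute rate and intensity times bandwidth. This is a minimal illustration only, not the tool's implementation; the function names and the hardware numbers are assumptions made for this example.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Roof height (FLOP/s) at a given arithmetic intensity (FLOPs per byte)."""
    return min(peak_flops, intensity * bandwidth)


def percent_of_roof(achieved_flops, intensity, peak_flops, bandwidth):
    """How close a working point sits to the roof directly above it, in percent."""
    return 100.0 * achieved_flops / attainable_flops(intensity, peak_flops, bandwidth)


def limiting_roof(intensity, peak_flops, bandwidth):
    """Left of the ridge point the bandwidth roof applies, otherwise the compute roof."""
    ridge = peak_flops / bandwidth
    return "memory" if intensity < ridge else "compute"


# Hypothetical hardware numbers, chosen only for illustration.
PEAK = 100e12       # FLOP/s
BANDWIDTH = 1e12    # bytes/s

print(limiting_roof(10.0, PEAK, BANDWIDTH))
print(percent_of_roof(5e12, 10.0, PEAK, BANDWIDTH))
```

A working point with a high percentage is already near the limit of the roof above it, so further gains usually require moving to a different roof, for example by raising the arithmetic intensity.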

The output is as follows:

Figure 1 Roofline model operator information list and optimization suggestions

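The output lists the basic information of each operator that has a bottleneck and provides optimization suggestions. The suggestions are as follows: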

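Memory Bound

# Memory bottleneck.

Change the data access path to one with higher bandwidth
# Switch to a data access path that offers higher bandwidth.
Reduce the amount of repeated data migration and increase FLOPS/BYTES
# Move the same data fewer times so that FLOPS/BYTES (arithmetic intensity) rises.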
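As a rough illustration of how less repeated data movement raises FLOPS/BYTES, the sketch below compares computing (a + b) * c as two separate element-wise passes, where the intermediate result is written out and read back, with a single fused pass. The function names, the 4-byte element size and the traffic model are assumptions for this example, not something the tool reports.

```python
def unfused_intensity(n, elem_bytes=4):
    """Two passes: t = a + b, then y = t * c. The intermediate t is written and re-read."""
    flops = 2 * n                         # n additions + n multiplications
    elements_moved = (2 * n + n) + (2 * n + n)   # each pass reads 2n and writes n
    return flops / (elements_moved * elem_bytes)


def fused_intensity(n, elem_bytes=4):
    """One pass: y = (a + b) * c. The intermediate never leaves on-chip storage."""
    flops = 2 * n
    elements_moved = 3 * n + n            # read a, b, c once; write y once
    return flops / (elements_moved * elem_bytes)


n = 1_000_000
print(unfused_intensity(n), fused_intensity(n))
```

The FLOP count is identical in both cases; only the bytes moved change, which shifts the working point to the right on the Roofline chart.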

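Compute Bound

# Compute bottleneck.

Change calculation units, for example, replace Vector with Cube
# Use a different compute unit, for example Cube instead of Vector.
Adopt low-precision computing
# Compute in lower precision.
Use dual-core
# Use both cores for the computation.
Optimize the algorithms to reduce the computation amount
# Rework the algorithm so that it needs fewer operations.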
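One familiar way to reduce the computation amount without changing the result is to reorder a matrix chain. The sketch below counts FLOPs for (A·B)·v and A·(B·v), counting each multiply-add as 2 FLOPs; the shapes and function names are illustrative assumptions.

```python
def flops_matmul_then_vec(m, k, n):
    """(A @ B) @ v with A: m x k, B: k x n, v: length n."""
    return 2 * m * k * n + 2 * m * n      # matrix-matrix product, then matrix-vector


def flops_vec_first(m, k, n):
    """A @ (B @ v): two matrix-vector products, same mathematical result."""
    return 2 * k * n + 2 * m * k


m = k = n = 1000
print(flops_matmul_then_vec(m, k, n), flops_vec_first(m, k, n))
```

For square 1000 x 1000 matrices, the second ordering needs several hundred times fewer operations.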

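Low Pipeline

# Low pipeline utilization.

Use the double buffer
# Use a ping-pong (double-buffer) strategy.
Reduce strong data dependencies between pipelines
# Rework unreasonable dependencies between pipelines.
Eliminating improper instruction synchronization between pipelines
# Remove unreasonable instruction synchronization between pipelines.
Delete redundant pipe_barrier(PIPE_ALL).
# Delete redundant pipe_barrier(PIPE_ALL) instructions.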
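The effect of double buffering on pipeline utilization can be sketched with a small timeline simulation of two stages, a copy-in stage and a compute stage, that share a fixed number of buffers. This is a toy model under stated assumptions (constant per-tile times, one copy engine, one compute unit, no copy-out stage); the function name and parameters are hypothetical.

```python
def pipeline_time(n_tiles, t_copy, t_compute, n_buffers):
    """Total time to copy in and compute n_tiles tiles using n_buffers buffers."""
    copy_done = [0.0] * n_tiles
    comp_done = [0.0] * n_tiles
    for i in range(n_tiles):
        # The copy engine is busy until the previous copy has finished.
        start = copy_done[i - 1] if i > 0 else 0.0
        # The target buffer is free only after the tile that last used it was computed.
        if i >= n_buffers:
            start = max(start, comp_done[i - n_buffers])
        copy_done[i] = start + t_copy
        # Compute needs its own data and the compute unit to be free.
        prev_comp = comp_done[i - 1] if i > 0 else 0.0
        comp_done[i] = max(copy_done[i], prev_comp) + t_compute
    return comp_done[-1]


print(pipeline_time(4, 2.0, 3.0, n_buffers=1))  # single buffer: copy and compute alternate
print(pipeline_time(4, 2.0, 3.0, n_buffers=2))  # double buffer: copy overlaps compute
```

With one buffer the two stages serialize. With two buffers the next tile is copied while the current one is computed, so the slower stage stays busy almost the whole time.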

Latency Compute Bound

# Potential compute bottleneck.

Increase the number of repeats computed by Vector instructions
# Increase the repeat count handled by each Vector instruction.
Check whether the mask setting is proper
# Check that the mask is set sensibly.
Check bank conflict
# Check for bank conflicts.
Use high-performance instructions to replace low-performance instructions
# Replace low-performance instructions with high-performance ones.
Reduce the use of long-running instructions
# Use fewer instructions that take a long time to run.
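To see why the repeat count and the mask matter, the sketch below counts how many Vector instructions are needed for a given amount of data, and what fraction of each repeat does useful work under a given mask. The figure of 128 elements per repeat is an illustrative assumption, as are the function names; check the actual limits for your hardware and data type.

```python
import math


def vector_instruction_count(n_elements, elems_per_repeat, repeats_per_instr):
    """Instructions needed when each one covers repeats_per_instr repeats."""
    per_instruction = elems_per_repeat * repeats_per_instr
    return math.ceil(n_elements / per_instruction)


def lane_utilization(active_elems, elems_per_repeat):
    """Fraction of a repeat's elements that the mask leaves enabled."""
    return active_elems / elems_per_repeat


ELEMS_PER_REPEAT = 128  # assumed value, for illustration only

print(vector_instruction_count(8192, ELEMS_PER_REPEAT, 1))
print(vector_instruction_count(8192, ELEMS_PER_REPEAT, 64))
print(lane_utilization(64, ELEMS_PER_REPEAT))
```

Fewer instructions with more repeats each means less per-instruction issue overhead, and a mask that enables only half the elements leaves half of each repeat idle.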

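Latency Memory Bound

# Potential memory bottleneck.

Check whether data migration granularity/burst length/burst number are too small
# Check whether the data movement granularity is too small.
Reduce unreasonable blocks inside the pipeline
# Reduce unreasonable stalls inside the pipeline.
Avoid read/write resource preemption
# Avoid contention for read/write resources.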
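Why a small movement granularity hurts can be illustrated with a simple cost model in which every transfer pays a fixed start-up latency plus a size-dependent transfer time. The model, the function name and the numbers are assumptions for illustration; real hardware behaviour is more complex.

```python
def effective_bandwidth(burst_bytes, fixed_latency_s, peak_bandwidth):
    """Bytes per second actually achieved by one transfer of burst_bytes."""
    transfer_time = fixed_latency_s + burst_bytes / peak_bandwidth
    return burst_bytes / transfer_time


# Hypothetical link: 1 microsecond start-up cost, 1 GB/s peak.
for size in (64, 1_000, 64_000, 1_000_000):
    print(size, effective_bandwidth(size, 1e-6, 1e9))
```

In this model, small transfers are dominated by the fixed latency and reach only a fraction of the peak bandwidth, while larger transfers approach it.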

Figure 2 Roofline model performance analysis summary

Model Bound Coefficient

# Model bottleneck coefficient.

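Percentage Of Total Op Num: share of the total number of operators.
Percentage Of AICore Time: share of the total AI Core time.
Coefficent: bottleneck coefficient, the weighted average over all operators.
Performance: performance rating, either Good or Bad. A Coefficent above 0.8 is Good; a Coefficent below 0.8 is Bad.
Memory Bound: memory bottleneck.
Compute Bound: compute bottleneck.
Low Pipeline: low pipeline utilization.
Latency Compute Bound: potential compute bottleneck.
Latency Memory Bound: potential memory bottleneck.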
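The relationship between the per-operator values and the Good/Bad rating can be sketched as follows. The text above does not say which weights the weighted average uses, nor how a value of exactly 0.8 is rated; this sketch assumes that AI Core time is the weight and treats exactly 0.8 as Bad. Both choices, and the function names, are assumptions.

```python
def model_bound_coefficient(ops):
    """Weighted average of per-operator coefficients.

    ops: list of (aicore_time, coefficient) pairs.
    Assumption: each operator is weighted by its AI Core time.
    """
    total_time = sum(time for time, _ in ops)
    return sum(time * coeff for time, coeff in ops) / total_time


def performance_label(coefficient, threshold=0.8):
    """Good above the threshold, Bad otherwise (exactly 0.8 is an assumed Bad)."""
    return "Good" if coefficient > threshold else "Bad"


ops = [(3.0, 1.0), (1.0, 0.6)]   # hypothetical operators
coeff = model_bound_coefficient(ops)
print(coeff, performance_label(coeff))
```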
