Roofline-Model-Based Operator Bottleneck Identification and Optimization Suggestions

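After the analysis runs, this feature reports its results through Workload Analysis, which compares the position of each operating point relative to the roof. The output includes: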

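Op list information (lists every operator working in this region):
- Operator name
- The operator's AI Core time as a percentage of total AI Core time (the larger the share, the more the operator is worth optimizing)
- The channel where the bottleneck mainly occurs
- Percentage relative to the current roof (a larger value means the operator is closer to the hardware upper limit)

Expert system optimization suggestions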
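The "percentage relative to the current roof" can be illustrated with the textbook roofline formula. The sketch below is not the tool's implementation: the function names, the single-roof simplification, and the sample hardware numbers are all assumptions made for illustration.

```python
def attainable_flops(intensity, peak_flops, bandwidth):
    """Textbook roofline: the roof height at a given arithmetic intensity.

    intensity  -- FLOPs per byte moved (FLOPS/BYTES)
    peak_flops -- peak compute throughput in FLOP/s (hypothetical value)
    bandwidth  -- bandwidth of the data access path in bytes/s (hypothetical value)
    """
    return min(peak_flops, intensity * bandwidth)


def roof_ratio(achieved_flops, intensity, peak_flops, bandwidth):
    """Fraction of the roof that the operating point reaches (0.0 to 1.0)."""
    return achieved_flops / attainable_flops(intensity, peak_flops, bandwidth)


def textbook_bound(intensity, peak_flops, bandwidth):
    """Classic two-way split: left of the ridge point is memory bound."""
    ridge = peak_flops / bandwidth
    return "memory" if intensity < ridge else "compute"


# Hypothetical operator: 2e9 FLOPs, 1e8 bytes moved, 0.5 s of AI Core time.
flops, moved_bytes, seconds = 2e9, 1e8, 0.5
peak, bw = 1e11, 1e9  # assumed hardware limits, not real device figures

intensity = flops / moved_bytes          # 20 FLOPs per byte
achieved = flops / seconds               # 4e9 FLOP/s
print(roof_ratio(achieved, intensity, peak, bw))   # 0.2
print(textbook_bound(intensity, peak, bw))         # memory
```

In this example the operating point sits at 20% of the roof, so it is far from the hardware limit. The tool itself distinguishes more categories than this two-way split; those criteria are not described here.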

The output looks as follows:

Figure 1 Operator information list and optimization suggestions from the Roofline model

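The output lists the basic information of each operator that has a bottleneck and provides optimization suggestions. The suggestions are as follows: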

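Table 1 Memory Bound (memory bottleneck)
Suggestion output	Meaning
Change the data access path to one with higher bandwidth	Switch to a data access path that offers higher bandwidth.
Reduce the amount of repeated data migration and increase FLOPS/BYTES	Move less data repeatedly so that FLOPS/BYTES increases.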

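Table 2 Compute Bound (compute bottleneck)
Suggestion output	Meaning
Change calculation units, for example, replace Vector with Cube	Use a different compute unit, for example Cube instead of Vector.
Adopt low-precision computing	Compute in lower precision.
Use dual-core	Use dual-core computing.
Optimize the algorithms to reduce the computation amount	Optimize the algorithm to reduce the amount of computation.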

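Table 3 Low Pipeline (low pipeline utilization)
Suggestion output	Meaning
Use the double buffer	Use the ping-pong (double buffering) strategy.
Reduce strong data dependencies between pipelines	Rework unreasonable dependencies between pipelines.
Eliminating improper instruction synchronization between pipelines	Remove unreasonable instruction synchronization between pipelines.
Delete redundant pipe_barrier(PIPE_ALL).	Delete redundant pipe_barrier(PIPE_ALL) instructions.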
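To show why the double buffer suggestion raises pipeline utilization, here is an idealized timing model. It is a sketch under stated assumptions, not a description of the hardware: each tile needs one copy-in and one compute step, the per-tile costs are constant, copy-out is ignored, and the function names are hypothetical.

```python
def serial_time(n_tiles, t_copy, t_compute):
    """One buffer: every copy must wait for the previous compute to finish."""
    return n_tiles * (t_copy + t_compute)


def double_buffer_time(n_tiles, t_copy, t_compute):
    """Two buffers: the copy of tile i+1 overlaps the compute of tile i.

    Only the first copy and the last compute are fully exposed; each of the
    n_tiles - 1 steps in between costs the slower of the two stages.
    """
    if n_tiles == 0:
        return 0
    return t_copy + (n_tiles - 1) * max(t_copy, t_compute) + t_compute


# Hypothetical workload: 8 tiles, 3 time units per copy, 5 per compute.
print(serial_time(8, 3, 5))         # 64
print(double_buffer_time(8, 3, 5))  # 43
```

In this model the gain is largest when copy and compute take similar time, because the faster stage is hidden entirely behind the slower one.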

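Table 4 Latency Compute Bound (potential compute bottleneck)
Suggestion output	Meaning
Increase the number of repeats computed by Vector instructions	Increase the repeat count of Vector instructions.
Check whether the mask setting is proper	Check whether the mask is set appropriately.
Check bank conflict	Check for bank conflicts.
Use high-performance instructions to replace low-performance instructions	Replace low-performance instructions with high-performance ones.
Reduce the use of long-running instructions	Use fewer long-running instructions.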

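Table 5 Latency Memory Bound (potential memory bottleneck)
Suggestion output	Meaning
Check whether data migration granularity/burst length/burst number are too small	Check whether the data transfer granularity is too small.
Reduce unreasonable blocks inside the pipeline	Reduce unreasonable stalls inside the pipeline.
Avoid read/write resource preemption	Avoid contention for read/write resources.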

Figure 2 Roofline model performance analysis summary

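Table 6 Model Bound Coefficient
Field	Description
Percentage Of Total Op Num	Share of the total number of operators.
Percentage Of AICore Time	Share of the total AI Core time.
Coefficient	Bottleneck coefficient, the weighted average over all operators.
Performance	Performance rating, either Good or Bad: Good when Coefficient is greater than 0.8, Bad when it is less than 0.8.
Memory Bound	Memory bottleneck.
Compute Bound	Compute bottleneck.
Low Pipeline	Low pipeline utilization.
Latency Compute Bound	Potential compute bottleneck.
Latency Memory Bound	Potential memory bottleneck.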
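The relationship between the bottleneck coefficient and the Good/Bad rating can be sketched as follows. The table states only that the coefficient is a weighted average over all operators, so several details here are assumptions: the weights are taken to be each operator's AI Core time, the per-operator coefficient is a hypothetical input, and a value of exactly 0.8 (which the table does not cover) is treated as Bad.

```python
def model_bound_coefficient(ops):
    """Weighted average of per-operator coefficients.

    ops -- list of (coefficient, aicore_time) pairs; using AI Core time as
           the weight is an assumption, not something the tool documents.
    """
    total_time = sum(time for _, time in ops)
    if total_time == 0:
        raise ValueError("total AI Core time must be positive")
    return sum(coef * time for coef, time in ops) / total_time


def performance_label(coefficient, threshold=0.8):
    """Good above the threshold, Bad otherwise (the tie case is an assumption)."""
    return "Good" if coefficient > threshold else "Bad"


# Hypothetical model with two operators.
ops = [(0.9, 30.0), (0.6, 10.0)]
coef = model_bound_coefficient(ops)   # (0.9 * 30 + 0.6 * 10) / 40
print(performance_label(coef))        # Good
```

Under a time-weighted average, the operator with the larger share of AI Core time dominates the result, which is consistent with the list's note that such operators are the most worthwhile to optimize.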
