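Training-scenario optimization recommendations identify bottlenecks and give optimization suggestions for three phases: data augmentation, forward and backward computation (FP/BP), and gradient update. They also list pre-training optimizations to check.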

Figure 1 Optimization recommendations based on the training scenario

Pre-training optimizations

Pre Optimizations
1. Check whether to enable iteration offload
2. Check whether to enable AOE
3. Check whether to enable mixed precision
4. Check whether to enable training process core binding

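Optimization suggestion: Confirm whether the pre-optimizations have been applied. They include training iteration loop offload, AOE automatic tuning, mixed precision, and training process core binding.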

Data augmentation optimization recommendations

Data Augmentation Result Data
Data Augmentation Percentage: 1.446134; Data Augmentation Threshold: 0.100000;

Data Augmentation Optimizations
1. Host side: Dot the sess run and data processing phases in the training script and analyze the performance bottleneck based on the dotting time
2. Host side: Use the same platform (x86 or ARM) for comparison
3. Device side: Check the operator time consumptions

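Bottleneck identification: The data augmentation phase is time-consuming. Optimization is recommended when its share of time exceeds the threshold of 0.1.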
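The threshold comparison can be reproduced from a result line with a few lines of Python. This is a minimal sketch, assuming the result data is available as text in the "Name: value;" form shown in the sample output; the helper names `parse_report_line` and `exceeds_threshold` are hypothetical and are not part of the tool.

```python
def parse_report_line(line):
    """Parse 'Name: value; Name: value;' pairs into a dict of floats."""
    fields = {}
    for part in line.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the last colon so names such as "AR1 Start Time(us)" stay intact.
        name, _, value = part.rpartition(":")
        fields[name.strip()] = float(value)
    return fields


def exceeds_threshold(fields, metric):
    """Return True when '<metric> Percentage' is above '<metric> Threshold'."""
    return fields[f"{metric} Percentage"] > fields[f"{metric} Threshold"]


# Sample line copied from the result data above.
line = "Data Augmentation Percentage: 1.446134; Data Augmentation Threshold: 0.100000;"
fields = parse_report_line(line)
print(exceeds_threshold(fields, "Data Augmentation"))
```

The same helper can be pointed at the FPBP or AR1 result lines, since they follow the same "Percentage" and "Threshold" naming.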

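Optimization suggestion: Check the processor type of the host computing platform. The single-core performance of ARM is weaker than that of x86, so compare results on the same platform.

Add timing points to the sess run, data processing, and other phases after the training script enters the iteration loop, and analyze the performance bottleneck based on the recorded times. If data processing is confirmed to be time-consuming inside the loop, optimize the script, for example by offloading iterations, processing multiple pieces of data at a time, or increasing parallelism.
Some data augmentation operations may run on the device side, so their time may be related to the time consumed by AI CPU operators.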
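Timing points on the host side can be added with a small context manager that accumulates wall-clock time per phase. The sketch below is illustrative only: `load_batch` and `run_step` are hypothetical stand-ins for the script's real data processing and its `sess.run(...)` call, and the sleeps merely simulate work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase name.
phase_totals = defaultdict(float)


@contextmanager
def timed(phase):
    """Add the elapsed time of the enclosed block to phase_totals[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - start


def load_batch(step):
    """Hypothetical stand-in for host-side data processing."""
    time.sleep(0.002)
    return [step] * 4


def run_step(batch):
    """Hypothetical stand-in for sess.run(train_op, feed_dict=...)."""
    time.sleep(0.001)
    return sum(batch)


for step in range(5):
    with timed("data_processing"):
        batch = load_batch(step)
    with timed("sess_run"):
        run_step(batch)

total = sum(phase_totals.values())
for phase, seconds in phase_totals.items():
    print(f"{phase}: {seconds:.4f}s ({seconds / total:.0%} of timed work)")
```

If the data processing share dominates, the script-level options listed above (iteration offload, batching several items per call, more parallelism) are the ones to try first.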

Forward and backward computation optimization recommendations

FPBP Result Data
FPBP Percentage: 0.686843; FPBP Threshold: 0.500000; AiCpu Percentage: 0.985026; AiCpu Threshold: 0.500000;

Top 5 AICPU Operator
Op Name: xxx
...

FPBP Optimizations
1. Convert AI CPU operators into AI Core operators

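Bottleneck identification:

The check covers the time consumed by the FP_BP phase and the execution time of AI CPU or AI Core operators. Optimization is recommended when the share of time exceeds the threshold of 0.5.

Optimization suggestion:

If AI CPU operators account for a high share of execution time, convert the top AI CPU operators into AI Core operators. If AI Core operators account for a high share, refer to the UB model and Roofline model analyses for operator fusion optimization and for operator bottleneck identification and optimization recommendations.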
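The decision described above can be written as a small function. This is only a sketch of the logic as stated in this section, not the tool's implementation: the function name `fpbp_advice` is hypothetical, and treating "AI CPU share not above its threshold" as "AI Core operators dominate" is an assumption made for illustration.

```python
def fpbp_advice(fpbp_pct, fpbp_threshold, aicpu_pct, aicpu_threshold):
    """Return suggestions for the FP/BP phase given the reported percentages."""
    # No suggestion when the FP/BP share stays at or below its threshold.
    if fpbp_pct <= fpbp_threshold:
        return []
    # AI CPU operators dominate: suggest moving the top ones to AI Core.
    if aicpu_pct > aicpu_threshold:
        return ["Convert AI CPU operators into AI Core operators"]
    # Assumption: otherwise AI Core operators are the ones worth analyzing.
    return ["Analyze AI Core operators with the UB model and Roofline model"]


# Values copied from the FPBP Result Data sample above.
print(fpbp_advice(0.686843, 0.5, 0.985026, 0.5))
```

With the sample values, both shares are above 0.5, which matches the single suggestion listed under FPBP Optimizations.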

Gradient update optimization recommendations

GradReference Result Data
AR1 Start Time(us): 51518599168.000000; AR1 Duration Time(us): 4732.520000; BP End Time(us): 51518599168.000000; AR1 Percentage: 0.587337; AR1 Threshold: 0.500000;

GradReference Optimizations
1. AR1 is not hidden in the BP phase. Optimize the segmentation policy

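Bottleneck identification:

AR does not run in parallel with FPBP, or the AR segmentation ratio is unreasonable. Optimization is recommended when the non-parallel part of AR after segmentation accounts for more than the threshold of 0.5.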
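Whether an AR segment (taken here to mean AllReduce) is hidden behind BP can be estimated from its start time, its duration, and the BP end time. The sketch below computes the fraction of one AR segment that runs after BP has finished. It is an illustration under that assumption only; the source does not specify how the tool derives AR1 Percentage, and this function (hypothetical name `unhidden_fraction`) is not claimed to reproduce that value.

```python
def unhidden_fraction(ar_start_us, ar_duration_us, bp_end_us):
    """Fraction of an AR segment that runs after BP has ended (0.0 to 1.0)."""
    if ar_duration_us <= 0:
        return 0.0
    ar_end_us = ar_start_us + ar_duration_us
    # Time between the later of (AR start, BP end) and the AR end is exposed.
    exposed_us = ar_end_us - max(ar_start_us, bp_end_us)
    return min(1.0, max(0.0, exposed_us / ar_duration_us))


# Timestamps copied from the GradReference Result Data sample:
# AR1 starts exactly when BP ends, so none of it overlaps BP.
print(unhidden_fraction(51518599168.0, 4732.52, 51518599168.0))

# A segment that starts halfway before BP ends is half hidden.
print(unhidden_fraction(95.0, 10.0, 100.0))
```

A segment that starts at or after the BP end time is fully exposed, which corresponds to the "AR1 is not hidden in the BP phase" message in the sample output.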

Optimization suggestion:

Enable the hcom_parallel option.
Adjust the AR segmentation and fusion policy.