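For training scenarios, the optimization recommendations report bottleneck identification and optimization suggestions for three phases: data augmentation, forward and backward propagation (FP/BP), and gradient update. They also provide pre-training optimization suggestions.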
Pre Optimizations
1. Check whether to enable iteration offload
2. Check whether to enable AOE
3. Check whether to enable mixed precision
4. Check whether to enable training process core binding
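Optimization suggestion: Confirm whether the pre-optimizations have been applied. They include training iteration loop offload, AOE auto tuning, mixed precision, and training process core binding.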
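Of the pre-optimizations listed above, core binding can be illustrated without any framework-specific API. The following is a minimal sketch, assuming a Linux host where Python exposes `os.sched_setaffinity`; the helper name `bind_process_to_cores` is hypothetical and is not part of the advisor tool or the training framework.

```python
import os


def bind_process_to_cores(cores, pid=0):
    """Pin process `pid` (0 means the calling process) to the given CPU cores.

    Returns the resulting affinity set, or None on platforms that do not
    expose sched_setaffinity (for example macOS or Windows).
    Hypothetical helper for illustration only.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(pid, set(cores))
    return os.sched_getaffinity(pid)
```

A training launcher might call `bind_process_to_cores(range(0, 8))` before starting the training loop. Which cores to choose depends on the host topology and is not specified here.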
Data Augmentation Result Data
Data Augmentation Percentage: 1.446134;
Data Augmentation Threshold: 0.100000;
Data Augmentation Optimizations
1. Host side: Dot the sess run and data processing phases in the training script and analyze the performance bottleneck based on the dotting time
2. Host side: Use the same platform (x86 or ARM) for comparison
3. Device side: Check the operator time consumptions
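Bottleneck identification: The data augmentation phase takes a long time. Optimization is recommended when its share of the time exceeds the threshold of 0.1.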
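Optimization suggestion: Confirm the processor type of the host computing platform. Single-core performance on ARM is weaker than on x86.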
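The host-side advice in the tool output (adding timing points around the sess run and data processing phases) can be sketched as follows. This is an illustration under stated assumptions: the names `timed`, `phase_totals`, and `phase_shares` are hypothetical, and the share computed here is simply each phase's fraction of the measured total, which is not necessarily how the tool derives its Data Augmentation Percentage.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per phase name (hypothetical helper state).
phase_totals = defaultdict(float)


@contextmanager
def timed(phase):
    """Add the wall-clock time of the enclosed block to phase_totals[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[phase] += time.perf_counter() - start


def phase_shares(totals):
    """Return each phase's fraction of the summed time across all phases."""
    whole = sum(totals.values())
    return {name: (t / whole if whole else 0.0) for name, t in totals.items()}


if __name__ == "__main__":
    for _ in range(3):
        with timed("data_processing"):
            batch = [x * 2 for x in range(100_000)]  # stand-in for the input pipeline
        with timed("sess_run"):
            sum(batch)  # stand-in for the actual session run call
    print(phase_shares(phase_totals))
```

Running the same instrumented script on an x86 host and on an ARM host gives comparable per-phase numbers, which is the comparison the second host-side suggestion asks for.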
FPBP Result Data
FPBP Percentage: 0.686843;
FPBP Threshold: 0.500000;
AiCpu Percentage: 0.985026;
AiCpu Threshold: 0.500000;
Top 5 AICPU Operator
Op Name: xxx
...
FPBP Optimizations
1. Convert AI CPU operators into AI Core operators
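Bottleneck identification:
The FP_BP phase duration and the execution time of AI CPU or AI Core operators are examined. Optimization is recommended when a time share exceeds the threshold of 0.5.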
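Optimization suggestion:
If AI CPU operators account for a high share of the execution time, convert the top AI CPU operators into AI Core operators. If AI Core operators account for a high share, refer to the UB model and Roofline model analyses for operator fusion optimization and for operator bottleneck identification and optimization recommendations.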
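The threshold rule used throughout this section (optimize when a percentage exceeds its threshold) can be checked mechanically against a Result Data block. The sketch below assumes the output keeps the `Name: value;` layout shown in the samples and that each `X Percentage` has a matching `X Threshold`. The function names `parse_result_data` and `exceeded` are hypothetical and are not part of the tool.

```python
def parse_result_data(text):
    """Collect numeric 'Name: value;' fields from a Result Data block."""
    metrics = {}
    for field in text.replace("\n", ";").split(";"):
        name, sep, value = field.partition(":")
        if not sep:
            continue  # no colon, e.g. a section title
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue  # non-numeric field, e.g. "Op Name: xxx"
    return metrics


def exceeded(metrics):
    """Return the names whose '<name> Percentage' is above '<name> Threshold'."""
    flagged = []
    for key, value in metrics.items():
        if key.endswith(" Percentage"):
            name = key[: -len(" Percentage")]
            threshold = metrics.get(name + " Threshold")
            if threshold is not None and value > threshold:
                flagged.append(name)
    return flagged


if __name__ == "__main__":
    sample = (
        "FPBP Percentage: 0.686843; FPBP Threshold: 0.500000; "
        "AiCpu Percentage: 0.985026; AiCpu Threshold: 0.500000;"
    )
    print(exceeded(parse_result_data(sample)))
```

With the sample values from the FPBP output, both `FPBP` and `AiCpu` are above 0.5, so both are flagged, which matches the tool's recommendation to convert AI CPU operators.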
GradReference Result Data
AR1 Start Time(us): 51518599168.000000;
AR1 Duration Time(us): 4732.520000;
BP End Time(us): 51518599168.000000;
AR1 Percentage: 0.587337;
AR1 Threshold: 0.500000;
GradReference Optimizations
1. AR1 is not hidden in the BP phase. Optimize the segmentation policy
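Bottleneck identification:
AR does not run in parallel with FPBP, or the AR segmentation ratio is unreasonable. Optimization is recommended when the share of time not overlapped after AR segmentation exceeds the threshold of 0.5.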
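The notion of AR time that is "not hidden in the BP phase" can be made concrete from the three timestamps in the sample output. The sketch below computes only the portion of an AR segment that runs after BP has ended. It assumes AR denotes the gradient AllReduce. The function name is hypothetical, and the denominator the tool uses to turn this duration into `AR1 Percentage` is not stated in this section, so no percentage is computed.

```python
def exposed_allreduce_us(ar_start_us, ar_duration_us, bp_end_us):
    """Return the part of an AR segment (in microseconds) not overlapped by BP.

    Any AR time before bp_end_us counts as hidden behind BP; any AR time
    after bp_end_us counts as exposed.
    """
    ar_end_us = ar_start_us + ar_duration_us
    return max(0.0, ar_end_us - max(ar_start_us, bp_end_us))


if __name__ == "__main__":
    # Values from the GradReference sample: AR1 starts exactly when BP ends,
    # so essentially the whole AR1 duration is exposed.
    print(exposed_allreduce_us(51518599168.0, 4732.52, 51518599168.0))
```

In the sample, AR1 starts at the same timestamp at which BP ends, so none of it overlaps BP. This is consistent with the tool's message that AR1 is not hidden in the BP phase.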
Optimization suggestion: