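Operator Fusion Recommendations
UB Operator Fusion
The tool outputs a list of operator combinations that could potentially be fused, as shown in Figure 1.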
====UBModel====
UB fusion operators need to be optimized
# Operators that need UB fusion
Identifications of UB fusion operators: the following operators can be used for UB fusion:
# Identification of UB fusion operators: the following operators can be used for UB fusion:
List of operators that can be fused in subgraph 0
# List of operators that can be fused in subgraph 0
Fusion Type: DepthwiseConv2D+Mul; Fusion Operator Detail: block2a_dwconv/depthwiseblock2a_activation/Sigmoidblock2a_activation/mul, block2a_se_excite/mul; Fusion Operator Duration: 336.248993;
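# Fusion Type: type of the fusible operator combination; Fusion Operator Detail: the operators in the combination; Fusion Operator Duration: total execution time of the fusible operators.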
Fusion Type: DepthwiseConv2D+Mul; Fusion Operator Detail: block2b_dwconv/depthwiseblock2b_activation/Sigmoidblock2b_activation/mul, block2b_se_excite/mul; Fusion Operator Duration: 284.063004;
Fusion Type: DepthwiseConv2D+Mul; Fusion Operator Detail: block3b_dwconv/depthwiseblock3b_activation/Sigmoidblock3b_activation/mul, block3b_se_excite/mul; Fusion Operator Duration: 220.571999;

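The results are sorted in descending order of the total execution time of the fusible operators, with entries of the same fusion type grouped together.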
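As a minimal sketch of how this report could be post-processed, the Python below parses the UB fusion lines and reproduces the ordering just described. It assumes one plausible reading of that ordering: entries of the same fusion type stay together, groups are ranked by their longest duration, and entries within a group are sorted by duration in descending order. The names parse_ub_report and order_like_report are hypothetical helpers, not part of any CANN tool.

```python
import re
from collections import defaultdict

# One "Fusion Type: ...; Fusion Operator Detail: ...; Fusion Operator Duration: ...;" line.
UB_LINE_RE = re.compile(
    r"Fusion Type:\s*(?P<type>[^;]+);\s*"
    r"Fusion Operator Detail:\s*(?P<detail>.+?);\s*"
    r"Fusion Operator Duration:\s*(?P<duration>[\d.]+);"
)


def parse_ub_report(text):
    """Return (fusion_type, detail, duration) for every UB fusion line in text."""
    entries = []
    for line in text.splitlines():
        m = UB_LINE_RE.search(line)
        if m:  # headers and "# ..." annotation lines do not match
            entries.append((m["type"].strip(), m["detail"].strip(), float(m["duration"])))
    return entries


def order_like_report(entries):
    """Group entries by fusion type, sort each group by duration (descending),
    then rank the groups by their longest duration (descending).
    This ordering rule is an assumption, not the tool's documented algorithm."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry[0]].append(entry)
    for group in groups.values():
        group.sort(key=lambda e: e[2], reverse=True)
    ranked = sorted(groups.values(), key=lambda g: g[0][2], reverse=True)
    return [entry for group in ranked for entry in group]


if __name__ == "__main__":
    # The three lines from the report above, deliberately shuffled.
    report = (
        "Fusion Type: DepthwiseConv2D+Mul; Fusion Operator Detail: block3b_dwconv/depthwiseblock3b_activation/Sigmoidblock3b_activation/mul, block3b_se_excite/mul; Fusion Operator Duration: 220.571999;\n"
        "Fusion Type: DepthwiseConv2D+Mul; Fusion Operator Detail: block2a_dwconv/depthwiseblock2a_activation/Sigmoidblock2a_activation/mul, block2a_se_excite/mul; Fusion Operator Duration: 336.248993;\n"
        "Fusion Type: DepthwiseConv2D+Mul; Fusion Operator Detail: block2b_dwconv/depthwiseblock2b_activation/Sigmoidblock2b_activation/mul, block2b_se_excite/mul; Fusion Operator Duration: 284.063004;\n"
    )
    for fusion_type, detail, duration in order_like_report(parse_ub_report(report)):
        print(f"{duration:.6f}  {fusion_type}  {detail}")
```

Run on the three report lines, this prints them in the order 336.248993, 284.063004, 220.571999, matching the order shown in the output above.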
Optimization suggestion:
Fuse the operators according to the combinations listed in the output.
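First-Layer Operator Fusion
The tool outputs a list of operator combinations that could potentially be fused, as shown in Figure 2.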
====AippFusionModel====
Fuse Cast/TransData with Conv needs to be optimized
# Operators that need AIPP first-layer fusion
Identifications of AIPP fusion operators: the following operators can be used for AIPP fusion:
# Identification of AIPP fusion operators: the following operators can be used for AIPP fusion:
List of operators that can be fused with aipp
# List of operators that can be fused with AIPP
1. trans_Cast_0+trans_TransData_1+stem_conv/convolutionstem_activation/Sigmoidstem_activation/mul
Optimization suggestion:
Fuse the operators according to the combinations listed in the output.
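The AIPP recommendation above concerns a leading chain of Cast/TransData operators that feeds the first convolution (trans_Cast_0+trans_TransData_1+stem_conv/...). The sketch below shows one hypothetical way to split such an entry into operator names and to detect that pattern in a flat list of (name, op_type) pairs. The op-type sets, the linear-order simplification, and both function names are assumptions for illustration; a real tool would work on the model graph rather than a flat list.

```python
import re

# Hypothetical op-type sets: ops that may be absorbed into AIPP preprocessing,
# and the convolution types they must feed.
PRE_OPS = {"Cast", "TransData"}
CONV_OPS = {"Conv2D"}


def parse_aipp_entry(line):
    """Split a numbered entry such as '1. a+b+c' into ['a', 'b', 'c']."""
    m = re.match(r"\s*\d+\.\s*(.+)", line)
    if m is None:
        return []
    return [name.strip() for name in m.group(1).split("+")]


def find_first_layer_fusion(ops):
    """ops: (name, op_type) pairs in execution order, starting at the model input.
    Return the names of a leading Cast/TransData run plus the convolution that
    follows it, or [] if the model does not start with that pattern."""
    i = 0
    while i < len(ops) and ops[i][1] in PRE_OPS:
        i += 1
    if i == 0 or i == len(ops) or ops[i][1] not in CONV_OPS:
        return []
    return [name for name, _ in ops[: i + 1]]


if __name__ == "__main__":
    entry = ("1. trans_Cast_0+trans_TransData_1+"
             "stem_conv/convolutionstem_activation/Sigmoidstem_activation/mul")
    print(parse_aipp_entry(entry))
    # Hypothetical op sequence for the start of a model like the one above.
    ops = [("trans_Cast_0", "Cast"), ("trans_TransData_1", "TransData"),
           ("stem_conv/convolution", "Conv2D"), ("stem_activation/Sigmoid", "Sigmoid")]
    print(find_first_layer_fusion(ops))
```

The detector only accepts a Cast/TransData run that is immediately followed by a convolution, which mirrors the "Fuse Cast/TransData with Conv" wording of the report.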
L2 Fusion (Dynamic Batch Splitting)
====L2Model====
L2 fusion operators need to be optimized
# L2 fusion operators need to be optimized
Identifications of L2 fusion operators: the following operators can be used for L2 fusion:
# Identification of L2 fusion operators: the following operators can be used for L2 fusion:
List of operators that can be fused in subgraph
# List of operators that can be fused in the subgraph
1. MobilenetV3/expanded_conv/project/Conv2DMobilenetV3/expanded_conv/add, MobilenetV3/expanded_conv_1/expand/Conv2DMobilenetV3/expanded_conv_1/expand/Relu, MobilenetV3/expanded_conv_1/depthwise/depthwiseMobilenetV3/expanded_conv_1/depthwise/Relu, MobilenetV3/expanded_conv_1/project/Conv2D
Op Info
# Operator information
Op Name: MobilenetV3/expanded_conv/project/Conv2DMobilenetV3/expanded_conv/add; OP Type: Conv2D; Input Shapes: 8,1,112,112,16;1,1,16,16;16;8,1,112,112,16; mac_ratio: 0.063851; vec_ratio: 0.635701; mte2_ratio: 0.980165; mte3_ratio: 0.547420; Hit Rate: 0.989344;
Op Name: MobilenetV3/expanded_conv_1/expand/Conv2DMobilenetV3/expanded_conv_1/expand/Relu; OP Type: Conv2D; Input Shapes: 8,1,112,112,16;1,4,16,16;64; mac_ratio: 0.066542; vec_ratio: 0.608959; mte2_ratio: 0.880165; mte3_ratio: 0.973353; Hit Rate: 0.982700;
Op Name: MobilenetV3/expanded_conv_1/depthwise/depthwiseMobilenetV3/expanded_conv_1/depthwise/Relu; OP Type: DepthwiseConv2D; Input Shapes: 8,4,112,112,16;4,3,3,1,16,16;64; mac_ratio: 0.175088; vec_ratio: 0.119661; mte2_ratio: 0.879285; mte3_ratio: 0.185707; Hit Rate: 0.930730;
Op Name: MobilenetV3/expanded_conv_1/project/Conv2D; OP Type: Conv2D; Input Shapes: 8,4,56,56,16;4,2,16,16;24; mac_ratio: 0.276653; vec_ratio: 0.427420; mte2_ratio: 0.877128; mte3_ratio: 0.668421; Hit Rate: 0.989344;
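Because the Input Shapes field itself uses semicolons to separate the inputs, naively splitting an Op Info line on ";" breaks it apart. The Python sketch below parses one line with a regular expression instead and reports the batch dimension, assuming the first input is a 5-D feature map in NC1HWC0 layout so that its first dimension is the batch size N. parse_op_info and batch_size are hypothetical helpers; only the field names come from the output above.

```python
import re

OP_INFO_RE = re.compile(
    r"Op Name:\s*(?P<name>.+?);\s*"
    r"OP Type:\s*(?P<type>[^;]+);\s*"
    r"Input Shapes:\s*(?P<shapes>.+?);\s*"
    r"mac_ratio:\s*(?P<mac_ratio>[\d.]+);\s*"
    r"vec_ratio:\s*(?P<vec_ratio>[\d.]+);\s*"
    r"mte2_ratio:\s*(?P<mte2_ratio>[\d.]+);\s*"
    r"mte3_ratio:\s*(?P<mte3_ratio>[\d.]+);\s*"
    r"Hit Rate:\s*(?P<hit_rate>[\d.]+);"
)

RATIO_FIELDS = ("mac_ratio", "vec_ratio", "mte2_ratio", "mte3_ratio", "hit_rate")


def parse_op_info(line):
    """Parse one Op Info line into a dict, or return None if it does not match."""
    m = OP_INFO_RE.search(line)
    if m is None:
        return None
    # Input Shapes: inputs are separated by ';', dimensions by ','.
    shapes = [tuple(int(d) for d in s.split(",")) for s in m["shapes"].split(";")]
    info = {"name": m["name"].strip(), "type": m["type"].strip(), "shapes": shapes}
    info.update({k: float(m[k]) for k in RATIO_FIELDS})
    return info


def batch_size(info):
    """Batch dimension N, assuming the first input is an NC1HWC0 feature map."""
    return info["shapes"][0][0]


if __name__ == "__main__":
    line = ("Op Name: MobilenetV3/expanded_conv_1/project/Conv2D; OP Type: Conv2D; "
            "Input Shapes: 8,4,56,56,16;4,2,16,16;24; mac_ratio: 0.276653; "
            "vec_ratio: 0.427420; mte2_ratio: 0.877128; mte3_ratio: 0.668421; "
            "Hit Rate: 0.989344;")
    info = parse_op_info(line)
    print(info["type"], info["shapes"], "batch =", batch_size(info))
```

Operators whose batch size is greater than 1 are the ones the recommendation below would split.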
Recommadation
# Optimization suggestions
1. Open AutoTune
# Enable AutoTune.
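When AutoTune is enabled, every operator in the model whose batch size is not 1 is automatically split into batch-1 operators during computation. This reduces the amount of data each layer's operators process and avoids the case where insufficient L2 cache space forces data to be written back to DDR, which degrades operator performance. For details on using AutoTune, see the "Auto Tune Tool User Guide" chapter in the CANN Development Tool Guide.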
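To see why splitting to batch 1 reduces the data each layer handles, the sketch below computes the size of the feature-map inputs listed in the Op Info above, before and after splitting. It assumes float16 elements (2 bytes) and an NC1HWC0 layout with the batch in the first dimension; neither the data type nor the L2 cache capacity is stated in this output, so the numbers only illustrate the N-fold reduction. tensor_bytes and split_to_batch1 are hypothetical helpers.

```python
from math import prod

BYTES_PER_ELEMENT = 2  # assumption: float16 feature maps


def tensor_bytes(shape, bytes_per_element=BYTES_PER_ELEMENT):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * bytes_per_element


def split_to_batch1(shape):
    """Replace the batch dimension (assumed to be dim 0, as in NC1HWC0) with 1."""
    _, *rest = shape
    return (1, *rest)


if __name__ == "__main__":
    # Feature-map input shapes copied from the Op Info lines above.
    for shape in [(8, 1, 112, 112, 16), (8, 4, 112, 112, 16), (8, 4, 56, 56, 16)]:
        full = tensor_bytes(shape)
        per_batch = tensor_bytes(split_to_batch1(shape))
        print(f"{shape}: {full:,} bytes -> {per_batch:,} bytes after splitting to batch 1")
```

Under these assumptions, the 8,1,112,112,16 input occupies 3,211,264 bytes for the whole batch but only 401,408 bytes per batch-1 slice, and the 8,4,112,112,16 input drops from 12,845,056 bytes to 1,605,632 bytes.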