Fusion Support

During quantization, operator fusion is performed on certain structures in the model. Fusion optimizes the graph structure and thereby improves inference performance. This section describes the fusion function in detail.

  • Caffe:
    • Conv+BN+Scale fusion: Before AMCT-based quantization, each "Convolution+BatchNorm+Scale" structure in the model is fused into a single Convolution layer; the BatchNorm and Scale layers are removed.
    • Deconv+BN+Scale fusion: Before AMCT-based quantization, each "Deconvolution+BatchNorm+Scale" structure in the model is fused into a single Deconvolution layer; the BatchNorm and Scale layers are removed.
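    The effect of this kind of fusion can be sketched numerically: the BatchNorm statistics and Scale parameters fold into the convolution's weights and bias. Below is a minimal numpy sketch; the function name and parameter layout are illustrative, not AMCT's API:

    ```python
    import numpy as np

    def fold_bn_scale_into_conv(W, b, mean, var, gamma, beta, eps=1e-5):
        """Fold BatchNorm (mean, var) and Scale (gamma, beta) into conv parameters.

        W: (Cout, Cin, kh, kw) conv weights, b: (Cout,) conv bias.
        BN followed by Scale computes y' = gamma * (y - mean) / sqrt(var + eps) + beta,
        which is absorbed as W' = W * gamma/std and b' = (b - mean) * gamma/std + beta.
        """
        std = np.sqrt(var + eps)
        W_fold = W * (gamma / std)[:, None, None, None]
        b_fold = (b - mean) * gamma / std + beta
        return W_fold, b_fold
    ```

    The fused convolution then produces the same outputs as the original three-layer chain, which is why the BatchNorm and Scale layers can be removed.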
  • TensorFlow:
    • Conv+BN fusion: Before AMCT-based quantization, each "Conv2D/Conv3D+BatchNorm" structure in the model is fused into the convolution layer; the BatchNorm layer is removed.
    • Depthwise_Conv+BN fusion: Before AMCT-based quantization, each "DepthwiseConv2dNative+BatchNorm" structure in the model is fused; the BatchNorm layer is removed.
    • Conv2DBackpropInput+BN fusion: Before AMCT-based quantization, each "Conv2DBackpropInput+BatchNorm" structure in the model is fused; the BatchNorm layer is removed.
    • Split+Conv+Concat fusion: Before AMCT-based quantization, each group-convolution structure "Split+Conv (multiple)+Bias (optional)+Concat" in the model is fused into a single group convolution, which reduces the computation workload and improves the memory footprint and performance of the quantized model. Here, the Split operator covers Split and SplitV, the convolution operator covers Conv2D, and the Concat operator covers Concat and ConcatV2.

      Requirements: the split_dim attribute of the Split operator and the concat_dim attribute of the Concat operator must both equal the index of the convolution's Cout axis, and all convolution operators in the structure must share the same quantization configuration and operator attributes.
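      The equivalence this fusion exploits can be sketched for 1x1 convolutions, where each per-group convolution reduces to a matrix product and the fused group convolution corresponds to a single block-diagonal matmul. This is a hypothetical numpy illustration, not AMCT code:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      groups, cin_g, cout_g, n = 3, 4, 5, 7
      x = rng.standard_normal((groups * cin_g, n))   # (Cin, spatial points)
      Ws = [rng.standard_normal((cout_g, cin_g)) for _ in range(groups)]

      # Original structure: Split along the channel axis, convolve each group,
      # then Concat the per-group outputs along the Cout axis.
      parts = np.split(x, groups, axis=0)
      y_split = np.concatenate([W @ p for W, p in zip(Ws, parts)], axis=0)

      # Fused structure: one group convolution, modeled here as a single matmul
      # with a block-diagonal weight matrix holding all group weights.
      W_blk = np.zeros((groups * cout_g, groups * cin_g))
      for g, W in enumerate(Ws):
          W_blk[g * cout_g:(g + 1) * cout_g, g * cin_g:(g + 1) * cin_g] = W
      y_fused = W_blk @ x
      ```

      Both forms yield identical outputs, which is why the requirement that every per-group convolution share the same attributes matters: only then does a single fused operator exist.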

  • ONNX:
    • Conv+BN fusion: Before AMCT-based quantization, the "Conv+BatchNormalization" composite in the model is fused into "Conv+BN". The BatchNorm layer is removed.
    • ConvTranspose+BN fusion: Before AMCT-based quantization, the "ConvTranspose+BatchNormalization" composite in the model is fused into "ConvTranspose+BN". The BatchNorm layer is removed.
  • Requant fusion: ReLU6 does not support Requant fusion and must be replaced with ReLU. ReLU6 is identical to ReLU except for an upper cutoff at 6, which resembles the clipping applied during quantization. AMCT therefore replaces "AscendDequant+ReLU6+AscendQuant" with "AscendDequant+ReLU+AscendQuant" whenever the two are numerically equivalent.

    The replacement is performed only when (127 – offset)/scale < 6, where scale and offset are the quantization parameters extracted from the AscendQuant operator. Under this condition, the largest value representable after quantization is below 6, so the upper clip of ReLU6 can never take effect and ReLU produces identical results.
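    The condition can be verified by brute force over all int8 codes. The quantization model used below (q = round(x·scale) + offset, clipped to [-128, 127]) is an assumption for illustration, not the exact AscendQuant definition:

    ```python
    import numpy as np

    def relu6_replaceable(scale, offset):
        """Condition from the text: replacement is valid when (127 - offset)/scale < 6."""
        return (127 - offset) / scale < 6

    def quant(x, scale, offset):
        # Illustrative quantization model (assumption, not the exact AscendQuant spec).
        return np.clip(np.round(x * scale) + offset, -128, 127)

    def dequant(q, scale, offset):
        return (q - offset) / scale

    scale, offset = 32.0, 0.0        # (127 - 0)/32 < 6, so ReLU6 -> ReLU is safe
    q = np.arange(-128, 128, dtype=np.float64)
    x = dequant(q, scale, offset)
    via_relu = quant(np.maximum(x, 0.0), scale, offset)      # ReLU path
    via_relu6 = quant(np.clip(x, 0.0, 6.0), scale, offset)   # ReLU6 path
    # When the condition holds, the two paths agree for every int8 code.
    ```

    With a smaller scale (e.g. 10.0), values above 6 become representable, the ReLU6 clip takes effect, and the two paths diverge, so the replacement is not applied.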

  • BatchNorm+Mul+Add fusion (applicable to the TensorFlow and ONNX frameworks)

    Before quantization, AMCT fuses the "BatchNorm+Mul" structure in the model, removing the Mul layer, and then fuses the "BatchNorm+Add" structure, removing the Add layer.
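    The algebra behind this fusion is a per-channel reparameterization: BN(x)·m + a equals a BatchNorm with scale gamma·m and offset beta·m + a. A minimal numpy sketch (function names are illustrative):

    ```python
    import numpy as np

    def bn(x, mean, var, gamma, beta, eps=1e-5):
        """Inference BatchNorm: gamma * (x - mean) / sqrt(var + eps) + beta."""
        return gamma * (x - mean) / np.sqrt(var + eps) + beta

    def fold_mul_add_into_bn(gamma, beta, m, a):
        """Absorb a following per-channel Mul (by m) and Add (of a) into BN:
        BN(x) * m + a == BN'(x) with gamma' = gamma * m and beta' = beta * m + a."""
        return gamma * m, beta * m + a
    ```

    Because the Mul and Add parameters are fully absorbed into the BN scale and offset, the Mul and Add layers carry no remaining information and can be deleted.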

  • Small BN operators are fused into a large BatchNorm operator (applicable to the TensorFlow and ONNX frameworks).

    AMCT replaces the small BN operators with a matched large BatchNorm operator. The prerequisites for fusing small BN operators are as follows:

    1. The data node in the BN structure must be a quantizable node (Conv2D, DepthwiseConv2D, MatMulV2, or Conv3D).
    2. If any node inside the structure other than the data, scale, offset, mean, variance, and output nodes is connected to a node outside the structure, fusion is not performed.

    For the BN structure without offset, an all-0 offset node is constructed during fusion. For the BN structure without scale, an all-1 scale node is constructed during fusion. The scenarios of small BN operators are as follows:

    The supported scenarios are listed below. The original document illustrates the network structure before and after fusion for each scenario with a figure (omitted here).

    | BN structure | scale   | offset  | is_training | data_format | Input node type |
    |--------------|---------|---------|-------------|-------------|-----------------|
    | Standard     | Present | Absent  | False       | NHWC        | Const           |
    | Standard     | Present | Absent  | False       | NCHW        | Const           |
    | Standard     | Absent  | Absent  | False       | NHWC        | Const           |
    | Standard     | Absent  | Absent  | False       | NCHW        | Const           |
    | Standard     | Absent  | Present | False       | NHWC        | Const           |
    | Standard     | Absent  | Present | False       | NCHW        | Const           |
    | Standard     | Present | Present | False       | NHWC        | Const           |
    | Standard     | Present | Present | False       | NCHW        | Const           |
    | Standard     | Present | Absent  | False       | NHWC        | Variable        |
    | Standard     | Present | Absent  | False       | NCHW        | Variable        |
    | Standard     | Absent  | Absent  | False       | NHWC        | Variable        |
    | Standard     | Absent  | Absent  | False       | NCHW        | Variable        |
    | Standard     | Absent  | Present | False       | NHWC        | Variable        |
    | Standard     | Absent  | Present | False       | NCHW        | Variable        |
    | Standard     | Present | Present | False       | NHWC        | Variable        |
    | Standard     | Present | Present | False       | NCHW        | Variable        |
    | Simplified   | Present | Absent  | -           | NHWC        | Variable        |
    | Simplified   | Present | Absent  | -           | NCHW        | Variable        |
    | Simplified   | Absent  | Absent  | -           | NHWC        | Variable        |
    | Simplified   | Absent  | Absent  | -           | NCHW        | Variable        |
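    The role of the constructed all-1 scale and all-0 offset can be sketched by expressing inference BN with small ops and comparing it against the single fused operator. This is a numpy illustration; the function names are hypothetical:

    ```python
    import numpy as np

    def bn_small_ops(x, mean, var, scale=None, offset=None, eps=1e-3):
        """Inference BN written as the small ops (Rsqrt, Mul, Sub, Add) that get matched.

        When scale or offset is absent, the all-1 / all-0 constants described above
        are constructed so that the structure matches the large BatchNorm operator.
        """
        if scale is None:              # all-1 scale node constructed during fusion
            scale = np.ones_like(mean)
        if offset is None:             # all-0 offset node constructed during fusion
            offset = np.zeros_like(mean)
        rstd = 1.0 / np.sqrt(var + eps)                            # Rsqrt
        return x * (scale * rstd) + (offset - mean * scale * rstd) # Mul + Add chain

    def bn_fused(x, mean, var, scale, offset, eps=1e-3):
        """The single large (fused) inference BatchNorm operator."""
        return scale * (x - mean) / np.sqrt(var + eps) + offset
    ```

    Both forms compute the same values, so replacing the small-op subgraph with one BatchNorm operator changes only the graph structure, not the network output.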

  • Small bn_branch operators are fused into a large BN operator.

    AMCT replaces the small bn_branch operators with a matched large BN operator. Fusion is not performed if any of the following holds inside the structure: the input data sources of the two BN branches differ; the structure output is the output of the training BN; or the is_training attribute of the inference BN is True. During fusion, the large inference BN operator in the original structure is retained, redundant nodes of the small-operator structure (such as Switch) are deleted, and the input and output edges of the retained BN operator are reconnected, fusing the small operators into one large operator. The scenarios of small bn_branch operators are as follows: