Custom Operator Description
Custom operators can be implemented in deep learning frameworks including Caffe, TensorFlow, and MindSpore. The first type of operator is customized from the Caffe framework, such as the ROIPooling, PSROIPooling, Normalize, and Upsample layers; that is, these operators are already defined in Caffe .prototxt files. Operators customized from frameworks other than Caffe must also be given corresponding definitions in .prototxt format.
Reverse
Reverses the dimensions of a tensor, for example, from [1, 2, 3] to [3, 2, 1].
Define the operator as follows:
- Add ReverseParameter to LayerParameter.
message LayerParameter {
...
optional ReverseParameter reverse_param = 157;
...
}
- Define the data types and attributes of ReverseParameter.
message ReverseParameter {
repeated int32 axis = 1;
}
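As a minimal illustration (not the device implementation), the effect of Reverse with a repeated axis attribute can be sketched in Python on plain nested lists:

```python
def reverse_1d(data):
    # Reverse the elements of a 1-D tensor, as in the [1, 2, 3] -> [3, 2, 1] example.
    return data[::-1]

def reverse_2d(data, axes):
    # Reverse a 2-D tensor (list of lists) along each axis listed in `axes`,
    # mirroring the repeated `axis` attribute of ReverseParameter.
    if 0 in axes:
        data = data[::-1]
    if 1 in axes:
        data = [row[::-1] for row in data]
    return data
```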
ROIPooling
The major hurdle in going from image classification to object detection is the fixed-size input required by the network's existing fully connected (FC) layers. In object detection, different proposals have different shapes, so all proposals must be converted to the fixed shape that the FC layers require.
Region of Interest pooling (ROIPooling) uses a single feature map for all generated proposals in a single pass, solving the fixed-input-size problem for object detection networks.
You need to extend the caffe.proto file and define ROIPoolingParameter as follows:
- spatial_scale: multiplicative spatial scale factor to translate ROI coordinates from their input scale to the scale used when pooling
- pooled_h and pooled_w: height and width of the ROI output feature map
- Add ROIPoolingParameter to LayerParameter.
message LayerParameter {
...
optional ROIPoolingParameter roi_pooling_param = 161;
...
}
- Define the data types and attributes of ROIPoolingParameter.
message ROIPoolingParameter {
required int32 pooled_h = 1;
required int32 pooled_w = 2;
optional float spatial_scale = 3 [default=0.0625];
optional float spatial_scale_h = 4;
optional float spatial_scale_w = 5;
}
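The pooling itself can be sketched as follows. This is an illustrative single-channel Python sketch, not the device implementation: the ROI is scaled into feature-map coordinates with spatial_scale, split into pooled_h x pooled_w bins, and each bin is max-pooled.

```python
import math

def roi_pool(feature, roi, pooled_h, pooled_w, spatial_scale=0.0625):
    # feature: 2-D list [H][W] (a single channel); roi: (x1, y1, x2, y2) in input coords.
    # Scale ROI corners into feature-map coordinates, as spatial_scale describes.
    x1 = int(round(roi[0] * spatial_scale)); y1 = int(round(roi[1] * spatial_scale))
    x2 = int(round(roi[2] * spatial_scale)); y2 = int(round(roi[3] * spatial_scale))
    roi_h = max(y2 - y1 + 1, 1); roi_w = max(x2 - x1 + 1, 1)
    out = [[0.0] * pooled_w for _ in range(pooled_h)]
    for ph in range(pooled_h):
        for pw in range(pooled_w):
            # Bin boundaries inside the ROI, rounded outward.
            hs = y1 + int(math.floor(ph * roi_h / pooled_h))
            he = y1 + int(math.ceil((ph + 1) * roi_h / pooled_h))
            ws = x1 + int(math.floor(pw * roi_w / pooled_w))
            we = x1 + int(math.ceil((pw + 1) * roi_w / pooled_w))
            out[ph][pw] = max(feature[y][x] for y in range(hs, he) for x in range(ws, we))
    return out
```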
Example .prototxt definition of ROIPooling:
layer {
name: "roi_pooling"
type: "ROIPooling"
bottom: "res4f"
bottom: "rois"
bottom: "actual_rois_num"
top: "roi_pool"
roi_pooling_param {
pooled_h: 14
pooled_w: 14
spatial_scale:0.0625
spatial_scale_h:0.0625
spatial_scale_w:0.0625
}
}
PSROIPooling
Position Sensitive ROI Pooling (PSROIPooling) works in a similar way to ROIPooling. However, unlike ROIPooling, the feature map output from PSROIPooling is obtained from different feature map channels, and average pooling (instead of max-pooling in ROIPooling) is performed on each divided bin.
PSROIPooling divides the ROI into k*k bins and outputs a k*k feature map. The number of output channels for pooling is the same as the number of input channels.
You need to extend the caffe.proto file and define PSROIPoolingParameter as follows:
- spatial_scale: multiplicative spatial scale factor to translate ROI coordinates from their input scale to the scale used when pooling
- output_dim: number of output channels
- group_size: number of groups to encode position-sensitive score maps, that is, k
- Add PSROIPoolingParameter to LayerParameter.
message LayerParameter {
...
optional PSROIPoolingParameter psroi_pooling_param = 207;
...
}
- Define the data types and attributes of PSROIPoolingParameter.
message PSROIPoolingParameter {
required float spatial_scale = 1;
required int32 output_dim = 2; // output channel number
required int32 group_size = 3; // number of groups to encode position-sensitive score maps
}
Example .prototxt definition of PSROIPooling:
layer {
name: "psroipooling"
type: "PSROIPooling"
bottom: "some_input"
bottom: "some_input"
top: "some_output"
psroi_pooling_param {
spatial_scale: 0.0625
output_dim: 21
group_size: 7
}
}
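An illustrative sketch of the position-sensitive pooling, assuming the whole feature map is a single ROI (the real operator also scales ROI coordinates with spatial_scale, which is omitted here). The channel mapping (c * k + i) * k + j follows the R-FCN convention and is an assumption of this sketch:

```python
def psroi_pool(feature, output_dim, k):
    # feature: 3-D list [C][H][W] with C == output_dim * k * k; H and W divisible by k.
    C = len(feature); H = len(feature[0]); W = len(feature[0][0])
    assert C == output_dim * k * k
    bh, bw = H // k, W // k
    out = [[[0.0] * k for _ in range(k)] for _ in range(output_dim)]
    for c in range(output_dim):
        for i in range(k):
            for j in range(k):
                # Position-sensitive: bin (i, j) reads its own input channel group.
                src = feature[(c * k + i) * k + j]
                vals = [src[y][x] for y in range(i * bh, (i + 1) * bh)
                                  for x in range(j * bw, (j + 1) * bw)]
                out[c][i][j] = sum(vals) / len(vals)  # average pooling, not max
    return out
```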
Upsample
The Upsample layer is the reverse of the Pooling layer. Each decoder upsamples the activations generated by the corresponding encoder.
You need to extend the caffe.proto file and define UpsampleParameter as follows. The stride parameter is the upsampling factor, for example, 2.
- Add UpsampleParameter to LayerParameter.
message LayerParameter {
...
optional UpsampleParameter upsample_param = 160;
...
}
- Define the data types and attributes of UpsampleParameter.
message UpsampleParameter {
optional float scale = 1 [default = 1];
optional int32 stride = 2 [default = 2];
optional int32 stride_h = 3 [default = 2];
optional int32 stride_w = 4 [default = 2];
}
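As a rough sketch, a stride-factor upsample can be illustrated with nearest-neighbor replication. This is a simplification: an Upsample paired with a pooling encoder typically writes each value back to the recorded max-pooling index instead.

```python
def upsample(feature, stride=2):
    # Nearest-neighbor sketch of a stride-factor upsample on a 2-D map:
    # every value is replicated into a stride x stride block.
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(stride)]
        out.extend([wide[:] for _ in range(stride)])
    return out
```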
Example .prototxt definition of Upsample:
layer {
name: "layer86-upsample"
type: "Upsample"
bottom: "some_input"
top: "some_output"
upsample_param {
scale: 1
stride: 2
}
}
Normalize
The Normalize layer is a normalization layer in the SSD network, and is mainly used to normalize elements in a space or a channel to the range [0, 1]. For a c*h*w three-dimensional input tensor, the Normalize layer outputs a tensor of the same size. As shown in the following formula, each element is divided by the square root of the sum of squares accumulated along the channel direction:

y_i = x_i / sqrt(sum_{j=1..c} x_j^2)

where the sum of squares in the denominator is accumulated over the channel vector that shares the same height and width position, as the orange part shown in Figure 1.
After the preceding normalization calculation, the Normalize layer scales each feature map using separate scale factors.
You need to extend the caffe.proto file and define NormalizeParameter as follows:
- across_spatial: a bool. If true, normalization is performed over the entire 1 x c x h x w volume. If false, each pixel is normalized over its 1 x c x 1 x 1 channel vector.
- channel_shared: a bool. If true, the scale parameter is shared across channels. Defaults to true.
- eps: (optional) a small number to avoid division by zero while normalizing. Defaults to 1e-10.
The mathematical formulation of Normalize, including the eps term and the per-channel scale factors, is as follows:

y_i = scale_c * x_i / sqrt(eps + sum_{j=1..c} x_j^2)
Define the operator as follows:
- Add NormalizeParameter to LayerParameter.
message LayerParameter {
...
optional NormalizeParameter norm_param = 206;
...
}
- Define the data types and attributes of NormalizeParameter.
message NormalizeParameter {
optional bool across_spatial = 1 [default = true];
// Initial value of scale. Default is 1.0 for all.
optional FillerParameter scale_filler = 2;
// Whether or not scale parameters are shared across channels.
optional bool channel_shared = 3 [default = true];
// Epsilon for not dividing by zero while normalizing variance.
optional float eps = 4 [default = 1e-10];
}
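The computation for across_spatial = false and channel_shared = false can be sketched as follows (illustrative Python, with one scale factor per channel):

```python
import math

def normalize(feature, scale, eps=1e-10):
    # Per-pixel L2 normalization across channels (across_spatial = false),
    # then per-channel scaling (channel_shared = false): scale has one entry per channel.
    C = len(feature); H = len(feature[0]); W = len(feature[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            norm = math.sqrt(eps + sum(feature[c][y][x] ** 2 for c in range(C)))
            for c in range(C):
                out[c][y][x] = scale[c] * feature[c][y][x] / norm
    return out
```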
Example .prototxt definition of Normalize:
layer {
name: "normalize_layer"
type: "Normalize"
bottom: "some_input"
top: "some_output"
norm_param {
across_spatial: false
scale_filler {
type: "constant"
value: 20
}
channel_shared: false
}
}
Reorg
The Reorg operator is implemented as a PassThrough operator on the Ascend AI Processor; it rearranges blocks of spatial data into depth, or vice versa.
The PassThrough layer is a custom layer in YOLOv2. Because YOLOv2 is not implemented under the Caffe framework, there is no standard Caffe definition for this layer. The PassThrough layer concatenates higher-resolution features with lower-resolution ones by stacking adjacent spatial features into different channels instead of spatial locations.
Define the operator as follows:
- Add ReorgParameter to LayerParameter.
message LayerParameter {
...
optional ReorgParameter reorg_param = 155;
...
}
- Define the data types and attributes of ReorgParameter.
message ReorgParameter {
optional uint32 stride = 2 [default = 2];
optional bool reverse = 1 [default = false];
}
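The space-to-depth rearrangement (reverse = false) can be sketched as follows. The exact channel ordering is implementation-dependent; this sketch uses one plausible ordering:

```python
def reorg(feature, stride=2):
    # Space-to-depth sketch: each stride x stride spatial block becomes
    # stride**2 channels, so (C, H, W) -> (C * stride**2, H/stride, W/stride).
    C = len(feature); H = len(feature[0]); W = len(feature[0][0])
    out = []
    for dy in range(stride):
        for dx in range(stride):
            for c in range(C):
                out.append([[feature[c][y * stride + dy][x * stride + dx]
                             for x in range(W // stride)]
                            for y in range(H // stride)])
    return out
```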
Example .prototxt definition of Reorg:
layer {
bottom: "some_input"
top: "some_output"
name: "reorg"
type: "Reorg"
reorg_param {
stride: 2
}
}
Proposal
The Proposal operator is used to get the accurate proposals based on the foreground of rpn_cls_prob and the refined anchors obtained through bounding box regression of rpn_bbox_pred.
Three operators are used: decoded_bbox, topk, and nms, as shown in Figure 2.
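The topk and nms stages can be sketched as follows (illustrative Python; the decoded_bbox stage, which turns anchors plus rpn_bbox_pred deltas into boxes, is assumed to have produced `boxes` already):

```python
def iou(a, b):
    # a, b: (x1, y1, x2, y2) corner boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(ix2 - ix1, 0.0), max(iy2 - iy1, 0.0)
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter > 0 else 0.0

def proposal(boxes, scores, pre_nms_topn=3000, post_nms_topn=304, iou_threshold=0.7):
    # topk by foreground score, then greedy NMS, mirroring the parameter names
    # of ProposalParameter.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:pre_nms_topn]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
        if len(keep) == post_nms_topn:
            break
    return [boxes[i] for i in keep]
```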
Define the operator as follows:
- Add ProposalParameter to LayerParameter.
message LayerParameter {
...
optional ProposalParameter proposal_param = 201;
...
}
- Define the data types and attributes of ProposalParameter.
message ProposalParameter {
optional float feat_stride = 1 [default = 16];
optional float base_size = 2 [default = 16];
optional float min_size = 3 [default = 16];
repeated float ratio = 4;
repeated float scale = 5;
optional int32 pre_nms_topn = 6 [default = 3000];
optional int32 post_nms_topn = 7 [default = 304];
optional float iou_threshold = 8 [default = 0.7];
optional bool output_actual_rois_num = 9 [default = false];
}
Example .prototxt definition of Proposal:
layer {
name: "faster_rcnn_proposal"
type: "Proposal" # Operator type
bottom: "rpn_cls_prob_reshape"
bottom: "rpn_bbox_pred"
bottom: "im_info"
top: "rois"
top: "actual_rois_num" # Added operator output
proposal_param {
feat_stride: 16
base_size: 16
min_size: 16
pre_nms_topn: 3000
post_nms_topn: 304
iou_threshold: 0.7
output_actual_rois_num: true
}
}
If your network model contains the "Proposal + ROIAlign + service operator" structure, append a Reshape operator to the Proposal operator to change the tensor shape, because Proposal outputs 3D tensors while ROIAlign requires 2D rois input. After the Reshape, however, the coordinates of the rois data fed to the ROIAlign operator are laid out in the wrong order and do not meet the requirements of the ROIAlign operator.
To solve this problem, append a Permute operator to the Reshape operator to perform the transposition.
An example of the modified network structure is as follows.

The following is a code example in the modified .prototxt file:
layer {
name: 'proposal'
type: 'Proposal'
bottom: 'rpn_cls'
bottom: 'rpn_loc'
bottom: 'img_info'
top: 'roi_proposal'
proposal_param {
feat_stride: 16
pre_nms_topn: 1000
post_nms_topn: 16
nms_thresh: 0.7
base_size: 16
min_size: 8
ratio: [0.5, 1.0, 2.0]
scale: [32, 64, 128, 256, 512]
}
}
layer {
name: "Reshape1"
type: "Reshape"
bottom: "roi_proposal"
top: "roi_proposal_reshape"
reshape_param {
shape {
dim: 5
dim: 16
}
}
}
layer {
name: "Permute1"
type: "Permute"
bottom: "roi_proposal_reshape"
top: "roi_proposal_permute"
permute_param {
order: 1
order: 0
}
}
layer {
name: "align"
type: "ROIAlign"
bottom: "111"
bottom: "roi_proposal_permute"
top: "align"
roi_align_param {
pooled_w: 14
pooled_h: 14
spatial_scale: 0.0625
}
}
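The shape bookkeeping done by the appended Reshape and Permute layers can be sketched as follows: with the dims in the example, Reshape produces a 5 x 16 matrix (one row per coordinate component), and the Permute with order {1, 0} transposes it to 16 x 5 so that each row is one ROI:

```python
def reshape_2d(flat, rows, cols):
    # Reshape a flat list into a rows x cols matrix (row-major), like the
    # Reshape layer with shape { dim: rows dim: cols }.
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def permute_10(mat):
    # Permute with order { 1, 0 }: a plain 2-D transpose.
    return [list(col) for col in zip(*mat)]
```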
ROIAlign
ROIAlign is a regional feature aggregation method proposed in Mask R-CNN. It solves the misalignment problem caused by the two quantization operations in the ROIPooling operation.
The size of the pooled feature map is pooled_w x pooled_h, and each bin of an ROI is sampled at sampling_ratio x sampling_ratio points. As shown in Figure 3, the dashed lines indicate the feature map, and the solid lines indicate an ROI, which is divided into 2 x 2 bins. Assuming four sampling points per bin, each bin is divided equally into four grids, and the center of each grid is taken as a sampling point. The pixel value at each sampling point (denoted by the four arrows in Figure 3) is calculated by bilinear interpolation. Finally, the four sampled values are averaged to produce the ROIAlign result for the bin.
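The bilinear sampling step can be sketched as follows (illustrative Python for a single channel and a single bin; not the device implementation):

```python
import math

def bilinear(feature, y, x):
    # Bilinearly interpolate a 2-D feature map at fractional position (y, x).
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, len(feature) - 1), min(x0 + 1, len(feature[0]) - 1)
    ly, lx = y - y0, x - x0
    return (feature[y0][x0] * (1 - ly) * (1 - lx) + feature[y0][x1] * (1 - ly) * lx +
            feature[y1][x0] * ly * (1 - lx) + feature[y1][x1] * ly * lx)

def roi_align_bin(feature, y1, x1, y2, x2, samples=2):
    # Average `samples x samples` bilinearly interpolated points inside one bin.
    vals = []
    for i in range(samples):
        for j in range(samples):
            py = y1 + (y2 - y1) * (i + 0.5) / samples
            px = x1 + (x2 - x1) * (j + 0.5) / samples
            vals.append(bilinear(feature, py, px))
    return sum(vals) / len(vals)
```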
Define the operator as follows:
- Add ROIAlignParameter to LayerParameter.
message LayerParameter {
...
optional ROIAlignParameter roi_align_param = 154;
...
}
- Define the data types and attributes of ROIAlignParameter.
message ROIAlignParameter {
// Pad, kernel size, and stride are all given as a single value for equal
// dimensions in height and width or as Y, X pairs.
optional uint32 pooled_h = 1 [default = 0]; // The pooled output height
optional uint32 pooled_w = 2 [default = 0]; // The pooled output width
// Multiplicative spatial scale factor to translate ROI coords from their
// input scale to the scale used when pooling.
optional float spatial_scale = 3 [default = 1];
optional int32 sampling_ratio = 4 [default = -1];
optional int32 roi_end_mode = 5 [default = 0];
}
You can customize the .prototxt file based on the preceding data types and attributes.
ShuffleChannel
ShuffleChannel groups and permutes data in the channel dimension of the input.
For example, if channel = 4 and group = 2, ShuffleChannel swaps channel[1] and channel[2], turning channels [c0, c1, c2, c3] into [c0, c2, c1, c3].
Define the operator as follows:
- Add ShuffleChannelParameter to LayerParameter.
message LayerParameter {
...
optional ShuffleChannelParameter shuffle_channel_param = 159;
...
}
- Define the data types and attributes of ShuffleChannelParameter.
message ShuffleChannelParameter {
optional uint32 group = 1 [default = 1]; // Group number
}
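The shuffle is equivalent to viewing the C channels as a (group, C/group) matrix and transposing it. An illustrative sketch, operating on a list of channel indices:

```python
def shuffle_channel(channels, group):
    # View C channels as (group, C // group), transpose to (C // group, group),
    # then flatten back into a channel list.
    C = len(channels)
    assert C % group == 0
    per = C // group
    return [channels[g * per + i] for i in range(per) for g in range(group)]
```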
Example .prototxt definition of ShuffleChannel:
layer {
name: "layer_shuffle"
type: "ShuffleChannel"
bottom: "some_input"
top: "some_output"
shuffle_channel_param {
group: 3
}
}
YOLO
The YOLO operator was introduced in the YOLOv2 network and applies only to the YOLOv2 and YOLOv3 networks. It performs sigmoid and softmax operations on the input.
- In YOLOv2, there are four scenarios based on the background and softmax parameters:
- background = false, softmax = true:
sigmoid is performed on (x, y) in (x, y, h, w), sigmoid is performed on b, and softmax is performed on classes.
- background = false, softmax = false:
sigmoid is performed on (x, y) in (x, y, h, w), sigmoid is performed on b, and sigmoid is performed on classes.
- background = true, softmax = false:
sigmoid is performed on (x, y) in (x, y, h, w), b is ignored, and sigmoid is performed on classes.
- background = true, softmax = true:
sigmoid is performed on (x, y) in (x, y, h, w), and softmax is performed on b and classes.
- In YOLOv3, there is only one scenario: sigmoid is performed on (x, y) in (x, y, h, w), as well as b and classes.
The input shape is Tensor(n, coords + background + classes, l.h, l.w), where n indicates the number of anchor boxes and coords indicates x, y, w, and h.
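The YOLOv3 scenario can be sketched as follows, operating on the raw predictions of a single anchor box rather than the full tensor (illustrative only):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(vs):
    # Numerically stable softmax (used by the YOLOv2 softmax = true scenarios).
    m = max(vs)
    e = [math.exp(v - m) for v in vs]
    s = sum(e)
    return [v / s for v in e]

def yolo_v3(x, y, w, h, b, classes):
    # YOLOv3 scenario: sigmoid on (x, y), on the objectness score b,
    # and on every class score; (w, h) pass through unchanged.
    return (sigmoid(x), sigmoid(y), w, h, sigmoid(b), [sigmoid(c) for c in classes])
```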
Define the operator as follows:
- Add YoloParameter to LayerParameter.
message LayerParameter {
...
optional YoloParameter yolo_param = 199;
...
}
- Define the data types and attributes of YoloParameter.
message YoloParameter {
optional int32 boxes = 1 [default = 3];
optional int32 coords = 2 [default = 4];
optional int32 classes = 3 [default = 80];
optional string yolo_version = 4 [default = "V3"];
optional bool softmax = 5 [default = false];
optional bool background = 6 [default = false];
optional bool softmaxtree = 7 [default = false];
}
Example .prototxt definition of YOLO:
layer {
bottom: "layer82-conv"
top: "yolo1_coords"
top: "yolo1_obj"
top: "yolo1_classes"
name: "yolo1"
type: "Yolo"
yolo_param {
boxes: 3
coords: 4
classes: 80
yolo_version: "V3"
softmax: true
background: false
}
}
PriorBox
Prior boxes are generated according to the layer parameters.
The following uses conv7_2_mbox_priorbox as an example. The definition is as follows:
layer{
name:"conv7_2_mbox_priorbox"
type:"PriorBox"
bottom:"conv7_2"
bottom:"data"
top:"conv7_2_mbox_priorbox"
prior_box_param{
min_size:162.0
max_size:213.0
aspect_ratio:2
aspect_ratio:3
flip:true
clip:false
variance:0.1
variance:0.1
variance:0.2
variance:0.2
img_size:300
step:64
offset:0.5
}
}
- A prior box is generated when the width and height are both min_size.
- If max_size is specified, an additional box whose width and height are sqrt(min_size*max_size) is generated (max_size > min_size).
- Prior boxes are also generated based on aspect_ratio. Specifically, if aspect_ratio is defined as 2 or 3 and flip is set to true, a prior box with an aspect ratio of 1/2 or 1/3 is automatically added.
Therefore, num_priors (number of prior boxes per cell) = (number of min_size values, 1 in this case) + (number of aspect ratios after flipping, 4 in this case) x (number of min_size values) + (number of max_size values, 1 in this case) = 6 in this example.
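The enumeration above can be sketched as follows (illustrative; the order in which boxes are emitted is an assumption of this sketch and implementation-dependent):

```python
import math

def prior_box_sizes(min_sizes, max_sizes, aspect_ratios, flip=True):
    # Enumerate the (width, height) of each prior generated for one cell.
    boxes = []
    for i, ms in enumerate(min_sizes):
        boxes.append((ms, ms))                    # square min_size box
        if i < len(max_sizes):
            s = math.sqrt(ms * max_sizes[i])      # sqrt(min_size * max_size) box
            boxes.append((s, s))
        for ar in aspect_ratios:
            ratios = [ar, 1.0 / ar] if flip else [ar]
            for r in ratios:                      # aspect-ratio boxes, flipped if requested
                boxes.append((ms * math.sqrt(r), ms / math.sqrt(r)))
    return boxes
```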
Define the operator as follows:
- Add PriorBoxParameter to LayerParameter.
message LayerParameter {
...
optional PriorBoxParameter prior_box_param = 203;
...
}
- Define the data types and attributes of PriorBoxParameter.
message PriorBoxParameter {
// Encode/decode type.
enum CodeType {
CORNER = 1;
CENTER_SIZE = 2;
CORNER_SIZE = 3;
}
// Minimum box size (in pixels). Required!
repeated float min_size = 1;
// Maximum box size (in pixels). Required!
repeated float max_size = 2;
// Various aspect ratios. Duplicate ratios are ignored.
// If none is provided, the default ratio 1 is used.
repeated float aspect_ratio = 3;
// If true, flips each aspect ratio. For example, for aspect ratio "r",
// the aspect ratio "1.0/r" is generated as well.
optional bool flip = 4 [default = true];
// If true, clips the prior so that it is within [0, 1].
optional bool clip = 5 [default = false];
// Variance for adjusting the prior bboxes.
repeated float variance = 6;
// By default, img_height, img_width, step_x, and step_y are calculated based on
// bottom[0] (feat) and bottom[1] (img), unless these values are explicitly provided.
// Explicitly provide the img_size.
optional uint32 img_size = 7;
// Either img_size or img_h/img_w should be specified; not both.
optional uint32 img_h = 8;
optional uint32 img_w = 9;
// Explicitly provide the step size.
optional float step = 10;
// Either step or step_h/step_w should be specified; not both.
optional float step_h = 11;
optional float step_w = 12;
// Offset to the top-left corner of each cell.
optional float offset = 13 [default = 0.5];
}
Example .prototxt definition of PriorBox:
layer {
name: "layer_priorbox"
type: "PriorBox"
bottom: "some_input"
bottom: "some_input"
top: "some_output"
prior_box_param {
min_size: 30.0
max_size: 60.0
aspect_ratio: 2
flip: true
clip: false
variance: 0.1
variance: 0.1
variance: 0.2
variance: 0.2
step: 8
offset: 0.5
}
}
SpatialTransformer
This operator performs an affine transformation during computation. If only one set of affine transformation parameters is needed, define the parameters in the .prototxt file and they are reused across batches. Alternatively, supply dynamic parameters as the second input of the operator layer, in which case each batch uses its own parameters.
The procedure is as follows:
- Convert the output coordinates into values in the range of [–1, 1] by using the following formulas.

The corresponding code is as follows:
Dtype* data = output_grid.mutable_cpu_data();
for(int i = 0; i < output_H_ * output_W_; ++i) {
  data[3 * i] = (i / output_W_) * 1.0 / output_H_ * 2 - 1;
  data[3 * i + 1] = (i % output_W_) * 1.0 / output_W_ * 2 - 1;
  data[3 * i + 2] = 1;
}
- Perform affine transformation to convert the output coordinates into input coordinates. In the following formula, s indicates input coordinates, and t indicates output coordinates.

The corresponding code is as follows:
caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans, output_H_ * output_W_, 2, 3,
    (Dtype)1., output_grid_data, full_theta_data + 6 * i, (Dtype)0., coordinates);
- Obtain the value at a specific position based on the input coordinates and assign it to the corresponding output position. The output coordinates were converted in step 1, so the input coordinates must be converted back in the same way. The following is a code example.
Dtype x = (px + 1) / 2 * H;
Dtype y = (py + 1) / 2 * W;
if(debug) std::cout<<prefix<<"(x, y) = ("<<x<<", "<<y<<")"<<std::endl;
for(int m = floor(x); m <= ceil(x); ++m)
  for(int n = floor(y); n <= ceil(y); ++n) {
    if(debug) std::cout<<prefix<<"(m, n) = ("<<m<<", "<<n<<")"<<std::endl;
    if(m >= 0 && m < H && n >= 0 && n < W) {
      // Bilinear interpolation weight; note the corrected parenthesization.
      res += (1 - abs(x - m)) * (1 - abs(y - n)) * pic[m * W + n];
      if(debug) std::cout<<prefix<<"pic[m * W + n] = "<<pic[m * W + n]<<std::endl;
    }
  }
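The three steps can be combined into a single illustrative Python sketch of the forward sampling. Here theta is the 2 x 3 affine matrix as six row-major floats; the sketch mirrors the C++ fragments above but is not the exact device implementation:

```python
import math

def spatial_transformer(pic, H, W, theta, out_h, out_w):
    # pic: flat list of H*W values (one channel); theta: 6 floats (2x3, row-major).
    out = []
    for i in range(out_h * out_w):
        # Step 1: normalized output coordinates in [-1, 1] (h-coordinate first,
        # matching the output_grid layout in the C++ code).
        gy = (i // out_w) * 1.0 / out_h * 2 - 1
        gx = (i % out_w) * 1.0 / out_w * 2 - 1
        # Step 2: affine transform back to normalized input coordinates.
        px = theta[0] * gy + theta[1] * gx + theta[2]
        py = theta[3] * gy + theta[4] * gx + theta[5]
        # Step 3: map to pixel coordinates and bilinearly interpolate.
        x = (px + 1) / 2 * H
        y = (py + 1) / 2 * W
        res = 0.0
        for m in range(int(math.floor(x)), int(math.ceil(x)) + 1):
            for n in range(int(math.floor(y)), int(math.ceil(y)) + 1):
                if 0 <= m < H and 0 <= n < W:
                    res += (1 - abs(x - m)) * (1 - abs(y - n)) * pic[m * W + n]
        out.append(res)
    return out
```

With the identity transform, the sketch reproduces the input map unchanged.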
Define the operator as follows:
- Add SpatialTransformParameter to LayerParameter.
message LayerParameter {
...
optional SpatialTransformParameter spatial_transform_param = 153;
...
}
- Define the SpatialTransformParameter class and attribute parameters.
message SpatialTransformParameter {
optional uint32 output_h = 1 [default = 0];
optional uint32 output_w = 2 [default = 0];
optional float border_value = 3 [default = 0];
repeated float affine_transform = 4;
enum Engine {
DEFAULT = 0;
CAFFE = 1;
CUDNN = 2;
}
optional Engine engine = 15 [default = DEFAULT];
}
Example .prototxt definition of SpatialTransform:
layer {
name: "st_1"
type: "SpatialTransformer"
bottom: "data"
bottom: "theta"
top: "transformed"
st_param {
to_compute_dU: false
theta_1_1: -0.129
theta_1_2: 0.626
theta_2_1: 0.344
theta_2_2: 0.157
}
}



