Model Architectures

YOLOv3

  • The acceptable model has output tensors similar to those of YOLOv3. There are three output tensors (YOLOv3-Tiny has two output tensors, in which case the YOLO_TYPE parameter needs to be set to 2), which are the feature layers after downsampling by 8x, 16x, and 32x.
  • The first dimension of each output tensor equals the maximum batch size supported by the model. The shape of the output tensor varies slightly between NHWC and NCHW. W and H equal the width and height of the model input divided by 8, 16, and 32, respectively. C equals anchorDim = 3 prior boxes x (4 box coordinates + 1 box confidence + 80 classes) = 255.
  • When MODEL_TYPE is set to 0, NHWC is used. When MODEL_TYPE is set to 1, NCHW is used.
Figure 1 NHWC
Figure 2 NCHW
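
Given the rules above, the expected output shapes can be computed directly. The sketch below is illustrative only (the helper name is not an SDK API); the usage comment assumes a 416 x 416 input.

```python
# Illustrative helper (not an SDK API): compute the expected YOLOv3
# output tensor shapes for a given input size and layout.
def yolov3_output_shapes(batch, in_h, in_w, num_classes=80, layout="NHWC"):
    # 3 prior boxes x (4 box coordinates + 1 box confidence + classes)
    anchor_dim = 3 * (4 + 1 + num_classes)
    shapes = []
    for stride in (8, 16, 32):
        h, w = in_h // stride, in_w // stride
        if layout == "NHWC":          # MODEL_TYPE = 0
            shapes.append((batch, h, w, anchor_dim))
        else:                         # MODEL_TYPE = 1, NCHW
            shapes.append((batch, anchor_dim, h, w))
    return shapes

# For a 416 x 416 NHWC input the three shapes are
# (1, 52, 52, 255), (1, 26, 26, 255), (1, 13, 13, 255).
```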

FasterRCNN

  • Two types of models are supported: the native Faster R-CNN model on GitHub and a tailored model from which the NMS operator has been removed.
  • Set NMS_FINISHED to 1 for the former (native model) and to 0 for the latter (tailored model).
  • Native model

    There are four output tensors, which are the number of objects, confidence, coordinate box, and class ID.

    Figure 3 Native Faster R-CNN model
  • Model after the NMS is tailored

    There are three output tensors, which are the number of objects, possible coordinate boxes of objects of each class, and confidence of object frames of each class.

    Figure 4 Faster R-CNN model after the NMS is tailored
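
As a rough sketch of how the four native-model outputs fit together, the function below filters detections by confidence. The argument order, threshold, and result layout are illustrative assumptions, not a fixed SDK contract.

```python
import numpy as np

# Hedged sketch: combine the four output tensors of the native model
# (object count, confidences, coordinate boxes, class IDs). The argument
# order and result layout are illustrative assumptions.
def parse_native_outputs(num_objects, confidences, boxes, class_ids, score_thresh=0.5):
    detections = []
    for i in range(int(num_objects[0])):
        if confidences[i] >= score_thresh:
            detections.append({
                "box": boxes[i].tolist(),        # (x0, y0, x1, y1)
                "score": float(confidences[i]),
                "class_id": int(class_ids[i]),
            })
    return detections
```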

SSD MobileNet v1 FPN

Similar to the native Faster R-CNN model, SSD MobileNet v1 FPN has four output tensors, which are the number of objects, confidence, coordinate box, and class ID.

Figure 5 SSD MobileNet v1 FPN

SSD-VGG16

SSD-VGG16 has two output tensors. The first is the number of objects, and the second is the object frame information [batch, keep_top_k, 8], where the 8 values are batchID, label (classID), score (class probability), xmin, ymin, xmax, ymax, and a reserved (null) field.

Figure 6 SSD-VGG16
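
A minimal sketch of reading the second output tensor, assuming the 8-value field order given above (the function name and threshold are illustrative):

```python
import numpy as np

# Minimal sketch: read the [batch, keep_top_k, 8] tensor for batch 0,
# assuming the field order (batchID, classID, score, xmin, ymin, xmax,
# ymax, reserved) described above. Name and threshold are illustrative.
def decode_ssd_vgg16(num_objects, frames, score_thresh=0.3):
    results = []
    for row in frames[0][: int(num_objects[0])]:
        class_id, score = int(row[1]), float(row[2])
        if score >= score_thresh:
            results.append((class_id, score, tuple(row[3:7].tolist())))
    return results
```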

CRNN

This model has only one output tensor. The first dimension is batch size, and the second dimension is the upper limit of the number of objects that can be detected, indicating the class ID (including the placeholder) of each object identified by the model.

Figure 7 CRNN
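
Assuming the placeholder behaves like a CTC blank (typical for CRNN, though not stated explicitly here), one output row can be turned into text by collapsing consecutive repeats and dropping the placeholder. The blank ID and character table below are illustrative:

```python
# Illustrative sketch: collapse one output row into a string, assuming
# the placeholder acts like a CTC blank. The blank ID (0) and the
# character table are assumptions for the example only.
def ctc_greedy_decode(ids, charset, blank_id=0):
    chars = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            chars.append(charset[i])
        prev = i
    return "".join(chars)

# ctc_greedy_decode([0, 1, 1, 0, 2, 2, 3], {1: "c", 2: "a", 3: "t"}) -> "cat"
```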

ResNet-50

This model has one output tensor. The first dimension is the batch size, and the second dimension equals the number of classes; the values are the result of the softmax operation on the model's feature layer. The class ID of the prediction is the index of the class with the highest probability.

Figure 8 ResNet-50
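
A minimal sketch of recovering the top class from the [batch, num_classes] output. Since exp is monotonic, the normalization below leaves the argmax unchanged even if the model output is already softmaxed:

```python
import numpy as np

# Minimal sketch: top-1 class and its probability from a
# [batch, num_classes] output. exp is monotonic, so normalizing again
# leaves the argmax unchanged even if softmax was already applied.
def top1(scores):
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    class_ids = probs.argmax(axis=1)
    return class_ids, probs[np.arange(len(probs)), class_ids]
```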

YOLOv4

Similar to YOLOv3, YOLOv4 has three output tensors, which are feature layers after 8x, 16x, and 32x downsampling, respectively.

Figure 9 YOLOv4

YOLOv5

  • YOLOv5 has three output tensors, which are feature layers after 8x, 16x, and 32x downsampling, respectively.
  • The output tensors are arranged in the N(C0)HW(C1) format. W and H equal the width and height of the model input divided by 8, 16, and 32, respectively. The combined channel dimension equals anchorDim = 3 prior boxes x (4 box coordinates + 1 box confidence + 80 classes) = 255.
Figure 10 YOLOv5
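
For reference, the upstream YOLOv5 repository decodes one cell of such a feature layer with the 2x - 0.5 and (2x)^2 formulas sketched below; whether this particular model uses exactly these formulas is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of the upstream YOLOv5 decode for one cell of an
# (N, 3, H, W, 5 + classes) layer; raw is the 85-value vector of one
# prior box at grid cell (gx, gy). Whether this model uses exactly
# these formulas is an assumption.
def decode_yolov5_cell(raw, gx, gy, stride, anchor_wh):
    p = sigmoid(raw)
    cx = (p[0] * 2.0 - 0.5 + gx) * stride      # box center x, in pixels
    cy = (p[1] * 2.0 - 0.5 + gy) * stride      # box center y, in pixels
    w = (p[2] * 2.0) ** 2 * anchor_wh[0]       # box width
    h = (p[3] * 2.0) ** 2 * anchor_wh[1]       # box height
    conf = p[4] * p[5:].max()                  # objectness x best class score
    return cx, cy, w, h, conf, int(p[5:].argmax())
```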

FasterRCNN-FPN/CascadeRCNN-FPN

The model has two output tensors. The first, of shape 5 x 100, contains the bounding boxes and confidences (x0, y0, x1, y1, confidence), where the coordinates are the upper-left and lower-right corners of each bounding box. The second, of shape 1 x 100, is the score of each class. The input is an RGB image with a fixed size of 3 x 1216 x 1216.

Figure 11 FasterRCNN-FPN/CascadeRCNN-FPN

CTPN (TensorFlow)

The CTPN (TensorFlow) model has two output tensors. The first, of shape 38 x 67 x 40, holds the small bounding boxes: 10 boxes of 4 coordinates each for every position of the 38 x 67 feature map. The second, of shape 38 x 67 x 20, holds the prediction scores: 10 anchors with 2 scores each for every position. The input is an RGB image with a fixed size of 3 x 608 x 1072.

Figure 12 CTPN (TensorFlow)
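
The per-anchor structure can be made explicit with a reshape (a minimal sketch, assuming the single-image tensors described above):

```python
import numpy as np

# Minimal sketch: expose the per-anchor structure of the two outputs.
# 38 x 67 x 40 -> 10 boxes of 4 coordinates per feature-map position;
# 38 x 67 x 20 -> 10 anchors with 2 scores per position.
def split_ctpn_outputs(boxes, scores):
    return boxes.reshape(38, 67, 10, 4), scores.reshape(38, 67, 10, 2)
```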

CTPN (MindSpore)

The CTPN (MindSpore) model has two output tensors. The first, of shape 1000 x 5, holds 1000 small bounding boxes, each with four coordinates and a score. The second, of length 1000, is the class of each box: 1 for foreground and 0 for background. The input is an RGB image with a fixed size of 3 x 576 x 960.

Figure 13 CTPN (MindSpore)

ResNet-18+

The input of the ResNet-18+ model is a tensor of size 1 x 408 x 64 x 3. The output is a tensor of size 1 x 2, indicating the classification probability of each sample.

Figure 14 ResNet-18+

BERT-Base-Uncased

The BERT-Base (Uncased) model has three input tensors with the shape of 1 x 128, where 1 indicates the batch size and 128 indicates the sentence length.

The model output has a tensor with the shape of 1 x 2, indicating the probability of each class.

Figure 15 BERT-Base-Uncased
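
Standard BERT-Base takes token IDs, an attention mask, and segment (token type) IDs; the document does not name the three tensors, so that mapping, and the special-token IDs below, are assumptions:

```python
import numpy as np

# Hedged sketch: build three 1 x 128 tensors for already-tokenized text.
# Standard BERT-Base expects token IDs, an attention mask, and segment
# IDs; the mapping to this model's three tensors, and the [CLS]/[SEP]/
# padding IDs below, are assumptions.
def make_bert_inputs(token_ids, max_len=128, cls_id=101, sep_id=102, pad_id=0):
    ids = [cls_id] + list(token_ids)[: max_len - 2] + [sep_id]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))
    segments = [0] * max_len                   # single-sentence input
    return np.array([ids]), np.array([mask]), np.array([segments])
```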

DeeplabV3+ (TensorFlow)

The input of the DeeplabV3+ (TensorFlow) model is a tensor of size 1 x 513 x 513 x 3, obtained from an original RGB image of dynamic shape. The model has one output tensor in NHWC layout, of size 1 x 513 x 513 x 21, representing the classification probability of each pixel.

Figure 16 DeeplabV3+ (TensorFlow)

DeepLabV3 (MindSpore)

The DeepLabV3 (MindSpore) model input is in NHWC format, and the output is in NCHW format.

Figure 17 DeepLabV3 (MindSpore)

DeepLabV3 (PyTorch)

The DeepLabV3 (PyTorch) model input is in NHWC format, and the output is in NCHW format.

Figure 18 DeepLabV3 (PyTorch)

U-Net (MindSpore)

The output tensor of the U-Net (MindSpore) model is in NCHW format, where N is 1 and C is 2. It serves as the input of the SDK post-processing, which proceeds as follows:

  1. Perform argmax on the C channels to obtain the index of the maximum probability value at each position, producing a two-dimensional array whose values are 0 and 1.
  2. Check whether the H x W of the tensor output by the model matches the size of the original input image. If it does, the two-dimensional array after argmax is output directly. Otherwise, nearest-neighbor interpolation is performed to resize it to the input image size.
Figure 19 U-Net (MindSpore)
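
The two post-processing steps can be sketched as follows (the nearest-neighbor resize here uses simple index scaling):

```python
import numpy as np

# Sketch of the two steps above for a (1, 2, H, W) output: argmax over
# the C channels, then nearest-neighbor resizing (by index scaling) when
# H x W differ from the original image size.
def unet_postprocess(output, orig_h, orig_w):
    mask = output[0].argmax(axis=0)            # (H, W) array of 0s and 1s
    h, w = mask.shape
    if (h, w) == (orig_h, orig_w):
        return mask
    rows = np.arange(orig_h) * h // orig_h     # nearest-neighbor row indices
    cols = np.arange(orig_w) * w // orig_w
    return mask[rows][:, cols]
```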

Mask R-CNN (TensorFlow)

The input tensor of the Mask R-CNN (TensorFlow) model is in the 4-dimensional NHWC (1 x 480 x 640 x 3) layout format.

  • N indicates the number of batches.
  • H indicates the height of the input image (480 pixels).
  • W indicates the width of the input image (640 pixels).
  • C indicates the number of image channels (3).

The model has five output tensors (tensor[0] to tensor[4]).

  • tensor[0] has only one dimension. The length of the first dimension is 1, indicating the number of objects detected by the model.
  • tensor[1] has two dimensions (1 x 100). The length of the first dimension is 1, indicating the number of batches. The length of the second dimension is 100, indicating the confidence scores of the top 100 objects.
  • tensor[2] has three dimensions (1 x 100 x 4). The length of the first dimension is 1, indicating the number of batches. The length of the second dimension is 100, indicating the top 100 object frames. The length of the third dimension is 4, indicating the coordinates (x0, y0, x1, y1) of the upper-left and lower-right corners of the object frame.
  • tensor[3] has four dimensions (1 x 100 x 33 x 33). The length of the first dimension is 1, indicating the number of batches. The length of the second dimension is 100, indicating the top 100 object frames. The third and fourth dimensions represent a 33 x 33 mask image.
  • tensor[4] has two dimensions (1 x 100). The length of the first dimension is 1, indicating the number of batches. The length of the second dimension is 100, indicating the classification classes of the top 100 objects.
Figure 20 Mask R-CNN (TensorFlow)
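
A hedged sketch of assembling per-object results from tensor[0] through tensor[4]; the score threshold, the mask binarization, and the returned structure are illustrative choices:

```python
import numpy as np

# Hedged sketch: assemble per-object results from tensor[0]..tensor[4].
# The score threshold, the 0.5 mask binarization, and the returned
# structure are illustrative choices.
def collect_maskrcnn_results(tensors, score_thresh=0.5):
    results = []
    for i in range(int(tensors[0][0])):        # number of detected objects
        score = float(tensors[1][0, i])
        if score < score_thresh:
            continue
        results.append({
            "box": tensors[2][0, i].tolist(),  # (x0, y0, x1, y1)
            "mask": tensors[3][0, i] > 0.5,    # 33 x 33 boolean mask
            "class_id": int(tensors[4][0, i]),
            "score": score,
        })
    return results
```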

FaceNet (TensorFlow)

The shape of the FaceNet (TensorFlow) input tensor is NHWC (1 x 160 x 160 x 3), and the data type is UINT8.

  • N indicates the number of batches.
  • H indicates the height of the input image (160 pixels).
  • W indicates the width of the input image (160 pixels).
  • C indicates the number of image channels.

The output tensor is the feature vector corresponding to the object image. The shape is 1 x 512. The first dimension indicates the number of batches, and the second dimension indicates the length of the feature vector (512). The data type is FLOAT32.

Figure 21 FaceNet (TensorFlow)
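
Two embeddings are typically compared by distance. Whether Euclidean or cosine distance is appropriate, and what threshold counts as a match, depend on how the embeddings were trained; both choices below are assumptions:

```python
import numpy as np

# Sketch: compare two 512-dimensional embeddings by Euclidean distance
# after L2 normalization. The distance metric and the match threshold
# depend on how the embeddings were trained; both are assumptions here.
def face_match(emb_a, emb_b, threshold=1.1):
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    dist = float(np.linalg.norm(a - b))
    return dist, dist < threshold
```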

SSD MobileNet v1 FPN (MindSpore)

The SSD MobileNet v1 FPN (MindSpore) model has two output tensors, which are the coordinate box and confidence.

Figure 22 SSD MobileNet v1 FPN (MindSpore)

OpenPose

The OpenPose output tensor is [batch, outputHeight, outputWidth, channel], where outputHeight and outputWidth are the height and width of the output feature map. The channels consist of two parts: the first third is the heatmap, and the last two thirds are the PAF (part affinity field) map. The output of this model is [1, 54, 46, 57].

Figure 23 OpenPose
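
Splitting the channels as described above can be sketched as:

```python
import numpy as np

# Sketch: split the channel dimension of a [batch, H, W, channel] output
# into the heatmap (first third) and the PAF map (last two thirds).
def split_openpose(output):
    third = output.shape[-1] // 3
    return output[..., :third], output[..., third:]
```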

Unet++ (MindSpore)

The post-processing module of Unet++ (MindSpore) takes as input an NCHW tensor that has been processed by AIPP and on which the argmax operation has already been performed inside the model; it produces an NHW output tensor. Because argmax has already been applied, the C-channel information is folded into the H x W values of the tensor.

Figure 24 Unet++ (MindSpore)