With the advent of deep learning, implementing an object detection system has become fairly trivial. There are a great many frameworks facilitating the process, and as I showed in a previous post, it’s quite easy to create a fast object detection model with YOLOv5.

However, understanding the basics of object detection is still quite difficult. It involves a lot of math, and the variable number of outputs/bounding boxes makes it harder to understand than image classification, where we know the number of outputs beforehand. With so many moving parts and new concepts introduced over the history of object detection, it certainly hasn’t gotten easier.

In this post, I’ll distill all this history into a simple guide that explains all the details of object detection and instance segmentation systems.

Contents:

- Classification+Localization
- Object Detection
  1. R-CNN
  2. Fast R-CNN
  3. Faster R-CNN
  4. YOLO
- Instance Segmentation
- Conclusion

## Introduction

The classic image classification problem is very well known: given an image, can you find the class the image belongs to?

We can solve most new image classification problems with ConvNets and transfer learning, using pre-trained nets as fixed feature extractors. That said, there are still lots of other interesting problems in the image domain.

We can divide these problems into 4 major buckets, with the image above as reference. I’ll put a concise description of each below, and then we’ll jump into a deep dive:

- **Semantic Segmentation**: Given an image, can we classify each pixel as belonging to a particular class?
- **Classification+Localization**: We were able to classify an image as a cat, but can we also get the location of said cat by drawing a bounding box around it? Here we assume there are a fixed number of objects (usually 1) in the image.
- **Object Detection**: A more general case of the Classification+Localization problem. In a real-world setting, we won’t know how many objects are in the image beforehand. Can we detect all the objects in an image and draw bounding boxes around them?
- **Instance Segmentation**: Can we create masks for each individual object in an image? This differs from semantic segmentation in that if you look at the 4th image above, we won’t be able to distinguish between the two dogs using semantic segmentation because it will merge them.

As you can see, these problems are closely related and differ only in the details. In this post I’ll focus mainly on **object detection and instance segmentation** as they are the most interesting. We’ll go through the four most famous techniques for object detection, and look at how they improved with time and new ideas.

## Classification+Localization

Let’s first understand how we can solve the problem when we have a single object in the image. This is the **classification+localization** case.

The answer is to treat localization as a regression problem.

### Input Data

First let’s talk about what sort of data the model expects. Normally in an image classification setting, we have data in the form (X,y), where X is the image and y is the class label.

In the classification+localization setting, our data in the form of (X,y) has X as the image, and y as an array containing (class_label, x,y,w,h). Here’s a breakdown:

x = bounding box top left corner x-coordinate

y = bounding box top left corner y-coordinate

w = width of the bounding box in pixels

h = height of the bounding box in pixels

### Model

In this setting, we create a *multi-output model* which takes an image as the input and has (n_labels + 4) output nodes: n_labels nodes, one for each output class, and 4 nodes that give the predictions for (x,y,w,h).

### Loss

Normally, the loss is a weighted sum of the Softmax Loss (from the classification problem) and the regression L2 loss (from the bounding box coordinates).

Loss = alpha*Softmax_Loss + (1-alpha)*L2_Loss

Since these two losses are on a different scale, the alpha hyper-parameter needs to be tuned.
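To make the weighted sum concrete, here is a minimal sketch in plain Python for a single image. The function names, the example numbers, and the choice of a summed squared error for the L2 term are my own assumptions for illustration; a real model would compute this with a deep learning framework over a batch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def localization_loss(class_logits, box_pred, class_label, box_true, alpha=0.5):
    """alpha * Softmax_Loss + (1 - alpha) * L2_Loss for one image."""
    probs = softmax(class_logits)
    softmax_loss = -math.log(probs[class_label])          # cross-entropy
    l2_loss = sum((p - t) ** 2 for p, t in zip(box_pred, box_true))
    return alpha * softmax_loss + (1 - alpha) * l2_loss

# Hypothetical output of a model with n_labels = 3: class scores, then (x, y, w, h).
n_labels = 3
output = [2.0, 0.5, -1.0, 48.0, 30.0, 100.0, 80.0]
loss = localization_loss(output[:n_labels], output[n_labels:],
                         class_label=0, box_true=[50.0, 32.0, 100.0, 84.0])
print(round(loss, 3))
```

Note how the box term (measured in squared pixels) dwarfs the classification term here, which is exactly why alpha has to be tuned.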

There is one thing to note here: we are trying to do an object localization task but still have our ConvNets in place. We add one more output layer to also predict the coordinates of the bounding box and tweak our loss function.

Herein lies the essence of the whole deep learning framework: stack layers on top of each other, reuse components to create better models, and create architectures to solve your problems. We’ll see a lot more of that going forward.

## Object Detection

So how does this idea of localization using regression get mapped to object detection? The short answer is, it doesn’t. We don’t have a fixed number of objects, so we can’t have 4 outputs denoting the bounding box coordinates.

One naive idea is to apply a convolutional neural network to many different crops of the image. The CNN classifies each crop as an object class or background class. This is intractable, because the number of possible crops (every position, size, and aspect ratio) is enormous.

### Region Proposals:

If only there were a method that could automatically find a smaller number of cropped regions likely to contain objects (a region proposal method). We could then run our convnet on just those regions and be done with our object detection.

This is the basic idea behind R-CNN, the first major success in object detection, and selective search (Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013) is the method it uses to get those regions.

So what are region proposals?

- A region proposal finds “blobby” image regions that are likely to contain objects, and is relatively fast to run; e.g. Selective search gives 2000 region proposals in a few seconds on CPU.

But how exactly are the region proposals made?

### Selective Search for Object Recognition:

This paper finds regions in two steps. First, we start with a set of some initial regions (P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient Graph-Based Image Segmentation”, IJCV, 59:167–181, 2004):

Graph-based image segmentation techniques generally represent the problem in terms of a graph G = (V, E) where each node v ∈ V corresponds to a pixel in the image, and the edges in E connect certain pairs of neighboring pixels.

In the paper they take the following approach:

Each edge (vi, vj) ∈ E has a corresponding weight w((vi, vj)), which is a non-negative measure of the dissimilarity between neighboring elements vi and vj. In the graph-based approach, a segmentation S is a partition of V into components such that each component (or region) C ∈ S corresponds to a connected component in a graph.

Put simply, the researchers use graph-based methods to find connected components in an image. The edges are made on some measure of similarity between pixels.

As you can see, if we create bounding boxes around these masks we’ll lose a lot of regions. We want the whole baseball player in a single bounding box/frame. This means we need to group these initial regions, which is the second step.

For this step, the authors of *Selective Search for Object Recognition* apply the Hierarchical Grouping algorithm to these initial regions. In this algorithm, they repeatedly merge the most similar neighboring regions, where similarity is measured by color, texture, size, and fill. This provides us with much better region proposals.

## 1. R-CNN

So now that we have our region proposals, how exactly do we use them in R-CNN?

Object detection system overview. The system:

(1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs.

Along with this, there is a class-specific bounding box regressor that takes:

**Input:** (Px, Py, Ph, Pw) — the location of the proposed region.

**Target:** (Gx, Gy, Gh, Gw) — Ground truth labels for the region.

The goal is to learn a transformation that maps the proposed region (P) to the ground truth box (G).
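As a sketch of what that transformation looks like, the R-CNN paper parameterizes it as a scale-invariant shift of the box center plus a log-space scaling of width and height. The code below follows that parameterization and assumes (Px, Py) and (Gx, Gy) are box centers; the function names are hypothetical.

```python
import math

def regression_targets(P, G):
    """Targets (tx, ty, tw, th) that map proposal P onto ground truth G.

    Both boxes are (x, y, h, w) with (x, y) the box center, matching the
    (Px, Py, Ph, Pw) ordering used above.
    """
    px, py, ph, pw = P
    gx, gy, gh, gw = G
    tx = (gx - px) / pw          # center shift, relative to proposal width
    ty = (gy - py) / ph          # center shift, relative to proposal height
    tw = math.log(gw / pw)       # log-space width scaling
    th = math.log(gh / ph)       # log-space height scaling
    return tx, ty, tw, th

def apply_targets(P, t):
    """Invert regression_targets: move proposal P by the predicted offsets t."""
    px, py, ph, pw = P
    tx, ty, tw, th = t
    return (px + tx * pw, py + ty * ph, ph * math.exp(th), pw * math.exp(tw))

P = (100.0, 100.0, 50.0, 40.0)
G = (110.0, 95.0, 60.0, 50.0)
t = regression_targets(P, G)
print(t, apply_targets(P, t))
```

The regressor is trained to predict `t` from the ConvNet features of the proposal, and `apply_targets` is what turns its prediction back into a box at test time.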

### Training R-CNN

What is the input to an RCNN?

We have an image, region proposals from selective search, and the ground truth (class labels and ground-truth boxes). Next, we treat all region proposals with ≥ 0.5 IoU (Intersection over Union) overlap with a ground-truth box as positive training examples for that box’s class and the rest as negatives. Every region proposal thus becomes a training example: the ConvNet gives a feature vector for the proposal, and we train one SVM per class (n SVMs for n classes) on those feature vectors.
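The labeling rule is easy to express in code. This is a minimal sketch that assumes boxes in the (x, y, w, h) top-left format introduced earlier, uses label 0 for background, and assigns each proposal the class of its best-overlapping ground-truth box; `iou` and `label_proposals` are illustrative names, not part of any library.

```python
def iou(a, b):
    """Intersection over Union of two (x, y, w, h) boxes (top-left corner)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, gt_labels, threshold=0.5, background=0):
    """Give each proposal the class of its best ground-truth match, or background."""
    labels = []
    for proposal in proposals:
        best_iou, best_label = 0.0, background
        for gt_box, gt_label in zip(gt_boxes, gt_labels):
            overlap = iou(proposal, gt_box)
            if overlap >= threshold and overlap > best_iou:
                best_iou, best_label = overlap, gt_label
        labels.append(best_label)
    return labels

proposals = [(0, 0, 10, 10), (1, 0, 10, 10), (50, 50, 5, 5)]
print(label_proposals(proposals, gt_boxes=[(0, 0, 10, 10)], gt_labels=[3]))
```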

### Test Time R-CNN

At test time, we predict detection boxes using class-specific SVMs. We’ll get a lot of overlapping detection boxes at the time of testing, so non-maximum suppression is an integral part of the object detection pipeline.

Non-maximum suppression first sorts all detection boxes by score. The detection box M with the maximum score is selected, and all other detection boxes that overlap it significantly (IoU above a pre-defined threshold) are suppressed.

This process is repeated on the remaining boxes until every box has been either selected or suppressed.
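Here is a minimal sketch of greedy non-maximum suppression in plain Python. It assumes (x, y, w, h) boxes with a top-left corner and a threshold of 0.5, both of which are illustrative choices; in a detector it would typically be run separately for each class.

```python
def iou(a, b):
    """Intersection over Union of two (x, y, w, h) boxes (top-left corner)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                  # box M with the maximum score
        keep.append(best)
        # Suppress every remaining box that overlaps M too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # the second box overlaps the first and is dropped
```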

### Problems with R-CNN:

- Training is slow.
- Inference (detection) is slow: 47s / image with VGG16, since the convnet needs to be run once for every region proposal.

There’s a need for speed. That brings us to Fast R-CNN.

## 2. Fast R-CNN

Here’s the idea for Fast R-CNN: Why not create a convolutional map of the input image and then just select the regions from that convolutional map? Do we really need to run so many ConvNets? Let’s run a single convnet and apply region proposal crops on the features calculated by the convnet, then use a simple SVM/classifier to classify those crops.

That idea looks like this:

From the paper linked above: The figure illustrates the Fast R-CNN architecture. A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.

**The basic idea is to run the convolutional network only once per image, instead of once per region proposal as in R-CNN**. We then map each ROI proposal onto the output of the last convolutional layer and crop the corresponding features. All that’s left is to run a final classifier on those crops.

This idea depends upon the architecture of the model being used too, so the authors propose the following architecture:

We experiment with three pre-trained ImageNet [4] networks, each with five max pooling layers and between five and thirteen conv layers (see Section 4.1 for network details). When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations. First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16). Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors). Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

Don’t worry if you don’t understand the above just yet. It’s quite confusing, so let’s break it down. But to do that, we need to see the VGG16 architecture first.

The last pooling layer is 7x7x512. This is the layer the network authors intend to replace by the ROI pooling layers. This pooling layer has as input the location of the region proposal (xmin_roi,ymin_roi,h_roi,w_roi) and the previous feature map (14x14x512).

The location of ROI coordinates is in the units of the input image i.e. 224×224 pixels. However, the layer on which we have to apply the ROI pooling operation is 14x14x512.

Because we are using VGG, we transform the image (224 x 224 x 3) into (14 x 14 x 512), i.e. the height and width are divided by 16. We can map ROI coordinates onto the feature map by dividing them by 16.
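The mapping can be sketched in a few lines. This version assumes integer pixel coordinates in (x, y, w, h) format and rounds outward (floor for the top-left corner, ceil for the bottom-right) so the mapped region always covers the proposal; the exact rounding is a choice I made for illustration, and implementations differ on it.

```python
def roi_to_feature_map(roi, stride=16):
    """Map an (x, y, w, h) ROI in image pixels onto feature-map cells."""
    x, y, w, h = roi
    x0, y0 = x // stride, y // stride      # floor of the top-left corner
    x1 = -(-(x + w) // stride)             # ceil of the right edge
    y1 = -(-(y + h) // stride)             # ceil of the bottom edge
    return (x0, y0, max(1, x1 - x0), max(1, y1 - y0))

# A proposal on the 224x224 input and the cells it covers on the 14x14 map.
print(roi_to_feature_map((32, 64, 96, 128)))
```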

In its depth, the convolutional feature map has encoded all the information for the image while maintaining the location of the “things” it has encoded relative to the original image. For example, if there was a red square on the top left of the image and the convolutional layers activate for it, then the information for that red square would still be on the top left of the convolutional feature map.

### What is ROI pooling?

Remember that the final classifier runs for each crop. Each crop needs to be of the same size, and that is what ROI Pooling does.

In the above image, our region proposal is (0,3,5,7) in x,y,w,h format. We divide that area into 4 regions since we want to have an ROI pooling layer of 2×2. We divide the whole area into buckets by rounding 5/2 and 7/2 and then just do a max-pool.

How do you do ROI pooling on areas smaller than the target size, for example when the region proposal is 5×5 and the ROI pooling layer size is 7×7? In this case, we resize to 35×35 by copying each cell seven times in each direction and then max-pool back to 7×7.
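Both cases can be handled by one small routine. The sketch below works on a single-channel feature map stored as a list of lists and uses floor/ceil bin edges, which gives the same result as the copy-then-pool trick for small regions; when the region size doesn't divide evenly, neighboring bins share a boundary row or column. The function name and the toy feature map are assumptions for illustration.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the (x, y, w, h) region of a 2D feature map to out_h x out_w."""
    x, y, w, h = roi
    pooled = []
    for i in range(out_h):
        r0 = y + (i * h) // out_h                 # floor of the bin's top edge
        r1 = y + -(-((i + 1) * h) // out_h)       # ceil of the bin's bottom edge
        row = []
        for j in range(out_w):
            c0 = x + (j * w) // out_w             # floor of the bin's left edge
            c1 = x + -(-((j + 1) * w) // out_w)   # ceil of the bin's right edge
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# Toy 10x8 feature map whose values increase left-to-right and top-to-bottom.
feature_map = [[r * 8 + c for c in range(8)] for r in range(10)]
print(roi_max_pool(feature_map, (0, 3, 5, 7), 2, 2))      # the 2x2 example above
print(len(roi_max_pool(feature_map, (0, 0, 5, 5), 7, 7)))  # a 5x5 region pooled to 7x7
```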

After replacing the pooling layer, the authors also replace the 1000-way ImageNet classification layer with a fully connected layer and softmax over K + 1 categories (+1 for background) and category-specific bounding-box regressors.

### Training Fast R-CNN

What is the input to a Fast R-CNN? It’s similar to R-CNN: we have an image, region proposals from selective search, and the ground truth (class labels and ground-truth boxes).

Next, we treat all region proposals with ≥ 0.5 IoU (Intersection over Union) overlap with a ground-truth box as positive training examples for that box’s class and the rest as negatives, so every ROI becomes a training example. The main difference from R-CNN is that we now have dense layers on top instead of SVMs, trained with a multi-task loss: a Fast R-CNN network has two sibling output layers.

The first outputs a **discrete probability distribution** (per RoI), p = (p0, . . . , pK), over K + 1 categories. As usual, p is computed by a softmax over the K+1 outputs of a fully connected layer.

The second sibling layer outputs **bounding-box regression offsets**, t= (tx, ty, tw, th), for each of the K object classes. Each training RoI is labelled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labelled RoI to jointly train for classification and bounding-box regression.

In this loss, Lcls is the softmax classification loss and Lloc is the regression loss. u=0 denotes the background class, which has no ground-truth box, so Lloc is added to the loss only when the RoI belongs to one of the other classes (u ≥ 1).
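For reference, the multi-task loss as I recall it from the Fast R-CNN paper can be written as follows, where [u ≥ 1] is 1 for object classes and 0 for background, and λ balances the two terms (the paper uses a smooth L1 penalty for the box term):

```latex
L(p, u, t^{u}, v) = L_{cls}(p, u) + \lambda \, [u \ge 1] \, L_{loc}(t^{u}, v)

L_{cls}(p, u) = -\log p_{u}

L_{loc}(t^{u}, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\left(t^{u}_{i} - v_{i}\right),
\qquad
\mathrm{smooth}_{L_1}(z) =
\begin{cases}
0.5\, z^{2} & \text{if } |z| < 1 \\
|z| - 0.5 & \text{otherwise}
\end{cases}
```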

### Problem:

Region proposals still take up most of our time. Can we reduce the time taken for region proposals?

## 3. Faster R-CNN

The next question that pushed research to the next level was: Can the network do region proposals itself?

From the Faster RCNN paper: The intuition is that with FastRCNN we’re already computing an activation map in the CNN, so why not run the activation map through a few more layers to find the interesting regions, then finish the forward pass by predicting the classes + bbox coordinates?

### How does the region proposal network work?

One of the main ideas in the paper is the idea of anchors. **Anchors** are fixed bounding boxes placed throughout the image with different sizes and ratios that are used for reference when first predicting object locations.

First of all, we define anchor centers on the image.

In the case of VGG16, the anchor centers are separated by 16 px, because the final convolutional layer (14x14x512) subsamples the image by a factor of 16 (224/14).
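A minimal sketch of anchor generation, assuming a 224×224 input and a stride of 16 as above. The three scales and three aspect ratios are the defaults I recall from the Faster R-CNN paper, and placing each center in the middle of its 16×16 cell is my own assumption; both are parameters here.

```python
import math

def generate_anchors(image_size=224, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (center_x, center_y, width, height) tuples."""
    anchors = []
    for cy in range(stride // 2, image_size, stride):      # one center per cell
        for cx in range(stride // 2, image_size, stride):
            for scale in scales:
                for ratio in ratios:                        # ratio = height / width
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)            # area stays scale ** 2
                    anchors.append((cx, cy, w, h))
    return anchors

anchors = generate_anchors()
print(len(anchors))   # 14 x 14 centers, 9 anchors each
```

Many of these anchors extend past the image border; how those are treated during training is a separate design choice.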

This is what anchors look like:

- We start with anchors: predefined regions where we think our objects could be.
- Our region proposal network (RPN) classifies which anchors contain an object and regresses the offset from the anchor to the object bounding box. Training labels follow the same logic as before: 1 if the IoU of the anchor with a ground-truth bounding box is > 0.5, 0 otherwise.
- Non-maximum suppression reduces the number of region proposals.
- The Fast R-CNN detection network runs on top of the proposals.

### Faster-RCNN Loss

The whole network is then jointly trained with 4 losses:

- RPN classify object / not object.
- RPN regress box coordinates offset.
- Final classification score (object classes).
- Final box coordinates offset.

## 4. YOLO

Another architecture for object detection is YOLO by Joseph Redmon et al., which came out in 2016, shortly after Fast R-CNN. YOLO stands for You Only Look Once. There have been multiple versions of this architecture, but I will only talk about the first version, as the others build on it.

From the paper: Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

In YOLO the authors have once again reframed the object detection problem as a single regression problem. As per the paper:

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes (x, y, w, h) and confidence (C) scores for those boxes.

The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

Each grid cell also predicts conditional class probabilities p = Pr(Class(i) |Object). These probabilities are conditioned on the grid cell containing an object.

So we have an S×S grid with each cell predicting B bounding boxes. Each bounding box has 5 predictions (x,y,w,h,C), and each cell additionally has num_classes class probability predictions (p), where:

**x, y**: center of the box relative to the grid cell

**w, h**: width and height of bounding box

**C**: IOU between the predicted box and the ground truth box. During training, C is the area that the predicted box and the ground truth box have in common (the numerator) divided by the area of their union (the denominator).

**p(i) (where i ranges over the num_classes classes)**: Probability of class i given an object is present in the cell. As per the paper, if the center of an object falls into a grid cell, that grid cell contains the object.

Thus the output of YOLO is encoded as an S × S × (B ∗ 5 + num_classes) tensor. For PASCAL VOC, the authors use S = 7, i.e. a 7×7 grid, with B = 2. As PASCAL VOC has 20 labelled classes, the final prediction is a 7 × 7 × 30 tensor.
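To see how a ground-truth box lands in this encoding, here is a small sketch. It assumes the box is given as (cx, cy, w, h) in pixels with (cx, cy) the box center; `encode_box` and `output_shape` are hypothetical helper names.

```python
def output_shape(S=7, B=2, num_classes=20):
    """Shape of the YOLO prediction tensor: S x S x (B * 5 + num_classes)."""
    return (S, S, B * 5 + num_classes)

def encode_box(box, image_w, image_h, S=7):
    """Find the responsible grid cell and the box values relative to it.

    Returns (row, col, (x, y, w, h)), where x, y are offsets within the cell
    in [0, 1) and w, h are fractions of the whole image.
    """
    cx, cy, w, h = box
    col = min(int(cx / image_w * S), S - 1)   # cell containing the box center
    row = min(int(cy / image_h * S), S - 1)
    x = cx / image_w * S - col                # center relative to the cell bounds
    y = cy / image_h * S - row
    return row, col, (x, y, w / image_w, h / image_h)

print(output_shape())                              # PASCAL VOC settings
print(encode_box((224, 224, 100, 50), 448, 448))   # a box centered in the image
```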

The Loss function optimized is below:
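Written out in LaTeX, the loss from the YOLO paper as I recall it is the following, indexing cells i from 1 to S² and boxes j from 1 to B; hatted symbols are the network’s predictions:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}^{obj}_{ij}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}^{obj}_{ij}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}^{obj}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{noobj} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}^{noobj}_{ij} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=1}^{S^2} \mathbb{1}^{obj}_{i} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```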

It looks hard to understand, but essentially the authors give a higher weight to the coordinate loss of boxes that contain an object (λcoord = 5) and a lower weight to the confidence loss of boxes that don’t contain an object (λnoobj = .5).

The boxes that contain an object are marked by the indicator variables. We iterate i from 1–49 as we use a 7×7 grid. 1^obj_i means that the ith cell contains an object. 1^obj_ij means that the jth bounding box of the ith cell is responsible for the object. So if the second bounding box (out of B predictions) of the first cell (out of 7×7 cells) really contains an object, the value of 1^obj_ij for i = 1, j = 2 is 1; otherwise it is 0.

The first term essentially is the loss for x,y coordinates. The second term is loss for the predicted widths and heights. The third term is loss from the confidence scores for each bounding box. The fourth term is there because we also want to add a loss if we predict a high confidence for a box even when there is no object in the box. The fifth term is essentially the classification loss where p_i is 0 when a class is not present in the grid cell and 1 if a class is present. Also each grid cell can belong only to a single class. The p_i_hat is the prediction of the classes in a grid cell.

At prediction time, first the boxes are filtered using a confidence threshold (normally 0.6) and then NMS is applied to the boxes to get the final output.

## Instance Segmentation

Now comes the most interesting part: instance segmentation. Can we create masks for each individual object in the image? Specifically something like this:

## Mask R-CNN

The basic idea of the Mask R-CNN paper is to add another output layer that predicts the mask, and to use ROIAlign instead of ROIPooling.

Mask R-CNN adopts the same two-stage procedure with an identical first stage (RPN). In the second stage however, in parallel to predicting the class and box offset, Mask R-CNN **also** outputs a binary mask for each RoI.

### ROIAlign vs ROIPooling

In ROI pooling we lose the exact location-based information. See how we arbitrarily divided our region into 4 different sized boxes. For a classification task, it works well.

But for providing masks on a pixel level, we don’t want to lose this information. So we don’t quantize the pooling layer, and we use bilinear interpolation to compute values that properly align the extracted features with the input. See how 0.8 differs from 0.88.
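The core of ROIAlign is sampling the feature map at real-valued positions. Below is a minimal sketch of bilinear interpolation on a single-channel map, plus a simplified ROIAlign that takes one sample at the center of each bin. The paper samples several points per bin and aggregates them, so the single sample is a simplification, and the function names are hypothetical.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Interpolate a 2D feature map at real-valued (x, y); x runs along columns."""
    height, width = len(feature_map), len(feature_map[0])
    x0 = min(max(int(math.floor(x)), 0), width - 1)
    y0 = min(max(int(math.floor(y)), 0), height - 1)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature_map, roi, out_h, out_w):
    """Pool a real-valued (x, y, w, h) ROI without rounding its coordinates."""
    x, y, w, h = roi
    bin_w, bin_h = w / out_w, h / out_h
    return [[bilinear_sample(feature_map, x + (j + 0.5) * bin_w, y + (i + 0.5) * bin_h)
             for j in range(out_w)]
            for i in range(out_h)]

feature_map = [[0.0, 10.0], [20.0, 30.0]]
print(bilinear_sample(feature_map, 0.5, 0.5))       # halfway between all four cells
print(roi_align(feature_map, (0.0, 0.0, 1.0, 1.0), 1, 1))
```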

### Training

During training, we define a multi-task loss on each sampled RoI as

L = Lcls + Lbox + Lmask

The classification loss Lcls and bounding-box loss Lbox are identical to those in Faster R-CNN. The mask branch has a K × m × m dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes.

To this, we apply a per-pixel sigmoid and define Lmask as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, Lmask is only defined on the kth mask (other mask outputs do not contribute to the loss).
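A minimal sketch of Lmask for a single RoI, assuming the mask branch output is a list of K masks, each an m × m list of raw scores, and the ground-truth mask is an m × m list of 0/1 values. The names and the small clamp that avoids log(0) are my own additions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mask_loss(mask_logits, gt_mask, k, eps=1e-12):
    """Average per-pixel binary cross-entropy on the k-th mask only."""
    total, count = 0.0, 0
    for logit_row, gt_row in zip(mask_logits[k], gt_mask):
        for z, target in zip(logit_row, gt_row):
            p = min(max(sigmoid(z), eps), 1.0 - eps)   # per-pixel sigmoid, clamped
            total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
            count += 1
    return total / count

# K = 2 classes, m = 2. The mask for class 1 is ignored when k = 0.
mask_logits = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[9.0, -9.0], [9.0, -9.0]],
]
gt_mask = [[1, 0], [1, 0]]
print(mask_loss(mask_logits, gt_mask, k=0))
print(mask_loss(mask_logits, gt_mask, k=1))
```

Because only the kth mask enters the loss, the masks for different classes don’t compete with each other, which is the point of using a per-pixel sigmoid rather than a softmax across classes.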

### Mask Prediction

The mask layer is K × m × m dimensional where K is the number of classes. The m×m floating-number mask output is resized to the RoI size and binarized at a threshold of 0.5 to get final masks.

## Conclusion

In this post I tried to give a simple guide to some of the most important advancements in the field of object detection and instance segmentation. This is my own understanding of these papers with input from many blogs and slides available online.

Object detection is a vast field and there are a variety of other methods dominating it, including U-net, SSD and YOLO. There is no dearth of resources to learn and understand them, so I recommend taking a look at them all, especially now that you have a solid understanding of the basics.
