Object Detection: A Computer Vision Methodology

Similoluwa Fiyinfoluwa
7 min read · Feb 26, 2022


During the COVID-19 pandemic, access to events often depended on machines or gadgets that simply scanned faces to check whether people were wearing nose masks. Ever wondered how a machine can tell that from a simple scan? Object Detection is the answer to all of your questions.

In this article, you’ll learn about:

  • What is Object Detection
  • How Object Detection works
  • Models used for Object Detection
  • Applications of Object Detection

What is Object Detection

Object detection is a computer vision technique that identifies and locates objects within an image or video. To be precise, object detection draws bounding boxes around these detected objects, which allow us to locate where said objects are in a given scene/image.

For example:

Object Detection is often confused with Image Recognition, so here’s the difference:

The major difference between Image Recognition and Object Detection is that Object Detection uses bounding boxes to identify and locate each object in an image/scene, while Image Recognition just assigns a label to the whole image, even when it contains more than one of that object. This means Object Detection is more detailed than Image Recognition.

How Object Detection works

Object Detection puts a bounding box around each detected object. This lets us tell the exact location of the objects in a given scene.
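To make the idea of a bounding box concrete, here is a minimal sketch of what a detector's output for one image could look like. The `Detection` class, its field names and the sample values are assumptions made up for illustration; real libraries use their own formats.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical format, for illustration only)."""
    label: str    # predicted class name
    score: float  # confidence between 0 and 1
    x_min: float  # left edge of the bounding box, in pixels
    y_min: float  # top edge
    x_max: float  # right edge
    y_max: float  # bottom edge

    def center(self):
        # Midpoint of the box: where the object is located in the image.
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    def area(self):
        # Width times height, clamped so a malformed box never gives a negative area.
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


# Made-up output for one image that contains two objects of the same class.
detections = [
    Detection("mask", 0.92, 40, 30, 120, 110),
    Detection("mask", 0.81, 200, 50, 260, 130),
]

for det in detections:
    print(det.label, det.score, det.center(), det.area())
```

An image recognition model would return a single label for the same image, while the list above keeps one entry, with its own location, per object.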

The following are the ways Object Detection can be done:

  • Object detection with deep learning

Common examples are CNN-based models such as the R-CNN family and YOLO.

There are two key approaches to get started:

  1. Create and train a custom object detector: To train a custom object detector, you need to design a network architecture that learns the features of the objects of interest. You also need to compile a very large set of labeled data to train the CNN.
  2. Use a pre-trained object detector: This approach lets you start with a pre-trained network and then fine-tune it for your application. It gives faster results because the network has already learned features from a large dataset.
  • Single-stage network

Here, anchor boxes are used. The CNN produces predictions for regions across the entire image using anchor boxes (predefined boxes of fixed sizes and aspect ratios), and the predictions are decoded to generate the final bounding boxes.
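The decoding step can be sketched as follows. This assumes the center/size offset parameterization commonly used with anchor boxes (shift the anchor's center, scale its width and height exponentially); the article does not say which encoding a given detector uses, so treat the function name and formula as one plausible choice rather than the definitive one.

```python
import math


def decode_box(anchor, offsets):
    """Turn predicted offsets into a box, relative to an anchor.

    anchor  = (center_x, center_y, width, height)
    offsets = (t_x, t_y, t_w, t_h) as predicted by the network (assumed encoding)
    Returns corner coordinates (x_min, y_min, x_max, y_max).
    """
    a_cx, a_cy, a_w, a_h = anchor
    t_x, t_y, t_w, t_h = offsets
    cx = a_cx + t_x * a_w      # shift the center by a fraction of the anchor width
    cy = a_cy + t_y * a_h      # shift the center by a fraction of the anchor height
    w = a_w * math.exp(t_w)    # scale the width (exp keeps it positive)
    h = a_h * math.exp(t_h)    # scale the height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


anchor = (100.0, 100.0, 50.0, 80.0)
print(decode_box(anchor, (0.0, 0.0, 0.0, 0.0)))   # zero offsets give back the anchor itself
print(decode_box(anchor, (0.2, -0.1, 0.0, 0.0)))  # same size, shifted right and up
```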

  • Two-stage network

There are two stages here:

  1. Identifies region proposals, or subsets of the image that might contain an object.
  2. Classifies the objects within the region proposals.
  • Object detection with machine learning

This technique is also commonly used for Object Detection, and it differs from the deep learning approaches above in how features are obtained.

You can decide to start with a pre-trained object detector or create a custom object detector. Either way, you need to manually select the identifying features for an object, whereas a deep learning network learns them automatically.

Common machine learning techniques:

  1. Aggregate channel features (ACF).
  2. SVM classification using histogram of oriented gradients (HOG) features.
  3. The Viola-Jones algorithm for human face or upper body detection.

Models used for Object Detection

  • CNN family
  1. Region-based Convolutional Neural Network (R-CNN)
  2. Fast Region-based Convolutional Neural Network (Fast R-CNN)
  3. Faster Region-based Convolutional Neural Network (Faster R-CNN)
  • You Only Look Once (YOLO)
  • RetinaNet
  • Single Shot Detector (SSD)
  • Histogram of Oriented Gradients (HOG)
  • Region-based Fully Convolutional Network (R-FCN)

This article will focus on the CNN family and YOLO.

  • CNN family

Before explaining the models in this family, a brief look at CNN:

A CNN (Convolutional Neural Network) is a powerful neural network that extracts features from images. It does this in such a way that the position information of pixels is retained. CNNs help the computer identify and learn from images. A CNN can be implemented with frameworks like TensorFlow, Keras and PyTorch, or even with NumPy.
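To see how a convolution extracts a feature while keeping position information, here is a minimal pure-Python sketch of a single filter sliding over a tiny image. The image, the filter values and the function name are made up for illustration; a real CNN learns its filters and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like most deep learning frameworks, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# A 4x4 image that is dark on the left and bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# A hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row is [0, 18, 0]
```

The large value sits in the middle column of the feature map, which is exactly where the edge is in the input: the feature is detected and its position is preserved.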

CNNs are used for facial recognition, digitization of paper documents (OCR) and IoT devices.

1. R-CNN

R-CNN proposes a number of boxes in the image and checks whether any of these boxes contains an object. It uses selective search to extract these boxes (also known as regions) from an image.

Selective search: This method uses the patterns (colors, textures, enclosure and varying scales) in an image to propose various regions.

How Selective search works:

  • Takes an image as input.
  • Creates initial sub-segmentations, so there are many regions of the image.
  • Combines similar regions into larger ones, step by step. These regions produce the final object locations.

How R-CNN works:

  • Takes a pre-trained CNN.
  • Retrains the model (the number of classes that need to be detected determines the last layer of the network to be trained).
  • Gets the RoI (Region of Interest) for each image using selective search.
  • Reshapes these regions to match the CNN input size.
  • Trains SVMs (Support Vector Machines) to classify the objects and background in the image. One binary SVM is trained for each class.
  • Trains a Linear Regression model to create tighter bounding boxes for each identified object in the image.

However, R-CNN is very slow, because every proposed region (around 2,000 per image) is passed through the CNN separately.

Illustration of R-CNN:

2. Fast R-CNN

Since R-CNN is very slow and expensive, Ross Girshick (author of R-CNN) came up with the idea of Fast R-CNN.

How Fast R-CNN works:

  • Takes an image as an input.
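  • Passes the image through a CNN once to produce a feature map, and gets the RoI (Region of Interest) on it from selective search proposals.
  • An RoI pooling layer is applied on these regions to reshape them to a fixed size. Each region is then passed to a fully connected network.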

The pooling layer is an important layer that down-samples the feature maps coming from the previous layer and produces new feature maps with a reduced resolution.

  • A softmax layer is used to output the classes. A Linear Regression layer is used alongside it to output the bounding box (bbox) coordinates for the predicted classes.

The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution. This means softmax is used for multi-class classification problems, where there are more than two class labels. Simply put, a softmax layer takes in a number of values for each RoI and converts them to numbers between 0 and 1 that sum to 1.

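The softmax function, for a vector of K scores z:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$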
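As a minimal sketch, softmax can be written in a few lines of plain Python. The three input scores are made-up values standing in for the raw class scores of one RoI; subtracting the maximum first is a common trick for numerical stability and does not change the result.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the largest score avoids overflow in exp() for big inputs.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(sum(probs))                    # 1.0, up to floating-point rounding
```

The class with the highest raw score also gets the highest probability, so the predicted class is simply the position of the largest value.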

Illustration of Fast R-CNN:

3. Faster R-CNN

The major difference between Fast R-CNN and Faster R-CNN is that Fast R-CNN uses selective search to generate RoI while Faster R-CNN uses RPN (Region Proposal Network).

How Faster R-CNN works:

  • Takes an image as input and passes it to a CNN. The CNN returns a feature map (generated by applying filters, also called feature detectors).
  • An RPN is applied to the feature map. It returns object proposals along with their objectness scores (a measure of how likely each proposal is to contain an object rather than background).
  • RoI pooling layer is applied on the proposals to make all the proposals the same size.
  • The proposals are then passed to a fully connected layer with softmax layer and Linear Regression layer to classify and output the bounding boxes for objects.
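The RoI pooling step, which makes every proposal the same size, can be sketched as below. This is a simplified version that assumes the RoI is given in whole feature-map cells and is at least as large as the output grid; the function name, the tiny feature map and the two RoIs are made up for illustration.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one region of a 2D feature map down to out_size x out_size.

    roi = (row_start, col_start, row_end, col_end), end-exclusive,
    and assumed to span at least out_size cells in each direction.
    """
    r0, c0, r1, c1 = roi
    height, width = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_size):
        # Row range covered by the i-th bin.
        rs = r0 + (i * height) // out_size
        re = r0 + ((i + 1) * height) // out_size
        row = []
        for j in range(out_size):
            # Column range covered by the j-th bin.
            cs = c0 + (j * width) // out_size
            ce = c0 + ((j + 1) * width) // out_size
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 1, 2, 3],
    [4, 5, 6, 7, 8, 9],
    [1, 3, 5, 7, 9, 2],
]

# Two proposals of different sizes both come out as 2x2.
print(roi_max_pool(feature_map, (0, 0, 4, 6)))  # [[9, 6], [6, 9]]
print(roi_max_pool(feature_map, (1, 1, 4, 4)))  # [[8, 9], [5, 7]]
```

Because every proposal ends up with the same shape, all of them can be fed to the same fully connected layers.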

Illustration of Faster R-CNN:

  • You Only Look Once (YOLO)

While the R-CNN family uses regions to localize objects, YOLO takes the entire image in a single pass and predicts the bounding box coordinates and class probabilities for these boxes. YOLO is known for its superb speed.

How YOLO works:

  • Takes an input image.
  • Divides the input image into a grid (S x S cells, where S can be any positive integer).
  • Image classification and localization are applied on each grid cell.
  • Predicts the bounding boxes and their probabilities for objects.
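The grid step above can be sketched as follows: given the center of an object, find the grid cell it falls in and where it sits inside that cell. This follows the convention of the original YOLO design, in which the cell containing an object's center is responsible for predicting it; the function name, image size and coordinates are made up for illustration.

```python
def responsible_cell(box_center, image_size, grid_size):
    """Return (row, col, x_offset, y_offset) for an object's center.

    row/col identify the grid cell that contains the center;
    the offsets (between 0 and 1) say where the center lies inside that cell.
    """
    cx, cy = box_center
    width, height = image_size
    grid_x = cx * grid_size / width   # center position measured in cells
    grid_y = cy * grid_size / height
    # Clamp so a center on the right or bottom border stays in the last cell.
    col = min(int(grid_x), grid_size - 1)
    row = min(int(grid_y), grid_size - 1)
    return row, col, grid_x - col, grid_y - row


# A 448 x 448 image split into a 7 x 7 grid (S = 7), so each cell is 64 pixels wide.
row, col, x_off, y_off = responsible_cell((300, 120), (448, 448), 7)
print(row, col)      # 1 4
print(x_off, y_off)  # 0.6875 0.875
```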

Illustration of YOLO:

  • RetinaNet

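RetinaNet is a single-stage object detection model that has proven to work well with dense and small-scale objects. It combines a feature pyramid network with a focal loss that keeps the many easy background examples from dominating training.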

  • Single Shot Detector (SSD)

As its name suggests, this algorithm needs only a single shot (one pass of the network over the image/scene) to detect the objects in it.

  • Histogram of Oriented Gradients (HOG)

This algorithm uses the distribution of gradient orientations in an image for its Object Detection. It is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.
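Here is a minimal sketch of the core idea for a single cell: compute the gradient at each pixel and add its magnitude to a histogram bin chosen by its orientation. It is deliberately simplified (only interior pixels, no vote interpolation between bins, no block normalization), and the 9-bin, 0 to 180 degree layout is a common choice assumed here, not something the article specifies.

```python
import math


def cell_histogram(cell, bins=9):
    """Histogram of unsigned gradient orientations (0 to 180 degrees) for one cell."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    rows, cols = len(cell), len(cell[0])
    # Border pixels are skipped because they lack a neighbour on one side.
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]  # horizontal change
            gy = cell[r + 1][c] - cell[r - 1][c]  # vertical change
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by gradient strength.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# A 4x4 cell with a vertical edge: dark on the left, bright on the right.
cell = [
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
    [0, 0, 10, 10],
]
print(cell_histogram(cell))  # all the weight lands in the first bin (horizontal gradient)
```

A classifier such as an SVM is then trained on these histograms, concatenated over all cells of a detection window.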

  • Region-based Fully Convolutional Network (R-FCN)

As its name suggests, R-FCN is a region-based Object Detection model. R-FCN can be implemented with Keras and TensorFlow.

Applications of Object Detection

  • Retail Stores: People-counting systems placed in retail stores gather information about how customers spend their time in the store. This helps to understand customer interaction and customer experience and to make operations more efficient.
  • Autonomous Driving: Self-driving cars depend on object detection to recognize pedestrians, traffic signs, other vehicles, animals and more. The car's systems need to identify, locate, and track objects around them in order to move through the world safely and efficiently. Object Detection helped make self-driving cars a reality.
  • Animal monitoring: Object Detection is used in agriculture for tasks such as counting and monitoring animals and evaluating agricultural products. Damaged produce can be detected with machine learning algorithms while it is being processed.
  • Video Surveillance: Many security applications in video surveillance use Object Detection, for example to detect people in restricted or dangerous areas, to help prevent suicide, or to carry out inspection tasks.
  • Crowd counting: This is another application of object detection. For densely populated areas like parks and malls, Object Detection can help businesses measure different kinds of traffic, whether on foot, in vehicles, or otherwise.
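Counting applications like the ones above usually come down to counting the detections per class in each frame. A minimal sketch, assuming a detector that returns (label, score) pairs; the sample detections and the 0.5 confidence threshold are made up for illustration.

```python
from collections import Counter


def count_objects(detections, min_score=0.5):
    """Count detections per class, ignoring low-confidence ones."""
    return Counter(label for label, score in detections if score >= min_score)


# Made-up detector output for one video frame.
detections = [("person", 0.91), ("person", 0.48), ("car", 0.77), ("person", 0.66)]

counts = count_objects(detections)
print(counts["person"], counts["car"])  # 2 1
```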

I guess you now have the answers to the questions you had at the beginning of this article.

Thanks for reading!