Having explored the potential, power and promise of Recurrent Neural Networks in great detail, we are now ready to turn to a class of non-recurrent, deep, multi-layered, feed-forward artificial neural networks that has proven to be among the most successful means of recognising and analysing visual imagery. Welcome to the world of Convolutional Neural Nets (CNNs)!

CNNs are a variation of feed-forward, multi-layered perceptrons designed to require minimal preprocessing for tasks such as video recognition or image classification. For these tasks they are typically trained with supervised Machine Learning, that is, from labelled examples.

In this article, we will explore Convolutional Neural Networks (CNNs) at a high level and discuss how they are inspired by the structure of the human brain. We will then delve into their structural and functional components in greater detail.



Rebuilding the Human Visual Processing System

We are constantly analysing the world around us, primarily through our visual perception of it, coupled with some efficient and fine-tuned subconscious cognitive processing. Without conscious effort, we make predictions about everything we see and act upon them. When we see something, we automatically label every object based on what we have learned about it in the past. To demonstrate this idea, let us look at the picture below:



Without any prior knowledge or background information regarding this image, one could still arrive at conclusions such as “that’s a happy little boy standing on a chair”. Or one could think that the little boy is screaming and is ready to attack the cake in front of him. Whatever the case may be, we are always subconsciously classifying subjects and objects in the world around us and labelling them, even unintentionally at times.



This is what we subconsciously do all day. We see, label, make predictions, and recognize patterns every day. But how do we do that? How is it that we can interpret everything that we see?



It took nature over 500 million years to create a system to do this. The collaboration between the eyes and the brain, called the primary visual pathway, is the reason we can make sense of the world around us. While vision starts in the eyes, the actual interpretation of what we see happens in the brain, in the primary visual cortex, as illustrated below.

The illustration makes it easier to understand the visual pathway of perception and interpretation in the human brain. When we see an object, the light receptors in our eyes send signals via the optic nerve to the primary visual cortex, where the input is processed. The primary visual cortex makes sense of what the eye sees. All of this seems very natural to us. We barely even think about how special it is that we are able to recognise all the objects and people we see in our lives. In truth, however, the deeply complex hierarchical structure of neurons and connections in the brain plays a major role in this process of remembering and labelling objects.

Think about how we learned things for the first time in our lives, for example, identifying and distinguishing an umbrella, a duck, a lamp, a candle, or a book. In the beginning, our parents or our teachers told us the names of the objects in our direct environment. We learned from the examples that were given to us. Slowly but surely, we started to recognise certain things more and more often in our environment. They became so common that the next time we saw them, we would instantly know what the name of that object was. This is how they became an integral part of our model of the world.

Similar to how a child learns to recognise objects, we need to show an algorithm millions of pictures before it is able to generalise from the input and make predictions for images it has never seen before. However, computers ‘see’ in a different way than we do. Their world consists only of numbers. The good news is that every image can be represented as a 2-dimensional array of numbers, known as pixels. The fact that computers perceive images differently doesn’t mean we can’t train them to recognise patterns as we do. We just have to think of what an image is in a different way, as shown in the illustration below:
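
In code, the same idea can be sketched in a few lines. This is only an illustration: the pixel values are made up, and a plain nested Python list stands in for whatever array type an image library would actually use.

```python
# A tiny 4 x 4 grayscale "image": each number is one pixel's brightness,
# from 0 (black) to 255 (white). The values are made up for illustration.
image = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [255, 255, 0,   0],
    [255, 255, 0,   0],
]

height = len(image)            # number of rows
width = len(image[0])          # number of columns
top_right_pixel = image[0][3]  # row 0, column 3

print(height, width, top_right_pixel)
```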



Here the computer sees the above image not as a cat but as an array of numbers, and it must learn to pick out local distinguishing features within that array. For a cat, such features could be its prominent whiskers together with its facial structure and body posture, which distinguish it probabilistically from other creatures such as dogs, or from inanimate objects such as hats or mugs, as shown in the diagram. To actually make the computer recognise this image as that of a cat, we have to give it an effective and efficient algorithm for identifying such objects in any given image. For this, we use a specific type of Artificial Neural Network called a Convolutional Neural Network (CNN). Its name stems from one of the most important operations in the network: convolution, a term that has a subtly different meaning in computational mathematics than in everyday language.

In everyday language, convolution brings to mind a spiral, helix-shaped, twisted coil-like structure, or refers to things with intricate structural and hierarchical complexity. In computational mathematics, however, a convolution is a mathematical operation on two functions (f and g) that produces a third function expressing how the shape of one is modified by the other. The term refers both to the resulting function and to the process of computing it. Convolution is closely related to cross-correlation: for discrete, real-valued functions, the two differ only in a reversal of one of the functions. For continuous functions, the cross-correlation operator is the adjoint of the convolution operator. In CNNs, convolutional filtering plays an important role in edge detection and related steps of image and text recognition.
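
The relationship between convolution and cross-correlation is easier to see on a small example. The sketch below is a minimal pure-Python illustration for 1-dimensional, real-valued lists. The function names and the sample values are our own, and only positions where the kernel fully overlaps the signal are computed.

```python
def cross_correlate(signal, kernel):
    """Slide the kernel over the signal without flipping it (full-overlap positions only)."""
    positions = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(positions)
    ]


def convolve(signal, kernel):
    """Discrete convolution: the same sliding sum, but with the kernel reversed."""
    return cross_correlate(signal, kernel[::-1])


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]

correlated = cross_correlate(signal, kernel)  # [-2, -2, -2]
convolved = convolve(signal, kernel)          # [2, 2, 2]
print(correlated, convolved)
```

Reversing the kernel is the only difference between the two functions. The sliding-window operation described later in this article does not flip the filter, so strictly speaking it is the cross-correlation form. Because the filter values in a CNN are learned, the distinction makes no practical difference.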


As highlighted earlier, Convolutional Neural Networks are inspired by the brain and primarily modelled on the structural framework of the human visual system. Research in the 1950s and 1960s by D. H. Hubel and T. N. Wiesel on the brains of mammals suggested a new model for how mammals visually perceive the world around them. In this model they showed that the cat and monkey visual cortexes include neurons that each respond only to stimuli in a small region of the visual field.

In their paper, they described two basic types of visual neuron cells in the brain that each act in a different way: Simple cells (S cells) and Complex cells (C cells). The simple cells activate, for example, when they identify basic shapes such as lines in a fixed area and at a specific angle. The complex cells have larger receptive fields and their output is not sensitive to the specific position in the field.

In other words, the complex cells continue to respond to a certain stimulus even when its absolute position on the retina changes. ‘Complex’ here basically means more flexible. In biological visual systems, the receptive field of a single sensory neuron is the specific region of the retina in which a stimulus will affect the firing of that neuron (that is, will activate it). How specific such responses can become is illustrated by the famous ‘Halle Berry neurons’, which were reported in the medial temporal lobe rather than in V1 and which fire in response to one very specific stimulus. Neighbouring sensory neurons have similar receptive fields, and these fields overlap and work together, as shown in the diagram below.



Furthermore, the concept of hierarchy also plays a significant role in the brain. Information is stored in sequences of patterns, in sequential order. The neocortex, which is the outermost layer of the brain, stores information hierarchically. It is stored in cortical columns, or uniformly organised groupings of neurons in the neocortex. In 1980, a researcher called Fukushima proposed a hierarchical neural network model. He called it the Neocognitron. This model was inspired by the concepts of the Simple and Complex cells. The Neocognitron was able to recognise patterns by learning about the shapes of objects. Later, in 1998, building on this very model, LeCun, Bottou, Bengio and Haffner published a landmark paper on Convolutional Neural Networks. Their network was called LeNet-5 and was able to classify handwritten digits.

The LeNet Architecture

LeNet was one of the very first Convolutional Neural Networks and helped propel the field of Deep Learning. This pioneering work by Yann LeCun was named LeNet-5 after many previous successful iterations since the year 1988. At that time the LeNet architecture was used mainly for character recognition tasks such as reading zip codes, digits, etc.

Below, we will develop an intuition of how the LeNet architecture learns to recognise images. Several new architectures have been proposed in recent years as improvements over LeNet, but they all use its main concepts and are much easier to understand once we have a clear understanding of LeNet itself.

The Convolutional Neural Network in the figure below is similar in architecture to the original LeNet and classifies an input image into four categories: dog, cat, boat or bird (the original LeNet was used mainly for character recognition tasks). As the figure shows, on receiving a boat image as input, the network correctly assigns the highest probability to boat (0.94) among all four categories. The sum of all probabilities in the output layer should be one (explained later in this post).
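
The article does not say which function turns the network's raw scores into probabilities. A common choice is the softmax function, so the sketch below assumes softmax to show why the output probabilities sum to one. The raw scores are hypothetical and are not the values behind the 0.94 in the figure.

```python
import math


def softmax(scores):
    """Turn raw scores into positive values that sum to 1."""
    highest = max(scores)
    # Subtracting the highest score keeps exp() numerically stable
    # and does not change the result.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for the four classes in the example.
classes = ["dog", "cat", "boat", "bird"]
scores = [0.5, 1.0, 4.0, 0.2]

probabilities = softmax(scores)
predicted = classes[probabilities.index(max(probabilities))]

print(predicted, round(sum(probabilities), 6))
```

The class with the largest raw score always receives the largest probability, so picking the most probable class is the same as picking the highest score.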



There are four main operations in the CNN depicted in the figure above:

  • Convolution
  • Non-Linearity (ReLU)
  • Pooling or Sub Sampling
  • Classification (Fully Connected Layer)

These operations are the basic building blocks of every Convolutional Neural Network, so understanding how they work together is an important step towards developing a sound understanding of CNNs.

Framework Components of CNNs

Having discussed the history and the first development of CNNs, we can now focus on how CNNs work and what components constitute their primary structural and functional frameworks. CNNs primarily consist of two components: the Hidden Layers (the Feature Extraction Layer) and the Classification Layer.

1. The Hidden Feature Extraction Layer

In this part, the network is basically programmed to perform a series of convolutions and pooling operations during which the features are detected. So, if you had a picture of a zebra, this is the part where the network would recognise its stripes, two ears, and four legs.

The primary purpose of Convolution in case of a CNN is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.

As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1. (Note that for a grayscale image, pixel values range from 0 to 255. The green matrix below is a special case where pixel values are only 0 and 1.)

Also, consider another 3 x 3 matrix as shown below:

Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the Figure below:

Let us now take a moment to understand how the computation above is done. We slide the orange matrix over our original image (green) 1 pixel at a time (this step size is called the ‘stride’) and, for every position, we compute the element-wise multiplication between the two matrices and add up the products to get a single integer, which forms one element of the output matrix (pink). Note that the 3×3 matrix “sees” only a part of the input image at each position.

In CNN terminology, the 3×3 matrix is called a ‘filter’ or ‘kernel’ or ‘feature detector’, and the matrix formed by sliding the filter over the image and computing the dot product is called the ‘Convolved Feature’ or ‘Activation Map’ or the ‘Feature Map’. It is important to note that filters act as feature detectors on the original input image.
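
The sliding computation described above can be sketched in plain Python. The 5 x 5 image and 3 x 3 filter values below are assumptions chosen for illustration (they are not read from the article's figures), and the function name convolve2d is our own.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over the image and sum the element-wise products."""
    k = len(kernel)  # the kernel is k x k
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride  # top-left corner of the current window
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative values (assumed): a binary 5 x 5 image and a 3 x 3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

With a stride of 1, the 3 x 3 filter fits into the 5 x 5 image at 3 x 3 positions, which is why the feature map has 9 elements.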

Similarly, we perform numerous such convolutions on our input, where each operation uses a different filter. This results in different feature maps. In the end, we take all of these feature maps and put them together as the final output of the convolution layer. Just like in any other Neural Network, we use an activation function to make our output non-linear. In a Convolutional Neural Network, the output of the convolution is passed through this activation function, typically the ReLU function, which replaces every negative value with zero.
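
As a minimal sketch of the activation step, ReLU can be applied to every element of a feature map as follows. The feature map values are made up for illustration.

```python
def relu(value):
    """ReLU keeps positive values and replaces negative ones with zero."""
    return max(0, value)


def apply_relu(feature_map):
    """Apply ReLU to every element of a 2D feature map."""
    return [[relu(value) for value in row] for row in feature_map]


# Made-up feature map containing some negative responses.
feature_map = [
    [3, -1, 2],
    [-4, 0, 5],
    [1, -2, -3],
]

activated = apply_relu(feature_map)
print(activated)  # [[3, 0, 2], [0, 0, 5], [1, 0, 0]]
```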

The stride is the size of the step the convolution filter moves each time. The stride is usually 1, meaning the filter slides pixel by pixel. With a larger stride, the filter slides over the input at a larger interval, so there is less overlap between neighbouring positions and the resulting feature map is smaller, as shown below.



Because a filter larger than 1×1 always produces a feature map that is smaller than the input, we have to do something to prevent our feature map from shrinking. This is where we use something called ‘Padding’.

A layer of zero-value pixels is added around the input, so that our feature map will not shrink. In addition to keeping the spatial size constant after the convolution, padding can also improve performance, because pixels at the border are covered by the filter more often, and it helps make sure the kernel and stride fit the input.
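
The effect of stride and padding on the size of the feature map can be checked with a short sketch. The helper names zero_pad and output_size are our own, and the size formula used is the standard one for a square input and a square filter, with any leftover pixels that do not fit a full step being dropped.

```python
def zero_pad(image, padding=1):
    """Surround a 2D image with `padding` rings of zero-valued pixels."""
    padded_width = len(image[0]) + 2 * padding
    padded = [[0] * padded_width for _ in range(padding)]        # top rows
    for row in image:
        padded.append([0] * padding + list(row) + [0] * padding)
    padded.extend([0] * padded_width for _ in range(padding))    # bottom rows
    return padded


def output_size(input_size, kernel_size, stride=1, padding=0):
    """Side length of the feature map for a square input and a square kernel."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


image = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
]
padded = zero_pad(image)  # 3 x 3 becomes 5 x 5

print(output_size(5, 3))             # 3: no padding, the map shrinks
print(output_size(5, 3, padding=1))  # 5: padding keeps the size
print(output_size(5, 3, stride=2))   # 2: a larger stride shrinks it further
```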

It is common to add a ‘Pooling Layer’ after a convolution layer, in between successive convolution layers. The function of pooling is to progressively reduce the spatial size of the feature maps, which reduces the number of parameters and the amount of computation in the network. This shortens the training time and helps control overfitting.

The most frequent type of pooling is max pooling, which takes the maximum value in each window. These window sizes need to be specified beforehand. This decreases the feature map size while at the same time keeping the significant information as depicted below.
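
Max pooling can be sketched in a few lines of plain Python. The example assumes a 2 x 2 window that moves 2 pixels at a time (a common choice, not one stated in the article), and the feature map values are made up.

```python
def max_pool(feature_map, window=2, stride=2):
    """Keep only the largest value inside each window of the feature map."""
    out_rows = (len(feature_map) - window) // stride + 1
    out_cols = (len(feature_map[0]) - window) // stride + 1
    pooled = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride  # top-left corner of the current window
            row.append(max(
                feature_map[top + i][left + j]
                for i in range(window)
                for j in range(window)
            ))
        pooled.append(row)
    return pooled


# Made-up 4 x 4 feature map.
feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool(feature_map)
print(pooled)  # [[6, 8], [3, 4]]
```

The 4 x 4 map is reduced to 2 x 2, yet the strongest response in each region survives.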




Thus, before moving on to classification, we can summarise the five most important parameters to consider in the feature extraction layer as follows:

  • The kernel size
  • The filter count (that is, how many filters do we want to use)
  • Stride (how big are the steps of the filter)
  • Padding
  • Max Pooling

2. The Classification Layer

After the convolution and pooling layers, our classification part consists of a few fully connected layers. However, these fully connected layers can only accept one-dimensional data. To convert our 3D data to 1D, we flatten it, for example with the flatten operation that Python libraries such as NumPy or Keras provide. This essentially arranges our 3D volume into a 1D vector.

The last layers of a Convolutional Neural Network are fully connected layers. Neurons in a fully connected layer have full connections to all the activations in the previous layer. This part is in principle the same as a regular Multi-Layer Perceptron Neural Network as discussed before.
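
The flattening step and a single fully connected layer can be sketched as follows. This is a pure-Python illustration rather than the API of any particular library. The volume, weights and biases are hypothetical values, whereas in a real network the weights and biases would be learned during training.

```python
def flatten(volume):
    """Arrange a 3D volume (nested lists) into a single 1D list."""
    return [value for matrix in volume for row in matrix for value in row]


def fully_connected(inputs, weights, biases):
    """Each output neuron is a weighted sum of every input, plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


# Two 2 x 2 feature maps stacked into a 2 x 2 x 2 volume (made-up values).
volume = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
vector = flatten(volume)  # [1, 2, 3, 4, 5, 6, 7, 8]

# Hypothetical weights and biases for two output neurons.
weights = [
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
]
biases = [0, 1]

outputs = fully_connected(vector, weights, biases)
print(vector, outputs)  # outputs is [9, 10]
```
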
Having laid the foundational framework of CNNs, we can now look forward to training the network and making certain advanced modifications to maximise its efficiency in real-world scenarios. Training a CNN works in the same way as training a regular neural network, using backpropagation and gradient descent. However, it is a bit more mathematically complex because of the convolution operations, and it will be covered in detail in our next blog article.

Thus, in summary, CNNs are especially useful for image/video classification and character recognition. They have two main parts: a feature extraction part and a classification part. The main special technique in CNNs is convolution, where a filter slides over the input and combines the input values with the filter values to produce the feature map. In the end, our goal is to feed new images to our CNN so that it can give a probability for the object it thinks it sees, or describe an image with appropriate text.


So, stay tuned as we pursue that goal in our next blog article, where we shall take you above and beyond the latest breakthrough applications of CNNs!
