Deep Learning and Convolutional Neural Networks: RSIP Vision Blogs


From robots to drug design, it’s hard to miss “deep learning” in the news and in our office lately. Indeed, RSIP Vision is utilizing deep learning and convolutional neural networks in our classification work. What is deep learning, and why have we been hearing about it so much? This article is aimed at data scientists and machine-learning practitioners who have heard a little about deep learning and want to know what the buzz is all about. Irrespective of your background, hopefully you will see how deep learning might be applied to your field. At the very least, you will be better equipped to take media reports about deep learning with a grain of salt!

Deep Learning, what is it?

Briefly, deep learning is:

  1. A host of statistical machine learning techniques
  2. A means of automatically learning feature hierarchies
  3. Usually based on artificial neural networks

That is the gist. For something that appears to be fairly straightforward, there’s a lot of hype in the news about what has been achieved and what might be done with deep learning techniques in the future. Let’s start with an example of what has already been achieved to show why it has been garnering so much attention.

How is it Different?

To answer this question, like Dallin Akagi does in his excellent post on the topic, we’ll discuss deep learning in the context of the 2013 Kaggle-hosted quest to save the whales. The challenge was: given a set of 2-second sound clips from buoys in the ocean, classify each sound clip as containing a call from a North Atlantic Right whale or not. The idea behind this task is that if we can detect the whales’ migration paths by picking up their calls, we can route shipping traffic to avoid them. This would aid all-important whale preservation, as well as working towards effective and safe shipping.

In an interview after the competition, the winners emphasized the importance of feature generation, also called feature engineering. It is well known that data scientists spend most of their time, effort, and creativity on engineering good features. On the other hand, they spend relatively little time running the actual machine learning algorithms. A toy example of an engineered feature would involve multiplying two columns and including the resulting column as an additional descriptor (feature) of your data. In the case of the whales, the winning team transformed each sound to its spectrogram form and built features that describe how well the spectrogram matched some example templates (template matching). Next, they iterated on new features that would help them correctly classify the examples that the previous set of features got wrong. As you can imagine, this process takes time, effort and considerable expertise.
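
To make the two kinds of engineered feature above concrete, here is a minimal sketch in plain Python. It is not the winning team's code: the column names, the tiny "spectrogram" and the scoring rule (negative sum of squared differences) are all assumptions chosen for illustration.

```python
# Toy feature engineering. All names and values here are made up for illustration.

rows = [
    {"width": 2.0, "height": 3.0},
    {"width": 4.0, "height": 0.5},
    {"width": 1.5, "height": 2.0},
]


def add_product_feature(rows, col_a, col_b, new_col):
    """Return copies of the rows with new_col = col_a * col_b added."""
    return [{**row, new_col: row[col_a] * row[col_b]} for row in rows]


def best_template_match(image, template):
    """Slide the template over the image and return (score, row, col) of the best fit.

    The score is the negative sum of squared differences, so 0 means an exact match.
    This scoring rule is an assumption; other similarity measures are possible.
    """
    th, tw = len(template), len(template[0])
    best = None
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            ssd = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(th)
                for j in range(tw)
            )
            if best is None or -ssd > best[0]:
                best = (-ssd, r, c)
    return best


# Feature 1: the product of two existing columns.
augmented = add_product_feature(rows, "width", "height", "area")
for row in augmented:
    print(row)

# Feature 2: how well a (hypothetical) spectrogram matches an example template.
spectrogram = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 3, 4, 0, 0],
    [0, 0, 0, 0, 0],
]
template = [[1, 2], [3, 4]]
print(best_template_match(spectrogram, template))
```

The match score could then be appended to each example as one more column, exactly like the product feature. Choosing which templates to use is where the human expertise comes in.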


Deep Learning approaches


The above figure shows the final standings for the competition. As you can see, the results between the top contenders were tight, and the winning team’s focus on feature engineering evidently paid off. But why is it that several deep learning approaches were so competitive despite using as few as one quarter of the submissions? One immediate answer arises from the unsupervised feature learning that deep learning can perform. Instead of relying on experience, intuition, and trial-and-error, unsupervised feature learning techniques use computational time and an abundance of data to develop (mostly) new ways of representing the data. The end goal is the same, but the process to get there can be dramatically different.

This novel approach of forgoing manual feature engineering was also echoed in a Competitive Kaggle Data Science panel. Top Kagglers asserted: “Try deep learning to do automatic feature engineering – automation is good [for industry]”, “The only way to stand out in data science competitions is to have better features” and finally “I use all: First GLM, RF, GBM, then try to beat them all with Deep Learning”.

Finally, in an interview with the winner of the Merck Kaggle Challenge, George Dahl (supervised by Geoffrey Hinton, see below) described their approach:

“…our goal was to demonstrate the power of our models, we did no feature engineering and only minimal preprocessing. The only preprocessing we did was occasionally, for some models, to log-transform each individual input feature/covariate. Whenever possible, we prefer to learn features rather than engineer them.”


Deep Learning Components


However, a common misconception is that ‘deep learning’ and ‘unsupervised learning’ are the same concept. There are various unsupervised learning techniques that have nothing to do with neural networks at all, and neural networks have been used for supervised learning tasks for decades. Moreover, deep learning methods have been combined successfully with reinforcement learning and semi-supervised learning in recent years, even though those two paradigms have themselves been in use for decades.

As Dallin Akagi succinctly summarizes,

“The takeaway is that deep learning excels in tasks where the basic unit, a single pixel, a single frequency, or a single word/character has little meaning in and of itself, but a combination of such units has a useful meaning.”

Deep learning can learn such useful combinations of values without human intervention. The ubiquitous example used when discussing deep learning’s ability to learn features from data is the MNIST dataset of handwritten digits. When presented with tens of thousands of handwritten digits, a deep neural network can learn that it is useful to look for loops and lines when trying to classify the digits.


Chart showing how deep neural networks learn handwriting


The raw input digits are on the left of the image. On the right, we see graphical representations of the learned features (filters). Essentially, the network learns to detect lines and loops.

What Garnered Renewed Attention for Neural Networks?

The question many people ask at this point is: isn’t this just the well-known neural network returning to the foreground?

Neural networks gained tremendous popularity back in the 1980s, peaking in the early 1990s. Their use slowly declined after that. There was a lot of buzz and some high expectations, but in the end the neural network models did not live up to their potential. So, what was the problem? The answer lies in the ‘deep’ component of deep learning.

Standard neural networks are composed of layers of “neurons”. These layers are usually feed-forward only and are trained by example (for classification or regression). Primate brains do a similar thing in the visual cortex, so the hope was that using more layers in a neural network would allow it to learn better models. However, researchers found that training models with many layers did not work in practice. The general understanding came to be that only shallow networks (1-2 layers) can be trained successfully. The standard shallow neural network has only a single layer of data representation (see figure below). Learning in a deep neural network, one with more than one or two layers of data representation, appeared to be infeasible. In reality, deep learning has been around for as long as neural networks have – we just couldn’t get it to work.


Shallow Neural Network Diagram


Deep Neural Network Diagram


Deep neural networks have more than two hidden layers: simple.
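
The difference between the two diagrams can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the layer sizes, the random weights and the sigmoid activation are arbitrary choices, and nothing here is trained.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def build_network(sizes, seed=0):
    """sizes lists the units per layer, e.g. [inputs, hidden..., outputs]."""
    rng = random.Random(seed)
    return [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def forward(x, layers):
    """Feed x through each layer in turn: a weighted sum followed by a sigmoid."""
    activation = x
    for weights, biases in layers:
        activation = [
            1.0 / (1.0 + math.exp(-(sum(w * a for w, a in zip(row, activation)) + b)))
            for row, b in zip(weights, biases)
        ]
    return activation


shallow = build_network([4, 3, 1])        # one hidden layer
deep = build_network([4, 5, 5, 5, 1])     # three hidden layers

x = [0.5, -1.0, 0.25, 2.0]
print("hidden layers:", len(shallow) - 1, "output:", forward(x, shallow))
print("hidden layers:", len(deep) - 1, "output:", forward(x, deep))
```

The forward pass is the same loop in both cases. The deep network simply runs it more times, so each layer works on the previous layer's output rather than on the raw input.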


The Breakthrough

The breakthrough came in 2006, when three separate groups developed methods to overcome the difficulties encountered when trying to train deep neural networks. The heads of these three groups (Geoffrey Hinton, Yann LeCun and Yoshua Bengio) are recognized as the fathers of the age of deep learning. Together, they have ushered in a new epoch in machine learning, bringing new life into neural networks when many had given up on them years ago. Showing the stature of their work, today Geoffrey Hinton divides his time between Google and the University of Toronto; Yann LeCun is Director of AI Research at Facebook; and Yoshua Bengio holds a position as research chair for Artificial Intelligence at the University of Montreal.

So, how did they get deep neural networks to work? Before 2006, the earliest layers in a deep neural network simply were not able to learn useful representations of the data. In many cases they failed to learn anything at all. Instead, the parameters of these layers stayed close to their random initializations. This was due to the nature of the training algorithm for neural networks. However, using different (and mostly novel) techniques, each of these three groups was able to get these early layers to learn useful representations. This resulted in much more powerful neural networks. A full description of their contributions is beyond the scope of this post. Briefly, some of the things that made deep learning possible were: using a simple optimizer, namely stochastic gradient descent; using unsupervised data to pre-train models to automate feature extraction; improvements to the neural networks themselves (transfer functions and initialization); using larger and larger data sets; and finally using GPUs to accommodate the considerable computational costs incurred by deep neural network models combined with large datasets.
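
One commonly cited reason the earliest layers struggled, which the paragraph above does not name, is the vanishing gradient effect: with sigmoid units, the error signal shrinks each time it is passed back through a layer. The sketch below illustrates this under simplifying assumptions (one unit per layer, weights drawn from [-1, 1]); it is not a reconstruction of any of the three groups' analyses.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient_through_chain(x, weights):
    """Return d(output)/d(input) for a chain of one-unit sigmoid layers.

    Each layer computes a = sigmoid(w * a_prev). By the chain rule, the gradient
    is the product of w * sigmoid'(z) over all layers.
    """
    activation, gradient = x, 1.0
    for w in weights:
        activation = sigmoid(w * activation)
        # sigmoid'(z) = a * (1 - a), which is never larger than 0.25
        gradient *= w * activation * (1.0 - activation)
    return gradient


rng = random.Random(0)
gradients = {}
for depth in (1, 2, 4, 8):
    weights = [rng.uniform(-1.0, 1.0) for _ in range(depth)]
    gradients[depth] = abs(input_gradient_through_chain(0.5, weights))
    print(f"depth {depth}: |gradient| = {gradients[depth]:.3e}")
```

Because every factor is at most 0.25 in magnitude here, the gradient reaching the first layer is bounded by 0.25 raised to the depth. Parameter updates are proportional to that gradient, which is consistent with early layers staying close to their random initializations.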


Deep Neural Networks Learn Facial Features


Each successive layer in a neural network uses features from the previous layer to learn more complex features.

Now that the aforementioned problem has been circumvented, we ask: what is it that these neural networks learn? Let’s consider the above example. At the lowest level, the network fixates on patterns of local contrast as important. The following layer is then able to use those patterns of local contrast to fixate on things that resemble eyes, noses, and mouths. Finally, the top layer is able to apply those facial features to face templates. A deep neural network is capable of composing features of increasing complexity in each of its successive layers.

It is this automated learning of data representations and features that the hype is all about. Such an application of deep neural networks has seen models that successfully learn useful hierarchical representations of images, audio and written language. These learned feature hierarchies in these domains can be construed as:

Image recognition: Pixel → edge → texton → motif → part → object

Text: Character → word → word group → clause → sentence

Speech: Sample → spectral band → sound →…→ phone → phoneme → word

Some of these used to be considered hard problems in machine learning, which is why deep neural networks attract so much attention. It is a safe bet that deep learning will be the secret ingredient in more projects in the future.

In summary, these breakthroughs have enabled deep neural networks that are able to automatically learn rich representations of data. This accomplishment has proven particularly useful in areas like computer vision, speech recognition, and natural language processing.

Convolutional Neural Networks

Convolutional neural networks (CNNs), a variant of deep learning, were motivated by neurobiological research on locally sensitive and orientation-selective nerve cells in the visual cortex. Convolutional neural networks are a special kind of multi-layer neural network with the following characteristic: a CNN is a feed-forward network that can extract topological properties from an image. Like almost every other neural network, it is trained with a version of the back-propagation algorithm. CNNs are designed to recognize visual patterns directly from pixel images with little to no preprocessing. They can recognize patterns with extreme variability, such as handwritten text and natural images. A CNN typically consists of convolution layers, subsampling layers, and a fully connected layer. In a CNN, successive layers of convolution and subsampling are typically alternated. A schematic diagram of a CNN is shown below.


Early Convolution Network Design - LeNet-5
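
The three layer types described above can be sketched in plain Python. This is a minimal illustration, not LeNet-5 itself: the 5x5 image, the hand-written edge-detecting kernel, the choice of max pooling for the subsampling step and the fully connected weights are all assumptions, whereas a real CNN learns its kernels and weights by back-propagation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image with no padding.

    As is usual in CNNs, the kernel is not flipped, so strictly speaking this
    is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(len(image[0]) - kw + 1)
        ]
        for r in range(len(image) - kh + 1)
    ]


def subsample(feature_map, size=2):
    """Max pooling over non-overlapping size x size windows (one common choice)."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


def fully_connected(feature_map, weights, bias):
    """Flatten the feature map and take one weighted sum plus a bias."""
    flat = [value for row in feature_map for value in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


# A 5x5 image that is dark on the left and bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 1], [-1, 1]]

features = convolve2d(image, kernel)   # 4x4 map, strongest where the edge is
pooled = subsample(features)           # 2x2 map after subsampling
score = fully_connected(pooled, weights=[0.5, 0.5, 0.5, 0.5], bias=-1.0)

print(features)
print(pooled)
print(score)
```

Stacking several convolution and subsampling pairs before the fully connected layer gives the alternating structure shown in the diagram.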


Convolutional Neural Networks have only recently become mainstream in computer vision applications. Over the past 3 years CNNs have achieved state-of-the-art performance in a broad array of high-level computer vision problems, including image classification, object detection, fine-grained categorization, image segmentation, pose estimation, and OCR in natural images, among others. Usually in these works, the CNNs are trained in an end-to-end manner and deliver strikingly better results than systems relying on carefully engineered representations, such as SIFT or HOG features. This success can be partially attributed to the built-in invariance of CNNs to local image transformations, which underpins their ability to learn hierarchical abstractions of the data.
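
One source of this invariance is commonly taken to be the subsampling step. The sketch below illustrates the idea in one dimension with max pooling. The signals are made up, and real networks combine this with shared convolution weights.

```python
def max_pool_1d(values, size=2):
    """Keep the largest value in each non-overlapping window of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


original = [0, 0, 5, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 5, 0, 0, 0, 0]          # same pattern, moved one step right
shifted_further = [0, 0, 0, 0, 5, 0, 0, 0]  # moved two steps right

print(max_pool_1d(original))
print(max_pool_1d(shifted))
print(max_pool_1d(shifted_further))
```

The one-step shift stays inside the same pooling window, so the pooled output is unchanged. The two-step shift crosses into the next window and the output changes, which is why the invariance is described as local.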

Advanced Object Recognition using CNNs


RSIP Vision is currently working with CNNs in our classification work, particularly with white blood cells. We have noticed that this method produces faster and more accurate results than other classification techniques. We are advancing our work in deep learning and convolutional neural networks and look forward to passing on the advantages of these networks to our clients. Stay tuned!