Scope and approach

No taxonomy of Deep learning models exists, and I do not attempt to create one here. Instead, I explore the evolution of Deep learning models by loosely classifying them into Classical Deep learning models and Emerging Deep learning models. This is not an exact classification. We also embark on this exercise with our goal in mind, i.e. the application of Deep learning models to Smart cities from the perspective of Security (Safety, Surveillance). From the standpoint of Deep learning models, we are interested in ‘Human activity recognition’ and its evolution. This will be explored in subsequent papers.

In this paper, we list the evolution of Deep Learning models and recent innovations. Deep Learning is a fast moving topic and we see innovation in many areas such as Time series, hardware, RNNs etc. Where possible, I have included links to excellent materials / papers which can be used to explore further. Any comments and feedback are welcome, and I am happy to cross reference you if you can add to specific areas. Finally, I would like to thank Lee Omar, Xi Sizhe and Ben Blackmore, all of Red Ninja Labs, for their feedback.

Deep Learning – learning through layers

Deep learning is often thought of as a set of algorithms that ‘mimic the brain’. A more accurate description would be a set of algorithms that ‘learn in layers’. Deep learning involves learning through layers, which allows a computer to build a hierarchy of complex concepts out of simpler concepts. Deep learning algorithms apply to many areas including Computer Vision, Image recognition, pattern recognition, speech recognition, behaviour recognition etc.

To understand the significance of Deep Learning algorithms, it’s important to understand how Computers think and learn. Since the early days, researchers have attempted to create computers that think. Until recently, this effort has been rules based, adopting a ‘top down’ approach. The Top-down approach involved writing enough rules for all possible circumstances. But this approach is obviously limited by the number of rules and by its finite rules base.

To overcome these limitations, a bottom-up approach was proposed. The idea here is to learn from experience. The experience was provided by ‘labelled data’. Labelled data is fed to a system and the system is trained based on the responses – leading to the field of Machine Learning. This approach works for applications like Spam filtering. However, most data (pictures, video feeds, sounds, etc.) is not labelled and if it is, it’s not labelled well.

The other issue is in handling problem domains which are not finite. For example, the problem domain in chess is complex but finite because there are a finite number of primitives (32 chess pieces) and a finite set of allowable actions (on 64 squares). But in real life, at any instant, we have a potentially very large, or even infinite, number of alternatives. The problem domain is thus very large.

A problem like playing chess can be ‘described’ to a computer by a set of formal rules. In contrast, many real world problems are easily understood by people (intuitive) but not easy to describe (represent) to a Computer (unlike Chess). Examples of such intuitive problems include recognizing words or faces in an image. Such problems are hard to describe to a Computer because the problem domain is not finite. Thus, the problem description suffers from the curse of dimensionality, i.e. when the number of dimensions increases, the volume of the space increases so fast that the available data becomes sparse. It is hard to train Computers on sparse data. Such scenarios are not easy to describe because there is not enough data to adequately represent the combinations spanned by the dimensions. Nevertheless, such ‘infinite choice’ problems are common in daily life.

Deep learning is thus concerned with ‘hard/intuitive’ problems which have few or no rules and high dimensionality. Here, the system must learn to cope with unforeseen circumstances without knowing the Rules in advance.
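A small worked example may make the sparsity argument concrete. The sketch below assumes, purely for illustration, that every dimension is split into 10 bins and that one million samples are available; it then reports the largest fraction of the space those samples could possibly touch. The function names are hypothetical.

```python
# Sketch of the curse of dimensionality: with a fixed number of samples,
# the fraction of the space we can observe collapses as dimensions are added.
# Assumptions (illustrative only): 10 bins per dimension, one million samples.

def grid_cells(bins_per_dim: int, dims: int) -> int:
    """Number of cells needed to tile a space with `dims` dimensions."""
    return bins_per_dim ** dims


def max_coverage(samples: int, bins_per_dim: int, dims: int) -> float:
    """Upper bound on the fraction of cells holding at least one sample."""
    return min(1.0, samples / grid_cells(bins_per_dim, dims))


for dims in (1, 3, 6, 10, 20):
    print(dims, grid_cells(10, dims), max_coverage(1_000_000, 10, dims))
```

Up to 6 dimensions the samples could in principle cover every cell; at 10 dimensions they can reach at most one cell in ten thousand.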

Feed forward back propagation network

The feed forward back propagation network is a model which mimics the neurons in the brain in a limited way. In this model:

a) Each neuron receives a signal from the neurons in the previous layer.

b) Each of those signals is multiplied by a weight value.

c) The weighted inputs are summed and passed through a limiting function which scales the output to a fixed range of values.

d) The output of the limiter is then broadcast to all of the neurons in the next layer.

The learning algorithm for this model is called Back Propagation (BP), which stands for “backward propagation of errors”. We apply the input values to the first layer, allow the signals to propagate through the network and read the output. A BP network learns by example, i.e. we must provide a learning set that consists of some input examples and the known correct output for each case. We use these input-output examples to show the network what type of behaviour is expected. The BP algorithm allows the network to adapt: the error between the actual and the expected output is propagated backwards through the network and used to adjust the weights. Each link between neurons has its own weight value, and the ‘intelligence’ of the network lies in the values of the weights. With each iteration of the errors flowing backwards, the weights are adjusted. The whole process is repeated for each of the example cases.

Thus, to detect an Object, Programmers would train a neural network by rapidly sending across many digitized versions of data (for example, images) containing those objects. If the network did not accurately recognize a particular pattern, the weights would be adjusted. The eventual goal of this training is to get the network to consistently recognize the patterns that we recognize (e.g. Cats).
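Steps (a) to (d) and the backward weight adjustment can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the sigmoid as the limiting function, the squared error measure, the learning rate, the network size and the logical OR learning set are all assumptions made for the example, and `TinyNet` is a hypothetical name.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Limiting function: squashes any value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """n_in inputs -> n_hidden sigmoid neurons -> 1 sigmoid output neuron."""

    def __init__(self, n_in: int = 2, n_hidden: int = 3, seed: int = 0):
        rng = random.Random(seed)
        # One row of weights per neuron; the last entry of each row is a bias.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        # Steps (a)-(d): weight each incoming signal, sum, squash, pass on.
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def train_step(self, x, target, lr: float = 0.5) -> None:
        hidden, out = self.forward(x)
        # Error at the output, scaled by the slope of the sigmoid.
        delta_out = (out - target) * out * (1.0 - out)
        # Propagate the error backwards to each hidden neuron.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        # Adjust every weight against its share of the error.
        for j, v in enumerate(hidden + [1.0]):
            self.w_out[j] -= lr * delta_out * v
        for row, delta in zip(self.w_hidden, delta_hidden):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * delta * v


def total_error(net, examples) -> float:
    """Sum of squared errors over the whole learning set."""
    return sum(0.5 * (net.forward(x)[1] - t) ** 2 for x, t in examples)


# Learning set: inputs paired with the known correct output (logical OR).
EXAMPLES = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

net = TinyNet()
error_before = total_error(net, EXAMPLES)
for _ in range(2000):
    for x, t in EXAMPLES:
        net.train_step(x, t)
error_after = total_error(net, EXAMPLES)
print(error_before, error_after)
```

Nothing in the code describes what OR means; the behaviour ends up stored entirely in the weight values, which is the sense in which the ‘intelligence’ of the network lies in its weights.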

Building a hierarchy of complex concepts out of simpler concepts

Deep learning involves learning through layers which allows a computer to build a hierarchy of complex concepts out of simpler concepts. This approach works for subjective and intuitive problems which are difficult to articulate. Consider image data. Computers cannot understand the meaning of a collection of pixels. Mappings from a collection of pixels to a complex Object are complicated. With deep learning, the problem is broken down into a series of hierarchical mappings – with each mapping described by a specific layer.

The input (representing the variables we actually observe) is presented at the visible layer. Then a series of hidden layers extracts increasingly abstract features from the input, with each layer concerned with a specific mapping. However, note that this process is not predefined, i.e. we do not specify what the layers select.

For example: From the pixels, the first hidden layer identifies the edges

From the edges, the second hidden layer identifies the corners and contours

From the corners and contours, the third hidden layer identifies the parts of objects

Finally, from the parts of objects, the fourth hidden layer identifies whole objects

 Image and example source: Yoshua Bengio book – Deep Learning

Classical Deep Learning Models

Based on the above intuitive understanding of Deep learning, we now explore Deep learning models in more detail. No taxonomy of Deep learning models exists. Hence, we loosely classify Deep learning models into Classical and Emerging. In this section, we discuss the Classical Deep learning models.

Autoencoders: Feed forward neural networks and Back propagation

Feed forward neural networks (with back propagation as a training mechanism) are the best known and simplest Deep learning models. Back propagation is based on the classical optimisation method of steepest descent. Autoencoders are a closely related class of feed forward networks, also trained with back propagation. They are simple learning circuits which aim to transform inputs into outputs with the least possible amount of distortion. While conceptually simple, they play an important role in machine learning. Autoencoders were first used in the 1980s by Hinton and others to address the problem of “backpropagation without a teacher”. In this case, the input data was used as the teacher and attempts were made to simulate the brain by mimicking Hebbian learning rules (cells that fire together, wire together). Feedforward Neural Networks with many layers are also referred to as Deep Neural Networks (DNNs). There are many difficulties in training deep feedforward neural networks.
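The idea of using the input as its own teacher can be shown with a very small linear autoencoder. The sketch below is only an illustration under stated assumptions: toy 2-D data lying on the line y = 2x, a single hidden value, plain gradient descent on the squared reconstruction error, and hand-picked starting weights. All names are hypothetical.

```python
# Sketch of a linear autoencoder: 2 inputs -> 1 hidden value -> 2 outputs.
# The input itself is the training target, so no labels are needed.
# Assumptions (illustrative): toy data on the line y = 2x, plain gradient
# descent on squared reconstruction error, hand-picked starting weights.

DATA = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def encode(enc, x):
    """Compress a 2-D input into a single hidden value."""
    return enc[0] * x[0] + enc[1] * x[1]


def decode(dec, h):
    """Expand the hidden value back into a 2-D reconstruction."""
    return [dec[0] * h, dec[1] * h]


def reconstruction_error(enc, dec, data):
    """Total squared distortion between inputs and their reconstructions."""
    total = 0.0
    for x in data:
        x_hat = decode(dec, encode(enc, x))
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total


def train(data, epochs=500, lr=0.05):
    enc, dec = [0.5, 0.5], [0.5, 0.5]
    for _ in range(epochs):
        for x in data:
            h = encode(enc, x)
            err = [a - b for a, b in zip(decode(dec, h), x)]
            # Gradient w.r.t. the hidden value, computed before dec changes.
            grad_h = err[0] * dec[0] + err[1] * dec[1]
            for i in range(2):
                dec[i] -= lr * err[i] * h
                enc[i] -= lr * grad_h * x[i]
    return enc, dec


error_before = reconstruction_error([0.5, 0.5], [0.5, 0.5], DATA)
enc, dec = train(DATA)
error_after = reconstruction_error(enc, dec, DATA)
print(error_before, error_after)
```

Because the toy data has only one real degree of freedom, a single hidden value is enough to reconstruct it with almost no distortion.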

Deep belief networks

To overcome these issues, in 2006 Hinton et al. at University of Toronto introduced Deep Belief Networks (DBNs), which are considered a breakthrough for Deep learning algorithms.

Here, the learning algorithm greedily trains one layer at a time, with layers created by stacked Restricted Boltzmann Machines (RBMs) instead of stacked autoencoders. The RBMs are stacked and trained bottom up in unsupervised fashion, followed by a supervised learning phase to train the top layer and fine-tune the entire architecture. The bottom up phase is agnostic with respect to the final task. A simple introduction to Restricted Boltzmann machines explains the intuition behind RBMs by considering some visible random variables (film reviews from different users) and some hidden variables (like film genres or other internal features). The task of the RBM is to find out, through training, how these two sets of variables are actually connected to each other.
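The film example and the greedy, layer-by-layer idea can be sketched as follows. Everything specific here is an assumption for illustration: the article does not name a training rule, so the sketch uses one-step contrastive divergence (CD-1), a commonly used choice for RBMs; the user/film matrix is made up; and the class and variable names are hypothetical. The supervised fine-tuning phase is not shown.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class RBM:
    """Binary RBM: visible units (films liked) and hidden units (e.g. genres)."""

    def __init__(self, n_visible: int, n_hidden: int, seed: int = 0):
        self.rng = random.Random(seed)
        self.w = [[self.rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_vis = [0.0] * n_visible
        self.b_hid = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(b + sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j, b in enumerate(self.b_hid)]

    def visible_probs(self, h):
        return [sigmoid(b + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i, b in enumerate(self.b_vis)]

    def cd1_update(self, v0, lr: float = 0.1) -> None:
        """One step of contrastive divergence (CD-1) on a single example."""
        h0 = self.hidden_probs(v0)
        h0_sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
        v1 = self.visible_probs(h0_sample)  # reconstruction of the input
        h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.w[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
        for i in range(len(v0)):
            self.b_vis[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.b_hid[j] += lr * (h0[j] - h1[j])

    def reconstruction_error(self, data) -> float:
        total = 0.0
        for v in data:
            v_hat = self.visible_probs(self.hidden_probs(v))
            total += sum((a - b) ** 2 for a, b in zip(v, v_hat))
        return total


# Hypothetical data: each row is one user, each column one film (1 = liked).
USERS = [[1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0],
         [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 1, 1]]

rbm1 = RBM(n_visible=6, n_hidden=2)
error_before = rbm1.reconstruction_error(USERS)
for _ in range(500):
    for user in USERS:
        rbm1.cd1_update(user)
error_after = rbm1.reconstruction_error(USERS)

# Greedy stacking: the hidden activations of the trained first layer
# become the training data for a second RBM.
layer2_data = [rbm1.hidden_probs(user) for user in USERS]
rbm2 = RBM(n_visible=2, n_hidden=2, seed=1)
for _ in range(100):
    for row in layer2_data:
        rbm2.cd1_update(row)
print(error_before, error_after)
```

No labels are used at any point, which is what makes the bottom up phase agnostic with respect to the final task.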

Convolutional Neural Networks (CNN)

Convolutional Neural Networks are similar to Autoencoders and RBMs, but instead of learning a single global weight matrix between two layers, they aim to find a set of locally connected neurons through filters (kernels) (adapted from stackoverflow). CNNs are mostly used in image recognition. Their name comes from the “convolution” operator. A tutorial on feature extraction using convolution explains more. CNNs use data-specific kernels to find locally connected neurons. Similar to autoencoders or RBMs, they also translate many low-level features (e.g. user reviews or image pixels) to a compressed high-level representation (e.g. film genres or edges), but now weights are learned only from neurons that are spatially close to each other. Thus, a Convolutional Neural Network (CNN) is comprised of one or more convolutional layers followed by one or more fully connected layers, as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units.
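To illustrate how a filter looks only at spatially close pixels, here is a minimal sketch of the sliding-window operation used in a convolutional layer (strictly speaking a cross-correlation, as is usual in CNN practice). The toy image and the hand-written vertical-edge kernel are assumptions for the example; in a real CNN the kernel values would be learned.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), as a CNN layer does."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Each output value depends only on a small local patch.
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k_rows) for j in range(k_cols)))
        output.append(row)
    return output


# Toy 4x6 image: dark on the left, bright on the right.
IMAGE = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
# Hand-written vertical-edge filter; a real CNN would learn these 9 weights.
KERNEL = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = convolve2d(IMAGE, KERNEL)
print(feature_map)  # [[0, 3, 3, 0], [0, 3, 3, 0]]

# Weight count: one shared 3x3 filter vs. a fully connected layer that
# produces the same 2x4 output from the same 4x6 input.
shared_weights = 3 * 3
dense_weights = (4 * 6) * (2 * 4)
print(shared_weights, dense_weights)  # 9 192
```

The filter responds only where the dark-to-bright edge falls inside its window, and the same 9 weights are reused at every position, which is where the saving in parameters comes from.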

Recurrent neural networks (RNNs)

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behaviour. Unlike feedforward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs. This makes them applicable to tasks such as unsegmented connected handwriting recognition, where they have achieved the best known results.

The fundamental feature of a Recurrent Neural Network (RNN) is that the network contains at least one feed-back connection, so the activations can flow round in a loop. That enables the networks to do temporal processing and learn sequences, e.g., perform sequence recognition/reproduction or temporal association/prediction. Thus, feedforward networks use Directed acyclic graphs whereas Recurrent neural networks use Directed graphs (Digraphs) that may contain cycles. See also this excellent tutorial: Deep Dive into recurrent neural networks by Nikhil Buduma.
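The effect of a single feed-back connection can be seen in a one-unit sketch. The weights, the tanh activation and the input sequence below are illustrative assumptions, and no training is shown; the point is only that the state computed at one time step feeds into the next.

```python
import math


def rnn_states(inputs, w_in=0.5, w_rec=0.9, bias=0.0):
    """Run a one-unit recurrent network over a sequence, returning each state."""
    h = 0.0  # internal state, carried from one time step to the next
    states = []
    for x in inputs:
        # The feed-back connection: the previous state h feeds the new one.
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states


pulse = [1.0, 0.0, 0.0]
print(rnn_states(pulse))             # the state lingers after the input is gone
print(rnn_states(pulse, w_rec=0.0))  # no feed-back: nothing is remembered
```

With the recurrent weight set to zero the unit behaves like a feedforward neuron and forgets the pulse immediately; with the loop in place, the pulse keeps influencing later outputs.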

Emerging Deep learning models

In the above section, we saw the main Deep learning models. Deep learning techniques are rapidly evolving. Much of the innovation takes place in combining different forms of learning with existing Deep learning techniques. Learning algorithms fall into three groups with respect to the sort of feedback that the learner has access to: supervised learning, unsupervised learning and reinforcement learning. We also see emerging areas like the application of Deep Learning to Time series data. In the section below, we discuss Emerging Deep learning models. The list is not exhaustive because the papers and techniques selected are those most relevant to our problem domain (Application of Deep learning techniques for Smart cities with an emphasis on Human activity monitoring for Security/Surveillance).

Application of Reinforcement learning to Neural networks

Playing Atari with reinforcement learning presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. The method is applied to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. It outperforms all previous approaches on six of the games and surpasses a human expert on three of them. The paper Deep learning for reinforcement learning in Pacman (q-learning) addresses similar issues for the game Pacman. DeepMind (now a part of Google) has a number of papers on reinforcement learning. Sascha Lange and Martin Riedmiller apply Deep Auto-Encoder Neural Networks in Reinforcement Learning. The paper Recurrent Models of Visual Attention by Volodymyr Mnih, Nicolas Heess, Alex Graves and Koray Kavukcuoglu of Google DeepMind presents a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. It can be trained using reinforcement learning methods to learn task-specific policies.
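For readers new to Q-learning, the update rule at its core can be sketched on a toy problem. Note the substitution: the Atari paper estimates the value function with a convolutional network over raw pixels, whereas this sketch uses a plain lookup table on a hypothetical five-cell corridor, with made-up values for the learning rate and discount factor. It shows only the value update, not the deep network.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (0, 1)      # 0 = step left, 1 = step right


def step(state, action):
    """Deterministic toy environment: reward 1 only on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(1000):  # safety cap on episode length
            action = rng.choice(ACTIONS)  # explore at random (off-policy)
            nxt, reward, done = step(state, action)
            # Value of the best follow-up action; zero once the episode ends.
            future = 0.0 if done else max(q[nxt])
            q[state][action] += alpha * (reward + gamma * future
                                         - q[state][action])
            state = nxt
            if done:
                break
    return q


q_table = q_learning()
# Greedy policy read off the learned values (1 = step right).
policy = [max(ACTIONS, key=lambda a: q_table[s][a])
          for s in range(N_STATES - 1)]
print(policy)
```

Each table entry estimates the discounted future reward of taking an action in a state, which is the quantity the network in the paper outputs for each possible action.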

Combining modalities for Deep learning

Multimodality is also an area of innovation for Deep learning networks. Multimodal networks learn from different types of data sources, for example by training on video, audio and text together (usually video, audio and text are distinct training modes). The paper Multimodal deep learning proposes a deep autoencoder and considers the cross modality learning setting, where both modalities (video and audio) are present during feature learning but only a single modality is used for supervised training and testing.

In the paper Joint Deep Learning for Pedestrian Detection, Wanli Ouyang and Xiaogang Wang use CNNs but add a deformation layer to classify the parts. Feature extraction, deformation handling, occlusion handling, and classification are four important components in pedestrian detection. This paper proposes that they should be jointly learned in order to maximize their strengths through cooperation.

Deep Learning of Invariant Spatio-Temporal Features from Video uses the convolutional Restricted Boltzmann machine (CRBM) as a basic processing unit. Their model, the Space-Time Deep Belief Network (ST-DBN), alternates the aggregation of spatial and temporal information so that higher layers capture longer range statistical dependencies in both space and time.


Another area for innovation and evolution of Deep learning is Parallelization. Examples include Deep learning on Hadoop at Paypal and Massively Parallel Methods for Deep Reinforcement Learning.


Time Series

Because IoT/Smart cities data is mostly Time Series data, the use of Time Series with Deep Learning is also relevant to our work. In most cases, RNNs or DBNs are used not only to make a prediction but also (like NEST) to adapt. The paper Deep Learning for Time Series modelling forecasts demand, i.e. predicts energy loads across different network grid areas using only time and temperature data. The paper uses hourly demand for four and a half years from 20 different geographic regions, and similar hourly temperature readings from 11 zones. Time Series Classification Using Multi-Channels Deep Convolutional ... uses a deep learning framework for multivariate time series classification, and the paper by Gilberto Batres-Estrada uses Deep Learning for Multivariate Financial Time Series.
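Before any of these models can forecast, a raw series generally has to be reshaped into input/target pairs. The sketch below shows one common way of doing that with a sliding window; it is not taken from any of the papers above, and the function name and the load readings are hypothetical.

```python
def make_windows(series, window):
    """Split a series into (recent history, next value) training pairs."""
    pairs = []
    for start in range(len(series) - window):
        history = series[start:start + window]
        target = series[start + window]
        pairs.append((history, target))
    return pairs


# Hypothetical hourly load readings (arbitrary units).
hourly_load = [10, 12, 15, 14, 13, 16]
for history, target in make_windows(hourly_load, window=3):
    print(history, "->", target)
```

Each pair can then be fed to a network as one training example, with the history as input and the next reading as the value to predict.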

Cognitive computing

Ultimately, we can expect many services to be Cognitive.

An algorithmic framework will be called cognitive if it has the following properties: 1. it integrates knowledge from (a) various structured or unstructured sources, (b) past experience, and (c) current state, in order to reason with this knowledge as well as to adapt over time; 2. it interacts with the user (e.g., by natural language or visualization) and reasons based on such interactions; and 3. it can generate novel hypotheses and capabilities, and test their effectiveness. Source: Cognitive Automation of Data Science. Deep learning is increasingly becoming a part of Cognitive computing.


Some additional notes:

Deep Learning in contrast to other machine learning techniques

To recap, a more formal definition of Deep Learning – Deep Learning: a class of machine learning techniques, where many layers of information processing stages in hierarchical architectures are exploited for unsupervised feature learning and for pattern analysis/classification. The essence of deep learning is to compute hierarchical features or representations of the observational data, where the higher-level features or factors are defined from lower-level ones.

Historically, Deep Learning is a form of the fundamental credit assignment problem (Minsky, 1963). Here, Learning or credit assignment is about finding weights that make the neural network exhibit desired behaviour, such as driving a car. Deep Learning is about accurately assigning credit across many such stages. For historical reference, see Marvin Minsky’s papers.

Deep learning techniques can also be contrasted to more traditional machine learning techniques. When we represent some object as a vector of n elements, we say that this is a vector in n-dimensional space. Thus, dimensionality reduction refers to a process of refining data in such a way that each data vector x is translated into another vector x′ in an m-dimensional space (a vector with m elements), where m<n. The most common way of doing this is PCA (Principal Component Analysis). PCA finds the “internal axes” of a dataset (called “components”) and sorts them by their importance. The first m most important components are then used as the new basis. Each of these components may be thought of as a high-level feature, describing data vectors better than the original axes.

Both autoencoders and RBMs do the same thing. Taking a vector in n-dimensional space, they translate it into an m-dimensional one, trying to keep as much important information as possible and, at the same time, remove noise. If training of the autoencoder/RBM was successful, each element of the resulting vector (i.e. each hidden unit) represents something important about the object: the shape of an eyebrow in an image, the genre of a film, the field of study of a scientific article, etc. You take lots of noisy data as an input and produce much less data in a much more efficient representation. In the above image, we see an example of such a deep network. We start with ordinary pixels, proceed with simple filters, then with face elements and finally end up with entire faces. This is the essence of deep learning. (Adapted from stackexchange).

So, one could ask: If we already have techniques like PCA, why do we need autoencoders and RBMs? The reason is that PCA only allows a linear transformation of the data vectors. Autoencoders and RBMs, on the other hand, are non-linear by nature, and thus they can learn more complicated relations between visible and hidden units. Moreover, they can be stacked, which makes them even more powerful. Most problems addressed by Deep learning neural networks are not linear; if we were able to model the relationship between the independent and dependent variables linearly, classic regression techniques would apply. The paper Deep neural networks as recursive generalised linear models (RGLMs) explains the applicability of Deep Learning techniques to non-linear problems from a statistical standpoint.
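As a concrete instance of the n = 2, m = 1 case, the sketch below finds the most important component of a small 2-D dataset and projects each point onto it. The data points are made up, and power iteration is just one simple way to obtain the leading component; a library routine would normally be used instead.

```python
import math

# Hypothetical 2-D data lying close to the line y = 2x.
POINTS = [(-2.0, -4.1), (-1.0, -1.9), (0.0, 0.1), (1.0, 2.0), (2.0, 3.9)]


def center(points):
    """Subtract the mean so that the data is centred on the origin."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(2)]
    return [(p[0] - mean[0], p[1] - mean[1]) for p in points]


def covariance(centered):
    """2x2 sample covariance matrix of centred data."""
    n = len(centered)
    return [[sum(p[a] * p[b] for p in centered) / (n - 1) for b in range(2)]
            for a in range(2)]


def first_component(cov, iterations=100):
    """Power iteration: repeatedly apply the covariance matrix and normalise."""
    v = [1.0, 1.0]
    for _ in range(iterations):
        w = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return v


centered = center(POINTS)
component = first_component(covariance(centered))
# Dimensionality reduction: each 2-D point x becomes a single number x'.
reduced = [p[0] * component[0] + p[1] * component[1] for p in centered]
print(component, reduced)
```

The component comes out close to the direction (1, 2), so one number per point captures almost all of the variation in the original two coordinates.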

Deep Learning and Feature learning

Deep Learning can hence be seen as a more complete, hierarchical and ‘bottom up’ way of extracting features without human intervention. Deep Learning is a form of Pattern Recognition system, and the performance of a pattern recognition system heavily depends on feature representation. In the past, manually designed features were used for image and video processing. These rely on human domain knowledge and it is hard to manually tune them. Thus, developing effective features for new applications is a slow process. Deep learning overcomes this problem of feature extraction. Deep learning also distinguishes multiple factors and a hierarchy in video and audio data; for example, Objects (sky, cars, roads, buildings, pedestrians) and parts (wheels, doors, heads) can be decomposed from images. For this task, more layers provide greater granularity. For example, GoogLeNet has more than 20 layers.

Source: ELEG 5040 Advanced Topics on Signal Processing (Introduction to Deep Learning) by Xiaogang Wang

Deep learning and Classification techniques

None of the deep learning models discussed here work as classification algorithms. Instead, they can be seen as Pretraining, automated feature selection and learning, creating a hierarchy of features etc. Once trained (features are selected), the input vectors are transformed into a better representation and these are in turn passed on to a real classifier such as SVM or Logistic regression. This can be represented as below.

 Source: ELEG 5040 Advanced Topics on Signal Processing (Introduction to Deep Learning) by Xiaogang Wang

Advances in Hardware

Another major source of innovation in Deep learning networks is Hardware. The impact of hardware on Deep Learning is a complex topic, but two examples are the Qualcomm Zeroth platform, which brings cognitive and Deep learning capabilities to devices including Mobile devices, and NVIDIA cuDNN for GPU Accelerated Deep Learning.

DBNs to pre-train DNNs

Finally, Deep learning techniques have synergies amongst themselves. We explained DBNs and DNNs above. DBNs and DNNs can be used in conjunction, i.e. a Deep Belief Net (which uses RBMs for layer-wise training) can be used as the pre-training method for a Deep neural network.


This paper is a part of a series covering Deep Learning applications for Smart cities/IoT with an emphasis on Security (human activity detection, surveillance etc). Subsequent parts of this paper will cover human activity detection and Smart cities. The content is a part of a personalized Data Science course I teach (online and offline) Personalized Data Science for Internet of Things course. I am also looking for academic collaborators to jointly publish similar work. If you want to be a part of the personalized Data Science course or collaborate academically,  please contact me at ajit.jaokar at or connect with me on Linkedin Ajit Jaokar

PS – This paper is available as a pdf Evolution of Deep Learning Models  

© 2017 Data Science Central