Deep Learning is one of the most disruptive technologies to emerge from Data Science. Essentially, it is a class of algorithms loosely inspired by how the human brain works, and it has the potential to automate many tasks that currently require human judgment. It is what enables self-driving cars to function and what allows Spotify to create highly customized playlists and recommendations. It is how YouTube is able to identify faces and animals in videos and how Siri can understand and process free-form speech in milliseconds. Deep Learning has also led to several recent advancements in healthcare.
Deep Learning can be used to improve the efficiency of traditional models by automatically extracting new features. Processes that previously took experts prohibitive amounts of time can now be automated and accomplished in a fraction of the time. It can also be used for highly accurate forecasting and prediction. While the goal of this paper is to demystify Deep Learning, it does require a certain level of mathematical knowledge and is intended for a more technical audience.
The history of Neural Networks (NN) is long, and their origins date back to the 1940s. However, a functioning version of a Neural Network did not appear until 1957. The Perceptron, developed by Frank Rosenblatt, Charles Wightman, and others, was a two-layer network designed for pattern recognition. Perceptrons are limited in the sense that the output is essentially a thresholded linear combination of the inputs. Hence, Perceptrons could not handle classification tasks that require nonlinear decision boundaries. This inherent limitation led to the first slump in this area of research.
The final blow of this era came from a campaign by Marvin Minsky and Seymour Papert to get funding diverted to the field of Artificial Intelligence. In their 1969 book, Perceptrons, they showed that a Perceptron is unable to solve a simple XOR problem (Figure 1). This led to a rapid decrease in the number of active researchers, and the field went underground until the 1980s.
Figure 1: There is no way to partition the 1’s and 0’s into their own groups using a straight line.
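The XOR limitation can be reproduced in a few lines. The sketch below is only an illustration, not Rosenblatt's original setup: it assumes the textbook perceptron learning rule with a step activation, and the names `train_perceptron` and `accuracy` are hypothetical. Trained on AND, which is linearly separable, the rule finds a perfect separator. Trained on XOR, it never can, because no straight line separates the two classes.

```python
from itertools import product

def train_perceptron(samples, epochs=100, lr=1.0):
    """Textbook perceptron rule for 2-input samples; returns (w1, w2, bias)."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            delta = target - pred
            if delta != 0:
                # Nudge the separating line toward the misclassified point.
                w1 += lr * delta * x1
                w2 += lr * delta * x2
                b += lr * delta
                errors += 1
        if errors == 0:   # every sample classified correctly: stop early
            break
    return w1, w2, b

def accuracy(weights, samples):
    """Fraction of samples the trained perceptron classifies correctly."""
    w1, w2, b = weights
    hits = sum((1 if w1 * x1 + w2 * x2 + b > 0 else 0) == t
               for (x1, x2), t in samples)
    return hits / len(samples)

inputs = list(product([0, 1], repeat=2))
AND = [(x, x[0] & x[1]) for x in inputs]
XOR = [(x, x[0] ^ x[1]) for x in inputs]

and_acc = accuracy(train_perceptron(AND), AND)   # separable: reaches 1.0
xor_acc = accuracy(train_perceptron(XOR), XOR)   # not separable: stays below 1.0
print(and_acc, xor_acc)
```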
In the early 1980s, Neural Networks saw a renaissance due to the work of John Hopfield and other established researchers. Hopfield popularized the idea of a hybrid network with multiple layers. The most common architecture of such a network is known as a Feed Forward Neural Network, or FNN.
Figure 2: A Feed Forward Neural Network
A Feed Forward Neural Network consists of an input layer, an output layer, and any number of hidden layers (usually just one in a typical, non-deep neural network). The input layer is the first layer, while the output layer is the last layer. Each layer sends its outputs to the next layer, and no layer communicates with the layers “behind” it. The benefit of such a network is that every neuron in a hidden layer is a nonlinear function of the previous layer. While any level of non-linearity can in principle be described using a single hidden layer, the richness of the features is greatly enhanced when multiple layers are used. To use an analogy, human brains excel at capturing raw data at a sensor level (e.g. eyes and ears) and using layers of neurons to build multiple levels of representation of the input information using meta-features. Due to this, the brain perceives a different but more relevant description of the input. In a Neural Network, each layer creates a better representation of the output of the previous layer. In that sense, Neural Networks can be seen as feature generators. While a perceptron is a single neuron that can only separate data linearly, an FNN can separate data in a non-linear fashion, which allows for much broader applications.
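To make the non-linear separation concrete, here is a minimal sketch of a forward pass through an FNN with one hidden layer that computes XOR. The weights are hand-picked for illustration rather than learned, and the ReLU activation is an assumption of this example (the text above does not prescribe an activation function).

```python
def relu(z):
    """Rectified linear unit: the nonlinearity assumed for this sketch."""
    return max(0.0, z)

def xor_network(x1, x2):
    # Hidden layer: two neurons, each a nonlinear function of the inputs.
    h1 = relu(x1 + x2)         # weights (1, 1), bias 0
    h2 = relu(x1 + x2 - 1.0)   # weights (1, 1), bias -1
    # Output layer: a linear combination of the hidden activations.
    return h1 - 2.0 * h2

outputs = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
print(outputs)   # 1.0 for (0, 1) and (1, 0), 0.0 for (0, 0) and (1, 1)
```

Removing the hidden layer's nonlinearity collapses the network back into a single linear combination of the inputs, which is exactly the perceptron limitation described earlier.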
This architecture led to a new problem of how to train the multiple layers on data, which eventually led to the idea of back propagation, in which the error measured at the output is passed backward through the network, layer by layer. To train an FNN on a training set, one needs to find the weight vectors that minimize some error metric between predicted and actual values.
When a sample is run through an FNN during back propagation, the error is calculated, and the weight vectors of the output layer are adjusted to reduce this error. The error signal is then propagated backwards through the network so that the weight vectors of all previous layers can be adjusted as well. This has to be repeated many times over the training data to achieve the desired accuracy. The adjustment of the weights is carried out using stochastic gradient descent:
$$w \leftarrow w - \eta \, \nabla_w \mathrm{Err}$$
where $\eta$ is called the learning rate and Err is the chosen error function, typically squared error. The gradient of Err is taken with respect to the weight vectors.
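The update rule can be sketched for a small FNN with one hidden layer. This is a minimal illustration under several assumptions that are not in the text: sigmoid activations, a network of 2 inputs, 3 hidden nodes and 1 output, XOR as the training data, and full-batch gradient descent for brevity (a stochastic version would apply the same step to one sample or a small batch at a time). All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    H = sigmoid(X @ W1 + b1)   # hidden activations
    Y = sigmoid(H @ W2 + b2)   # network outputs
    return H, Y

def squared_error(params, X, T):
    _, Y = forward(params, X)
    return 0.5 * np.sum((Y - T) ** 2)

def backprop(params, X, T):
    """Gradients of the squared error with respect to every weight and bias."""
    W1, b1, W2, b2 = params
    H, Y = forward(params, X)
    delta_out = (Y - T) * Y * (1.0 - Y)              # error signal at the output layer
    delta_hid = (delta_out @ W2.T) * H * (1.0 - H)   # error signal passed back one layer
    return [X.T @ delta_hid, delta_hid.sum(axis=0),
            H.T @ delta_out, delta_out.sum(axis=0)]

def gradient_step(params, grads, lr):
    """w <- w - lr * gradient, applied to each parameter array."""
    return [p - lr * g for p, g in zip(params, grads)]

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
params = [rng.normal(size=(2, 3)), np.zeros(3),
          rng.normal(size=(3, 1)), np.zeros(1)]

initial_error = squared_error(params, X, T)
for _ in range(2000):
    params = gradient_step(params, backprop(params, X, T), lr=0.5)
final_error = squared_error(params, X, T)
print(initial_error, final_error)
```

The two `delta` lines are where the backward pass happens: the output error is computed first and then pushed back through the output weights to obtain the error signal for the hidden layer.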
Unfortunately, back propagation with multiple hidden layers posed substantial challenges such as vanishing gradients and over-fitting. For example, when the error is propagated back through multiple layers, the gradient shrinks toward zero after the first couple of layers. This made convergence slower and more likely to end at a sub-optimal solution, with a high chance of getting stuck in local minima and over-fitting the model.
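The vanishing gradient can be seen with simple arithmetic. The sketch below assumes a deliberately simplified network: a chain of single-neuron sigmoid layers with every weight set to 1 (the function name `input_gradient` is hypothetical). Because the derivative of the sigmoid never exceeds 0.25, the chain rule multiplies the gradient by at most 0.25 per layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, x=0.5, weight=1.0):
    """d(output)/d(input) for a chain of `depth` single-neuron sigmoid layers."""
    activation, grad = x, 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: each layer multiplies the gradient by w * sigmoid'(z),
        # and sigmoid'(z) = a * (1 - a) is never larger than 0.25.
        grad *= weight * activation * (1.0 - activation)
    return grad

gradients = {depth: input_gradient(depth) for depth in (1, 2, 5, 10)}
for depth, grad in gradients.items():
    print(depth, grad)
```

With ten layers the gradient reaching the first layer is already below 0.25 to the tenth power, roughly one millionth of the signal at the output.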
In the mid-2000s, the problem of optimizing these networks was largely overcome, in parallel, by researchers from Stanford and the University of Toronto. The researchers figured out how to train each layer, one at a time, and used back propagation as a fine-tuning step. This made unsupervised training and automatic feature extraction possible with these types of networks. During this time, the term Deep Learning came into use to describe these deep-layered Neural Networks. The two main solutions were Auto Encoders and Restricted Boltzmann Machines. Both techniques are unsupervised, unlike back propagation on labeled data, which is supervised.
The idea behind Auto Encoders is to build richer feature sets that are by design more compact than the input. This follows the argument made earlier regarding the human brain striving to create such compact representations for efficient reasoning. An Auto Encoder consists of an encoder and a decoder. In its simplest form, this amounts to three layers of neurons: an input layer, a hidden layer, and an output layer.
Figure 3: Example of an Auto Encoder.
An Auto Encoder typically has fewer nodes in the hidden layer than in the input and output layers, which share the same number of nodes. The reason for this is that Data Scientists typically seek to reduce the dimensionality of the data. For example, if the activation functions of the k nodes in the hidden layer are linear, then the auto-encoder essentially reproduces PCA, mapping the variables onto the subspace spanned by the first k principal axes. However, if the activation function is non-linear, the auto-encoder can capture multi-modal behavior in the input data.
As we do not have explicit target labels, we set the targets equal to the input and force the hidden layer to contain fewer nodes. After an input vector, x, is entered into the auto-encoder, a hidden vector, y, is created by the hidden layer. This hidden vector is the new encoding of the data in terms of learned features. On the output layer, the hidden vector is used to produce a reconstruction, x’, of the input vector. To train the auto-encoder, an error function is defined between the output and input vectors, typically squared error. This concept can be extended to multiple layers, where each subsequent layer ‘encodes’ the previous layer using significantly fewer neurons. Once every layer has been trained in this way, back propagation is used to fine-tune all the weights, as described for FNNs. This simple reduction of nodes at each layer, along with unsupervised learning, has enabled remarkable automated feature engineering and has dramatically outperformed the past 30 years of human feature engineering in many tasks.
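The training procedure for a single auto-encoder layer can be sketched as follows. This is a minimal illustration under assumptions of our own: a sigmoid hidden layer, a linear output layer, full-batch gradient descent, and synthetic 8-dimensional data that really has only 2 degrees of freedom. The function name `train_autoencoder` and all parameter values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=0.05, epochs=1000, seed=0):
    """Train a one-hidden-layer auto-encoder, using the input as its own target."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
    b_dec = np.zeros(n_in)
    history = []
    for _ in range(epochs):
        Y = sigmoid(X @ W_enc + b_enc)   # hidden vector y (the encoding)
        X_rec = Y @ W_dec + b_dec        # reconstruction x'
        diff = X_rec - X
        history.append(0.5 * np.mean(np.sum(diff ** 2, axis=1)))
        # Back propagation of the squared reconstruction error.
        d_rec = diff / len(X)
        d_hid = (d_rec @ W_dec.T) * Y * (1.0 - Y)
        W_dec -= lr * (Y.T @ d_rec)
        b_dec -= lr * d_rec.sum(axis=0)
        W_enc -= lr * (X.T @ d_hid)
        b_enc -= lr * d_hid.sum(axis=0)
    return (W_enc, b_enc, W_dec, b_dec), history

# Synthetic data: 8 observed variables driven by 2 underlying factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)

params, history = train_autoencoder(X, n_hidden=3)
codes = sigmoid(X @ params[0] + params[1])   # compact 3-dimensional features
print(history[0], history[-1])
```

For a stacked version, `codes` would become the input to the next auto-encoder, which is the layer-by-layer training described above.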
For example, the image below shows a representation of a stack of layers in a deep neural network (DNN), with each layer being more compact than the previous one. Let us assume that the task at hand is to identify objects in images. If we reconstruct sections of neurons at different points in the network, we can find a progression of feature hierarchies, from corners and edges up to features related to human and cat faces.
Figure 4: Hierarchical feature representation in deep neural networks. (Source: Google)
The inspiration for Sparse Auto Encoders came from the realization that the human brain does not rely on the same features for every recognition task. For example, we use different distinguishing features to remember different faces. Analogously, not all nodes in the hidden layer need to activate for every input vector. To enforce this in a more formal way, Sparse Auto Encoders have many more hidden nodes. The activation function of the nodes in the hidden layer takes values between 0 and 1, which can be thought of as the neuron not firing or firing, respectively. An average firing rate, ᾱ, can be defined for a node by averaging its activation over all input vectors. In Sparse Auto Encoders, we force the average firing rate toward a fixed, small value, α, which causes most of the hidden nodes to be inactive for any given input. Because of this, a Sparse Auto Encoder may have more hidden nodes than input nodes, with only a small fraction of them active at a time. This is achieved by adding a penalty to the error function whenever ᾱ deviates from α. Imposing this sparsity on the Auto Encoder can significantly improve performance on classification tasks.
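The text does not say which penalty is used. One common choice, assumed here purely for illustration, is the Kullback-Leibler divergence between the target firing rate and the observed average firing rate of a node. The function names and the sample activations below are hypothetical.

```python
import math

def average_firing_rate(activations):
    """Mean activation of one hidden node over a set of input vectors."""
    return sum(activations) / len(activations)

def sparsity_penalty(alpha, alpha_bar):
    """KL divergence between target rate alpha and observed rate alpha_bar."""
    return (alpha * math.log(alpha / alpha_bar)
            + (1.0 - alpha) * math.log((1.0 - alpha) / (1.0 - alpha_bar)))

target = 0.05
# Activations of two hidden nodes over four input vectors.
quiet_node = average_firing_rate([0.02, 0.10, 0.03, 0.05])   # averages about 0.05
busy_node = average_firing_rate([0.90, 0.70, 0.80, 0.60])    # averages about 0.75

quiet_penalty = sparsity_penalty(target, quiet_node)   # essentially zero
busy_penalty = sparsity_penalty(target, busy_node)     # clearly positive
print(quiet_penalty, busy_penalty)
```

In training, the penalties of all hidden nodes would be summed, scaled by a weighting factor, and added to the reconstruction error, so that nodes that fire too often are pushed back toward the target rate.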
A Restricted Boltzmann Machine (RBM) is a generative stochastic Artificial Neural Network popularized by Geoffrey Hinton’s team at the University of Toronto. It consists of two layers of neurons, a visible layer and a hidden layer. The purpose of an RBM is to learn a probability distribution over its set of inputs. RBMs strive to build a more noise-resistant model, on the premise that small perturbations in the inputs should not affect the prediction. Returning to the human brain analogy, humans have an amazing ability to recognize objects in noisy environments (e.g. recognizing someone by his/her face even as the face changes because of age, grooming habits, etc.).
Training an RBM consists of optimizing the weights such that the probability assigned to the training set is maximized. One way to accomplish this is to use a method known as Contrastive Divergence (CD). This involves taking an input vector v and computing the probabilities of the hidden units to get a sample hidden vector h. Hence, the outputs are activated in a stochastic manner, where the probability of activation depends on a nonlinear combination of the inputs. Then, using the generated outputs, an input vector v’ is reconstructed. Note that v’ is close to the original input, v, but not the same. We run v’ through the network again to get a new hidden vector, h’. We use the difference between the outer products of (v,h) and (v’,h’) to update the weights. CD can be carried out any number of times to minimize the difference between h and h’.
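A single Contrastive Divergence step can be sketched as below. This is a minimal illustration, assuming binary units with sigmoid activation probabilities and a single input vector per update. It also uses the hidden probabilities rather than the sampled hidden states in the outer products, a common variance-reduction choice that the description above does not specify. The function name `cd1_update`, the bias terms, and the learning rate are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v, W, b_vis, b_hid, rng, lr=0.1):
    """One Contrastive Divergence step (CD-1) for a binary RBM on one input v."""
    p_h = sigmoid(v @ W + b_hid)                          # P(h = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)       # stochastic hidden sample h
    p_v = sigmoid(h @ W.T + b_vis)                        # P(v' = 1 | h)
    v_rec = (rng.random(p_v.shape) < p_v).astype(float)   # reconstructed input v'
    p_h_rec = sigmoid(v_rec @ W + b_hid)                  # hidden probabilities h' for v'
    # The difference of the outer products (v, h) and (v', h') drives the update.
    W_new = W + lr * (np.outer(v, p_h) - np.outer(v_rec, p_h_rec))
    b_vis_new = b_vis + lr * (v - v_rec)
    b_hid_new = b_hid + lr * (p_h - p_h_rec)
    return W_new, b_vis_new, b_hid_new, v_rec

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(6, 3))    # 6 visible units, 3 hidden units
b_vis, b_hid = np.zeros(6), np.zeros(3)
v = np.array([1., 1., 0., 0., 1., 0.])

W, b_vis, b_hid, v_rec = cd1_update(v, W, b_vis, b_hid, rng)
print(v_rec)
```

In practice the step is repeated over many training vectors and many passes, and running more than one reconstruction per update gives the CD-k variants.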
Soothsayer Analytics engages with companies to solve challenging Data Science problems and actively works with state-of-the-art techniques. We help clients predict the future, optimize their business, and identify micro-patterns and hidden connections that traditional methods cannot. We build custom algorithms and analytic tools, contribute to advanced R&D, and help companies build internal Analytic Centers of Excellence. All of our Data Scientists hold a PhD or Masters and have a heavy background in Mathematics and Programming.