On-going Developments and Outlook for Deep Learning


Deep Learning is a fast-developing field with a large number of architectural variants, so it helps to mention some of the other leading algorithms. The list below is intended to be representative rather than exhaustive, since so many algorithms are being developed [1],[2].

  1. Deep High-order Neural Network with Structured Output (HNNSO)
  2. Deep convex network
  3. Spectral networks
  4. NoBackTrack algorithm for the online training of recurrent neural networks (RNNs)
  5. Neural reasoner
  6. Recurrent neural networks
  7. Long short-term memory (LSTM)
  8. Hidden Markov models
  9. Deep belief network
  10. Convolutional deep networks
  11. LAMSTAR (Large Memory Storage and Retrieval) neural networks, which are increasingly being used in medical and financial applications
  12. Deep Q-network agent, used by Google DeepMind and based on reinforcement learning, which has its roots in behavioural psychology

Additionally, fuzzy logic models can be combined with other models, such as decision trees, hidden Markov models, Bayesian models and artificial neural networks, to model complicated risk issues like policyholder behaviour. A risk assessment and decision-making platform for ratemaking built on a fuzzy logic system can provide consistency when analyzing risks with limited data and knowledge. It allows people to focus on the foundation of risk assessment, which involves the cause-and-effect relationship between key factors as well as the exposure for each individual risk. Rather than requiring a direct input for the likelihood and severity of a risk event, it supports human reasoning from the facts and knowledge to the conclusion in a comprehensive and reliable manner [3].

Hyper-parameter tuning

One issue with Deep Learning, however, is finding the optimum hyper-parameters. The search space is very large, and it is difficult and computationally intensive to understand the effect of each hyper-parameter in depth. We also cannot write down a closed-form formula for the response surface that we are optimizing.

One potential solution that the author of this report identifies is the use of a genetic algorithm to find optimal hyper-parameters. Genetic algorithms are already used for GLMs in the R package ‘glmulti’ to select the optimum GLM equation according to a given criterion, usually the Akaike Information Criterion or the Bayesian Information Criterion.

Moreover, a related algorithm has been used to optimize both the structure and the weights of a neural network. ES-HyperNEAT (Evolvable-Substrate Hypercube-based NeuroEvolution of Augmenting Topologies), developed by Sebastian Risi and Kenneth Stanley, uses a genetic algorithm to do this. Following from this, perhaps the ES-HyperNEAT framework can be extended to Deep Learning, so that a genetic algorithm optimizes both the structure and the weights of deep neural networks as well [4].
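To make the genetic-algorithm idea concrete, here is a minimal sketch in Python. It is not taken from glmulti or ES-HyperNEAT. The function `validation_loss` is a hypothetical stand-in for "train a network and measure its validation loss" (here a toy surrogate, so that the example runs on its own), and the hyper-parameter names, bounds, population size and mutation scale are all illustrative assumptions.

```python
import random

# Hypothetical search space: (low, high) bounds per hyper-parameter.
BOUNDS = {"log_lr": (-5.0, -1.0), "dropout": (0.0, 0.8)}


def validation_loss(params):
    """Toy surrogate for 'train a network and return its validation loss'.

    Assumption for illustration only: the optimum is log_lr=-3, dropout=0.5.
    """
    return (params["log_lr"] + 3.0) ** 2 + (params["dropout"] - 0.5) ** 2


def random_individual(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def crossover(a, b, rng):
    # Uniform crossover: each gene is copied from one of the two parents.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in BOUNDS}


def mutate(ind, rng, scale=0.1):
    # Gaussian mutation, clipped back into the allowed range.
    child = {}
    for k, (lo, hi) in BOUNDS.items():
        value = ind[k] + rng.gauss(0.0, scale * (hi - lo))
        child[k] = min(hi, max(lo, value))
    return child


def genetic_search(generations=30, pop_size=20, elite=4, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=validation_loss)
        parents = population[:elite]  # the fittest survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = parents + children
    return min(population, key=validation_loss)


best = genetic_search()
```

In a real setting each call to `validation_loss` would involve training a model, so the population size and number of generations would be constrained by the available compute budget.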

Another problem is over-fitting. Aside from dropout, machine unlearning can also be used to address this. Machine unlearning places a layer consisting of a small number of summations between the training data and the learning algorithm, so that the direct dependency between the two is eliminated. The learning algorithm then depends only on the summations rather than on the individual data points, from which over-fitting can arise more easily. Forgetting a data point requires updating only the affected summations; no retraining or remodelling is required [5].
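The summation idea can be illustrated with a very small sketch. It assumes a deliberately simple learner, a nearest-centroid classifier, whose model depends on the training data only through per-class sums and counts. The class `SummationCentroids` and its method names are hypothetical and are not taken from [5]; they only show how forgetting a sample reduces to updating a couple of summations.

```python
from collections import defaultdict


class SummationCentroids:
    """Nearest-centroid classifier stored in summation form (illustrative)."""

    def __init__(self, dim):
        self.dim = dim
        self.sums = defaultdict(lambda: [0.0] * dim)  # per-class feature sums
        self.counts = defaultdict(int)                # per-class sample counts

    def learn(self, x, label):
        self.sums[label] = [s + v for s, v in zip(self.sums[label], x)]
        self.counts[label] += 1

    def unlearn(self, x, label):
        # Remove one sample's contribution; no pass over the other data.
        self.sums[label] = [s - v for s, v in zip(self.sums[label], x)]
        self.counts[label] -= 1

    def centroid(self, label):
        n = self.counts[label]
        return [s / n for s in self.sums[label]]

    def predict(self, x):
        def sq_dist(label):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroid(label)))
        labels = [l for l in self.counts if self.counts[l] > 0]
        return min(labels, key=sq_dist)


model = SummationCentroids(dim=2)
for point, label in [([0.0, 0.0], "a"), ([2.0, 0.0], "a"), ([10.0, 10.0], "b")]:
    model.learn(point, label)

model.learn([100.0, 100.0], "a")    # a sample we later want to forget
model.unlearn([100.0, 100.0], "a")  # centroid of "a" is back at [1.0, 0.0]
```

The model never stores the individual training points, so removing one is a constant-time update rather than a retraining run.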

There are now many other hyper-parameter tuning algorithms [6], such as Spearmint, Sequential Model-based Algorithm Configuration (SMAC) and the Tree-structured Parzen Estimator (TPE). TPE is implemented in the ‘Hyperopt’ library in Python, while Spearmint and SMAC are available as separate packages. Research shows that, overall, Spearmint performed best on the low-dimensional continuous problems [7]. On the higher-dimensional problems, which also include conditional parameters, SMAC and TPE performed better [8].

The machine intelligence approach advocated by Ayasdi (a start-up company founded by Stanford professors) combines topology with machine learning to achieve data-driven insights instead of hypothesis-driven insights. Machine learning on its own has significant limitations. Clustering, for example, requires an arbitrary choice of the number of clusters, which the analyst has to specify (a form of hyper-parameter tuning). With dimensionality reduction techniques, the danger lies in missing subtle insights contained in the data that could prove very useful to the analysis. Combining topology with machine learning overcomes these drawbacks effectively [9].

The topological visualizations capture the subtle insights in the data while also representing its global behaviour. From the nodes of the topological network diagram built from the data, clusters are identified, and each cluster is fitted with the model that suits it best, so that instead of a one-size-fits-all model, different models are applied to different regions of the data for maximum predictive power [10]. Perhaps this machine intelligence approach can also be applied to Deep Learning to optimize hyper-parameter tuning.

Dark Knowledge [11],[12]

A very interesting recent development is Hinton’s use of ‘dark knowledge’ to characterise Deep Learning. Hinton claims that ‘dark knowledge’ is what most Deep Learning models actually learn. Hinton et al. show that classifiers built on the softmax function contain a lot of hidden information, and that the correlations in the softmax outputs, that is, the relative probabilities assigned to the incorrect classes, are very informative. For instance, when training a classifier to distinguish man, woman and alien, the outputs for man and woman will be more strongly correlated with each other than either is with alien, because a man and a woman, both being human, look more similar to each other than to an alien.

Hinton also highlights the earlier research by Caruana et al. [13] and tries to take it further. The main idea is that a complex model can be used to train a simpler model. Once a core Deep Learning model is trained, we do not need to train new models from scratch on the huge datasets. Instead, we can train a new ‘student’ model to mimic the ‘teacher’ model. This could lead to major breakthroughs in applications, as many Internet of Things components and smartphones do not have the memory or processing power to run large Deep Learning models on the data that they capture. They can instead run a compact student model that relies on the ‘teacher’ model and still gives accurate results. Of course, the teacher model needs to be accurate for the student model to be accurate as well, and we would still need a large data set, which can be unlabelled, on which the student learns to mimic the teacher.

A very effective way to increase the performance and accuracy of Deep Learning, as of any machine learning algorithm, is to train many different models on the same data and then average their predictions. This ensemble of models, termed the teacher model, is too time-consuming, memory-intensive and cumbersome to be deployed to a large number of users. Once the teacher model is trained, the student model mimics it via ‘distillation’ of the knowledge. Instead of training the student model on the labelled data, we can train it against the posteriors of the teacher model. The output distribution of the softmax function, which contains the dark knowledge, is used as the target. However, as these posterior estimates have low entropy, the softmax is recomputed at a raised ‘temperature’ (the logits are divided by a temperature greater than one before the softmax is applied) to expose the dark knowledge and transfer it to the student model. The softened outputs reveal the dark knowledge in the ensemble, as the soft targets contain almost all of the knowledge.
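The effect of raising the temperature can be sketched in a few lines of Python. The logits below are made-up numbers for the man, woman and alien classes, and the temperature `T = 4.0` is an assumed value chosen for illustration; in practice it is a tuned hyper-parameter. The loss shown is the soft-target cross-entropy only, with the teacher and student softmaxes both evaluated at the same temperature.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cross_entropy(target, predicted):
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


# Hypothetical logits for the classes (man, woman, alien).
teacher_logits = [9.0, 5.0, -2.0]
student_logits = [4.0, 3.0, 0.5]
T = 4.0  # assumed temperature

hard = softmax(teacher_logits)     # T = 1: close to one-hot, low entropy
soft = softmax(teacher_logits, T)  # T > 1: 'woman' receives visible mass

# Soft-target loss: the student is trained to match the softened teacher.
loss = cross_entropy(soft, softmax(student_logits, T))
```

At `T = 1` nearly all of the probability sits on the top class, so the similarity between man and woman is barely visible; at the higher temperature the ordering of the wrong classes becomes a usable training signal.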

Another way to harness the dark knowledge contained in Deep Learning is through specialist networks. A specialist network is a model trained specifically to identify and classify certain specialist classes. For instance, while a generic Deep Learning model can distinguish animals from humans, a specialist network will aim to distinguish mammals from other animals, and so on. A sufficient number of specialists can be trained on the data and their classifications averaged to arrive at accurate results, rather than relying on a single generic Deep Learning model, especially when the data has a lot of classes and suffers heavily from class imbalance.

In Hinton’s revised approach, a number of adjustments are made. Each data point is assigned a target that matches the temperature-adjusted softmax output. These softmax outputs are then clustered multiple times using k-means, and the resulting clusters indicate easily confusable data points that come from a subset of classes. Specialist networks are then trained only on the data in these clusters, using a restricted number of classes; they treat all classes not contained in the cluster as coming from a single “other” class. These specialist networks are trained using an alternating one-hot, temperature-adjusted technique. The ensemble constructed by combining the various specialist networks improves the performance of the overall network.
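The collapsing of out-of-cluster classes into a single “other” class can be sketched as follows. This is only an illustration of that one step, under the assumption that the specialist is given soft targets over the full class set; the function name `specialist_targets` and the example classes are hypothetical, and the clustering that selects the specialist's classes is not shown.

```python
def specialist_targets(soft_targets, class_names, specialist_classes):
    """Restrict a full soft-target vector to a specialist's classes.

    The probability mass of every class outside `specialist_classes`
    is merged into a single "other" bucket.
    """
    targets = {name: 0.0 for name in specialist_classes}
    targets["other"] = 0.0
    for name, p in zip(class_names, soft_targets):
        key = name if name in specialist_classes else "other"
        targets[key] += p
    return targets


class_names = ["cat", "dog", "car", "truck"]
soft = [0.5, 0.25, 0.125, 0.125]  # made-up temperature-softened targets

# A specialist for the easily confused classes "cat" and "dog".
animal_targets = specialist_targets(soft, class_names, ["cat", "dog"])
# animal_targets == {"cat": 0.5, "dog": 0.25, "other": 0.25}
```

Because the specialist's output layer has only its own classes plus “other”, it is much smaller than the generic model's and can concentrate its capacity on the distinctions inside its cluster.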

[1] Mayo M, Larochelle H (Oct 2015) KDNuggets.com. Top 5 arXiv Deep Learning Papers explained.

[2] Mayo M, Larochelle H (Jan 2016) KDNuggets.com. 5 more arXiv Deep Learning Papers explained.

[3] Shang K, Hossen Z (2013) CAS/CIA/SOA Joint Risk Management Section. Applying Fuzzy Logic to Risk Assessment and Decision-Making.

[4] Risi, S. and Stanley, K. University of Central Florida; The ES-HyperNEAT Users Page

[5] Cao and Yang (2015) IEEE Symposium on Security and Privacy, pp. 463-480. Towards making systems forget with machine unlearning.

[6] Katharina Eggensperger et al. Towards an Empirical Foundation for Assessing Bayesian Optimization of Hyperparameters.

[7] J. Snoek, H. Larochelle, and R.P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. of NIPS’12, 2012.

[8] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD’13, 2013.

[9] Ayasdi: Topology & Topological Data analysis

[10] Ibid

[11] Dark Knowledge: Hinton et al; Google Inc. Presentation.

[12] Distilling the Knowledge in a Neural Network: Hinton et al: arXiv:1503.02531v1 [stat.ML] 9 Mar 2015

[13] Caruana et al: Do deep nets really need to be deep? arXiv:1312.6184v7 [cs.LG] 11 Oct 2014

