Limitations of Deep Learning and strategic observations

While Deep Learning has proven very powerful in applications, the theory and mathematics underlying it remain obscure. Deep Learning works, but we do not yet understand well why it works. Some leading machine learning theorists, such as Vladimir Vapnik, criticise Deep Learning for an ad hoc approach that has a strong flavour of brute force rather than technical sophistication. Deep Learning is not theory-intensive; it is more empirically based (hence the clash of viewpoints between empiricism and realism) and relies on clever tweaks [1]. This is why Deep Learning is viewed as a black box, and why we preferred Theano over other packages: it gave us a better view into the workings of the model, although still not enough to fully overcome the black-box criticism.

Furthermore, Deep Learning comes with no theoretical guarantees. There is no proof that, starting from a random point and using stochastic gradient descent, we will converge rather than be trapped in a local minimum. By contrast, other machine learning algorithms such as logistic regression are guaranteed to converge over time, and linear regression yields an exact solution [2].
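The contrast with convex models can be sketched in a few lines of plain Python. The toy data, learning rate and starting points below are assumptions chosen purely for illustration. The sketch fits a one-variable linear regression exactly through the closed-form least squares solution, then shows that gradient descent from two very different starting points lands on the same answer, because the squared-error loss has a single minimum. It does not demonstrate the deep network case, where no such guarantee is available.

```python
def closed_form_fit(xs, ys):
    """Ordinary least squares for y = a*x + b, solved exactly."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def gradient_descent_fit(xs, ys, a0, b0, lr=0.05, steps=5000):
    """Minimise the mean squared error by plain (full-batch) gradient descent."""
    n = len(xs)
    a, b = a0, b0
    for _ in range(steps):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Illustrative toy data (an assumption, not taken from any real data set).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

exact = closed_form_fit(xs, ys)
from_start_1 = gradient_descent_fit(xs, ys, a0=-10.0, b0=10.0)
from_start_2 = gradient_descent_fit(xs, ys, a0=25.0, b0=-3.0)

print("closed form:      ", exact)
print("descent, start 1: ", from_start_1)
print("descent, start 2: ", from_start_2)
```

Both descent runs agree with the closed-form fit to within numerical tolerance. With a multi-layer network the loss surface is no longer convex, so different starting points may end in different places.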

I believe that Deep Learning, in its fundamental composition, resembles the tools of complexity science. Unlike many other machine learning algorithms, Deep Learning does not demand advanced mathematical maturity. Heuristics, tweaks and experiments are engineered on huge amounts of data, with a microscopic rather than a generic focus. The simple rules of Deep Learning thus interact with sophisticated data, break it into microscopic segments and recombine those segments into outputs. Complexity science holds that complex systems can arise from massive iterations and interactions between agents and data following very simple rules and heuristics. Deep Learning works from the bottom up, and neural networks are differentiable from top to bottom via back-propagation, which makes it possible to trace the contribution of each input to the output and to show which inputs move the output most when slightly increased or decreased.
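The idea of tracing each input's contribution through back-propagation can be sketched on a tiny network. Everything in the block is an assumption made for illustration: the weights are hypothetical fixed numbers rather than a trained model, and the network has only three inputs, two tanh hidden units and one linear output. The sketch applies the chain rule by hand to get the derivative of the output with respect to each input, and checks it against a finite-difference estimate.

```python
import math

# Hypothetical fixed weights: 3 inputs -> 2 tanh hidden units -> 1 linear output.
W1 = [[0.5, -1.0, 0.1],
      [0.3, 0.8, -0.05]]
B1 = [0.0, 0.1]
W2 = [1.5, -0.7]
B2 = 0.2


def forward(x):
    """Return the network output and the hidden activations for input x."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, B1)]
    hidden = [math.tanh(p) for p in pre]
    out = sum(w * h for w, h in zip(W2, hidden)) + B2
    return out, hidden


def input_gradient(x):
    """Back-propagate from the output to the inputs: d(out)/d(x_i)."""
    _, hidden = forward(x)
    # d(out)/d(pre_j) = W2[j] * (1 - tanh(pre_j)^2)
    delta = [w * (1.0 - h * h) for w, h in zip(W2, hidden)]
    # d(out)/d(x_i) = sum over hidden units j of delta[j] * W1[j][i]
    return [sum(delta[j] * W1[j][i] for j in range(len(W1)))
            for i in range(len(x))]


def finite_difference(x, i, eps=1e-6):
    """Central-difference estimate of d(out)/d(x_i), used only as a check."""
    up = list(x)
    down = list(x)
    up[i] += eps
    down[i] -= eps
    return (forward(up)[0] - forward(down)[0]) / (2 * eps)


x = [0.2, -0.4, 1.0]
grads = input_gradient(x)
# Index of the input whose small change moves the output the most.
most_influential = max(range(len(x)), key=lambda i: abs(grads[i]))

for i, g in enumerate(grads):
    print(f"input {i}: backprop {g:+.4f}, finite difference {finite_difference(x, i):+.4f}")
print("most influential input:", most_influential)
```

The sign of each gradient says whether nudging that input up raises or lowers the output, and its magnitude says by how much for a small nudge. This is a local statement about one particular input point, not a global explanation of the model.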

It must also be emphasized that with sufficiently large data sets the risk of over-fitting is low. In addition, there are many techniques, such as dropout, for handling over-fitting. Validating the model on held-out test data provides a further accuracy and sanity check. Parallel computing and growing computing capacity mean that we no longer need a tightly closed hypothesis to check on small data, as was traditionally done in statistics. We can now pursue many hypotheses simultaneously through empirical experiments and go more microscopic on high-dimensional data sets. When we arrive at the frontier of our formal understanding, intuition-based Deep Learning can offer a path forward. There is always the chance that, with continued research, we will be able to decode these black boxes in the future.
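For readers unfamiliar with dropout, here is a minimal sketch of the common "inverted dropout" variant in plain Python. The function name, the drop probability and the toy activations are assumptions for illustration; real frameworks provide their own implementations. During training each activation is zeroed with probability `drop_prob` and the survivors are scaled up so the expected value is unchanged; at test time the layer passes values through untouched.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each value with probability drop_prob while training."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At test time the layer is an identity.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Survivors are divided by keep_prob so the expected activation is preserved.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)           # fixed seed so the run is repeatable
activations = [1.0] * 10000      # toy activations, all equal to 1

dropped = dropout(activations, drop_prob=0.5, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
unchanged = dropout(activations, drop_prob=0.5, rng=rng, training=False)

print("fraction zeroed:", zero_fraction)
print("mean after dropout:", mean_after)
```

Because a different random subset of units is silenced on every pass, the network cannot rely on any single unit, which is the intuition usually given for why dropout reduces over-fitting.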

Deep Learning has been followed with great hype by the media and spectators. The first wave of hype cast Deep Learning as an ‘apocalyptic’ algorithm, but recently opinion has swung to the negative. The media over-react in both directions. Researchers, however, are aware that a new algorithm first gains prominence through the unique value it adds, and that further research is bound to reveal its faults over time. This is a strength rather than a weakness, as researchers can then focus on overcoming those specific limitations to make such algorithms more powerful in the future.

Recent research papers highlighting limitations of Deep Learning include [3]:

‘Intriguing Properties of Neural Networks’ by Google’s Christian Szegedy and others.
‘Deep Neural Networks are Easily Fooled’ by Anh Nguyen of the University of Wyoming and others.

The first paper shows that images can be altered so subtly that humans cannot detect the change, yet the altered images are misclassified by a trained Convolutional Neural Network. To put this in its proper context, it is important to note that almost all machine learning algorithms are susceptible to such adversarially chosen examples. In logistic regression, the value of a particular feature can be deliberately set very high or very low to induce misclassification. If some features carry a lot of weight, even a minor change can lead to misclassification. Similarly, for decision trees, a single binary feature can be used to send an example to the wrong partition simply by switching it at the final split.

Hence, the proper context is that even machine learning models with theoretical guarantees and deep mathematical formulations are susceptible to such manipulations, so we should not be shocked that Deep Learning is vulnerable too.
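The logistic regression case described above can be sketched as follows. The weights, bias and input are hypothetical numbers chosen for illustration, and the perturbation is a simple sign-based shift rather than the procedure used in either paper. The sketch shows two things: a small change to the single heavily weighted feature flips the predicted class, and a small shift of every feature against the sign of its weight does the same.

```python
import math

# Hypothetical "trained" logistic regression; feature 0 carries most of the weight.
WEIGHTS = [4.0, -0.5, 0.3]
BIAS = -1.0


def predict_proba(x):
    """Probability of the positive class under the logistic model."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    return (value > 0) - (value < 0)


def shift_against_weights(x, step):
    """Move every feature by at most `step` in the direction that lowers the score."""
    return [xi - step * sign(w) for w, xi in zip(WEIGHTS, x)]


x = [0.4, 0.5, 1.0]
original = predict_proba(x)

# Case 1: nudge only the heavily weighted feature.
single_feature = list(x)
single_feature[0] -= 0.2
after_single = predict_proba(single_feature)

# Case 2: nudge every feature slightly, each against the sign of its weight.
all_features = shift_against_weights(x, step=0.2)
after_all = predict_proba(all_features)

print(f"original:               {original:.3f}")
print(f"one feature shifted:    {after_single:.3f}")
print(f"all features shifted:   {after_all:.3f}")
```

In both cases the prediction crosses the 0.5 threshold even though no feature moved by more than 0.2. A convolutional network on images has vastly more inputs to nudge, which is part of why the perturbation can be invisible to a human.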

The second paper, by Anh Nguyen and others, discusses the opposite case. The authors use optimisation methods such as gradient descent to generate gibberish images, which a trained network then assigns strongly to a class even though they should not belong to any class. This too is a genuine limitation of Deep Learning, but it is a drawback of other mainstream machine learning algorithms as well.

Strategic Observations

The more data we have, the more likely we are to drown in it. ~ Nassim Nicholas Taleb; Fooled by Randomness (2001)

It is no accident that serial best experts preferred qualitative, context-specific explanation while generally using models and statistics to beat everyone else at judging (herds use models and stats).

Unfortunately, not every data set can be made large merely because storage is cheap and our stored data lack reasonable privacy controls. Rare diseases remain rare and rare events remain rare. To identify such rare events maturely, qualitative profiling and insight are essential; modeling on its own is not enough. That something could not formerly be measured did not make it irrelevant. Recall the “beyond analysis” critique and advice of Kant, Jung, Berlin, Einstein and Goethe: intuition (experience and familiarity) links knowledge to understanding [4].

In the same vein, in The Black Swan, Taleb describes “Mediocristan” (Quadrants I and II) as a place where Gaussian distributions are applicable. By contrast, he calls Quadrant IV “Extremistan”. It is Extremistan that interests us when we seek to understand complex systems. Actuaries like to build their models on the Gaussian distribution; in doing so we are perhaps avoiding the exercise of professional expertise, fooling ourselves by retreating to the comfort and safety of the womb of Mediocristan instead of facing Extremistan in all its unknown mystery and ambiguity [5].

To avoid being ambiguity-averse, we can train ourselves to explore the unexplored. As actuaries, perhaps we could make a greater effort to uncover hidden patterns. Actuarial and statistical modeling is a double-edged sword. Applied correctly, it is a very powerful and effective tool for discovering knowledge in data, but in the wrong hands it can be distorted and generate absurd results. It is not only our results that can be absurd, but our risk-averse and ambiguity-averse mentalities as well [6]. As Voltaire said, “doubt is not a pleasant condition, but certainty is absurd.”

Aristotle explains this further: “It is the mark of an instructed mind to rest satisfied with that degree of precision which the nature of the subject limits, and not to seek exactness where only an approximation of the truth is possible.”

This teaches us to be aware that precision implies confidence, and to be very alert not to fall into this trap. While point estimates are often required (we have to quote and file a specific premium), there are many cases where ranges of estimates are more appropriate. While statistical techniques can sometimes be used to generate precise confidence intervals, for emerging risks statistical rigor is mostly not possible or even necessary. By discussing a range of estimates, actuaries can provide more value to their stakeholders by painting a more complete picture of the potential impacts of decisions related to emerging liabilities [7].
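Where some data do exist, one simple way to report a range instead of a single number is a bootstrap percentile interval. The sketch below is only an illustration under stated assumptions: the loss figures are invented, the function name is hypothetical, and the 5th to 95th percentile band is an arbitrary choice. It resamples the observed losses with replacement many times and reports the spread of the resampled means alongside the point estimate.

```python
import random


def bootstrap_mean_range(losses, n_resamples=2000, lower_q=0.05, upper_q=0.95, seed=0):
    """Percentile range of the mean, estimated by resampling with replacement."""
    rng = random.Random(seed)
    n = len(losses)
    means = []
    for _ in range(n_resamples):
        resample = [losses[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int(lower_q * n_resamples)]
    upper = means[int(upper_q * n_resamples) - 1]
    return lower, upper


# Invented loss amounts with one large claim (illustrative only).
losses = [120.0, 95.0, 310.0, 80.0, 150.0, 2200.0, 60.0, 135.0, 410.0, 90.0]

point_estimate = sum(losses) / len(losses)
low, high = bootstrap_mean_range(losses)

print(f"point estimate of mean loss: {point_estimate:.1f}")
print(f"bootstrap range (5th to 95th percentile): {low:.1f} to {high:.1f}")
```

The width of the range is driven almost entirely by whether the single large claim is drawn, which is a useful message for stakeholders in itself. The bootstrap can only reshuffle what has already been observed, so it says nothing about extreme events that are absent from the data; for emerging risks, judgment still has to widen the range.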

Finally, we must ensure that actuarial output highlights the fundamental questions at hand for stakeholders instead of confusing them with complicated numbers and a lack of decisiveness. There is obviously a premium to be established, but the management running the company does not care what the actual premium is; they need to know the likely impacts of that premium on the business. From a financial perspective we should avoid saying that we have priced for a certain margin, because that exact margin is, in the end, going to be exactly wrong! The better approach is to explain the range of possible outcomes and the impacts of each [8]. As Nassim Nicholas Taleb explains: “There are so many errors we can no longer predict, what you can predict is the effect of the error on you!”

[1] Lipton, Z.C. “Does Deep Learning Come from the Devil?” KDnuggets, Oct 2015.

[2] Lipton, Z.C. “Deep Learning and the Triumph of Empiricism.” KDnuggets, Jul 2015.

[3] Lipton, Z.C. “(Deep Learning’s Deep Flaws)’s Deep Flaws.” KDnuggets, Jan 2015.

[4] Werther. “Recognizing When Black Swans Aren’t: Holistically Training Management to Better Recognize, Assess and Respond to Emerging Extreme Events.” SOA, 2013.

[5] Mills, A. “Should Actuaries Get Another Job? Nassim Taleb’s Work and Its Significance for Actuaries.” SOA Predictive Analytics and Futurism Newsletter, Issue 1, 2009.

[6] Ibid.

[7] Hileman, G. “Roughly Right.” SOA Predictive Analytics and Futurism Newsletter, Issue 9, 2014.

[8] Ibid.
