To get a cohesive view of “data science”, it is useful to trace the origins of the tools and techniques used by its current practitioners. These techniques have primarily emerged out of four fields: statistics, electrical engineering, computer science and econometrics.
Here is a brief introduction to the tools coming from each of these fields.
1. Statistics
Historically, statistics has been divided into two schools of thought: frequentist and Bayesian. From frequentist statistics, we get the notions of the t-statistic and the p-value. Bayesian reasoning is fundamental to many approaches in data analysis, both in model building (example: Naïve Bayes) and in hypothesis testing.
Modern statistical learning has given us concepts like ridge regression, LASSO and kernel methods. Even the notion of compressed sensing emerged first in the statistical learning literature, before being adopted and enhanced by the statistical signal processing community.
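As a rough illustration of how ridge regression and the LASSO differ, the sketch below fits a single coefficient with no intercept, a case where both penalized estimates have closed forms. The function names, the toy data and the scaling of the objectives (a factor of 1/2 on the squared error) are assumptions made for this example, not something taken from the post.

```python
def _sums(x, y):
    """Return (sum of x*x, sum of x*y) for two equal-length sequences."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxx, sxy


def ols_1d(x, y):
    """Least-squares slope b minimizing 0.5 * sum((y - b*x)**2)."""
    sxx, sxy = _sums(x, y)
    return sxy / sxx


def ridge_1d(x, y, lam):
    """Ridge slope: adds the penalty 0.5 * lam * b**2, shrinking b toward 0."""
    sxx, sxy = _sums(x, y)
    return sxy / (sxx + lam)


def lasso_1d(x, y, lam):
    """LASSO slope: adds the penalty lam * abs(b).

    The minimizer is a soft-thresholded version of the OLS solution,
    so a large enough penalty sets the coefficient exactly to zero.
    """
    sxx, sxy = _sums(x, y)
    if sxy > lam:
        return (sxy - lam) / sxx
    if sxy < -lam:
        return (sxy + lam) / sxx
    return 0.0


# Toy data (hypothetical): y is exactly 2 * x.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

print("OLS:  ", ols_1d(x, y))
print("ridge:", ridge_1d(x, y, lam=14.0))
print("LASSO:", lasso_1d(x, y, lam=14.0))
print("LASSO with a heavy penalty:", lasso_1d(x, y, lam=30.0))
```

In this toy case ridge only shrinks the coefficient, while the LASSO can drive it all the way to zero, which is why the LASSO is often used for variable selection.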
2. Electrical Engineering
Data science related concepts in EE primarily came from two sub-fields: discrete time signal processing and statistical signal processing. The language of transforms (Fourier, z, wavelet) and frequency selective filter design come from discrete time signal processing. From statistical signal processing (which includes parameter detection and signal estimation), we get the Cramér-Rao bounds and Kalman/Particle/Wiener filters in addition to ML, MAP and MMSE estimators. Another important contribution to data science from EE is the famous Viterbi algorithm.
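To make the Kalman filter a little more concrete, here is a minimal sketch of the scalar case. It assumes a random-walk state observed in additive noise; the model, the function name `kalman_1d` and the default parameter values are illustrative choices, not taken from the post.

```python
def kalman_1d(measurements, x0=0.0, p0=1.0, q=0.0, r=1.0):
    """Scalar Kalman filter for a random-walk state observed in noise.

    Assumed model (for this sketch only):
        x[k] = x[k-1] + w,   w ~ N(0, q)   (process noise)
        z[k] = x[k]   + v,   v ~ N(0, r)   (measurement noise)

    x0 and p0 are the prior mean and variance of the state.
    Returns the list of filtered state estimates, one per measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the estimate carries over, its variance grows by q.
        p = p + q
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p / (p + r)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# Hypothetical measurements of a constant level of 5.0.
estimates = kalman_1d([5.0, 5.0, 5.0, 5.0])
print(estimates)
```

With `q=0` the filter reduces to a recursive Bayesian average of the prior and the measurements, so each new reading pulls the estimate closer to the measured level.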
3. Computer Science
Machine Learning tools have their origins in research in Artificial Intelligence and Artificial Neural Networks. Tools like gradient descent, back propagation techniques, logistic regression, Rosenblatt’s perceptron, linear discriminant analysis, support vector machines, random forests, MCMC, principal component analysis/independent component analysis are today basic tools in the machine learning arsenal. We owe even the notions of reinforcement, supervised and unsupervised learning to the machine learning literature.
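As one small example from this list, Rosenblatt's perceptron can be sketched in a few lines. The version below is a minimal illustration with 0/1 labels and a unit learning rate; the function names and the AND-gate data are assumptions chosen for the example.

```python
def predict(w, b, x):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else 0


def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Train a perceptron with the classic mistake-driven update rule.

    Weights change only when a sample is misclassified. Training stops
    early once a full pass over the data produces no mistakes.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(samples, labels):
            err = target - predict(w, b, x)
            if err != 0:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


# Toy data (hypothetical): the logical AND of two binary inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

w, b = train_perceptron(samples, labels)
predictions = [predict(w, b, x) for x in samples]
print(predictions)
```

The update rule is guaranteed to converge only when the classes are linearly separable, as they are for AND; on data such as XOR the loop would simply run out of epochs.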
Theoretical insights into data analysis tools come from Vapnik–Chervonenkis theory (example: VC dimension) in computer science.
4. Econometrics
Stationary time series in econometric studies have been modeled using AR (auto-regressive), MA (moving average) and ARMA models. For modeling heteroscedastic time series (with varying variance) we have the ARCH and GARCH models introduced initially by Engle and Bollerslev, respectively. The econometrics literature has also given us Fama-MacBeth regression and regime switching models. These techniques are being used in quantitative asset management.
The common mathematical infrastructure that ties all these fields together includes probability, stochastic processes, ergodic theory and linear algebra. From the computer science perspective, we draw upon algorithm design, including topics from graph theory and dynamic programming. At a more abstract level, machine learning concepts are often explained by references to information theory.
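For readers who have not met these models, the simplest member of the AR family can be sketched as follows: an AR(1) process x[t] = phi * x[t-1] + e[t], simulated and then re-estimated by least squares. The function names, the seed and the parameter values are assumptions for illustration only.

```python
import random


def simulate_ar1(phi, n, sigma=1.0, seed=0):
    """Simulate n points of x[t] = phi * x[t-1] + e[t], e[t] ~ N(0, sigma^2).

    The series starts at 0; it is stationary only when abs(phi) < 1.
    """
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma))
    return x


def estimate_ar1(x):
    """Least-squares estimate of phi: regress x[t] on x[t-1], no intercept."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


series = simulate_ar1(phi=0.7, n=5000)
phi_hat = estimate_ar1(series)
print("estimated phi:", round(phi_hat, 3))
```

The estimate should land close to the true coefficient for a series of this length. In practice one would typically reach for a time series library rather than hand-rolled code, especially for ARMA or GARCH models, which have no such simple closed-form estimator.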
Acknowledgements: I compiled this post with inputs from Dr. Rajesh T Krishnamachari.
Comment
Nested VC models for continuous data are heavily used in the semiconductor industry.
JMP Variability Plot has cross or nested options, and automatically picks ANOVA, EML, or Bayes approach depending on the structure of the data. What do you recommend to show "% variance explained" by each component of variance (e.g., lot to lot, wafer to wafer, die to die and perhaps product to product at the top)?
Logistic regression was used first in statistics. MCMC is from statistics (or maybe physics) also. I think PCA was created by Pearson (also statistics), and widely used in the psychometrics literature.
References.
1. Origin Logistic Regression: http://papers.tinbergen.nl/02119.pdf
2. Origin MCMC: http://www.stat.ufl.edu/archived/casella/Papers/MCMCHistory.pdf
Interesting analysis. It is often difficult to track the way biological thought processes occur and their outcomes in tools, techniques and applications. It may be years before someone applies a tool or a technique from one field of application to an entirely different one. So, what appears to be true at a given time often changes to not necessarily so, some time later. Of course, that is why Wikipedia is constantly getting updated.
Hello Srividya,
I made my comment, because you say the article is about the origin of those techniques. There is no doubt these techniques nowadays are all around the place and the border between ML, Data Mining or Statistics is becoming more and more blurred.
Hello Roman, Thank you for your comment. You are correct, logistic regression and LDA have their origins in Statistics, but they have become fundamental techniques in the ML field.
Many of the techniques you attribute to Computer Science also come from statistics. This includes Logistic Regression, Discriminant analysis (based on Fisher's linear discriminant) and Principal Components Analysis (invented by Pearson). I think there is little doubt that both Pearson and Fisher were statisticians.
Quite interesting and relevant! Thanks for posting this.
Interesting!
© 2017 Data Science Central