To get a cohesive view of “data science”, it is useful to trace the origins of tools and techniques used by its current practitioners. These techniques have primarily emerged out of the following four fields:

  1. Statistics
  2. Electrical Engineering
  3. Computer Science
  4. Econometrics

Here is a brief introduction to the tools coming from each of these fields.

1. Statistics

Historically, statistics has been divided into two schools of thought – frequentist and Bayesian. From frequentist statistics, we get the notions of t-statistics and p-values. Bayesian reasoning is fundamental to many approaches in data analysis, both in model building (example: Naïve Bayes) and in hypothesis testing.

Modern statistical learning has given us concepts like ridge regression, LASSO and kernel methods. Even the notion of compressed sensing emerged first in the statistical learning literature, before being adopted and enhanced by the statistical signal processing community.
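To make the difference between ridge regression and LASSO concrete, here is a minimal sketch for the simplest possible case: a single feature and no intercept, where both estimators have a closed form. The function names and the scaling of the penalty terms (noted in the comments) are assumptions chosen for illustration, not taken from any particular library.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def fit_one_feature(x, y, lam):
    """Return (ols, ridge, lasso) slopes for y ~ b * x with penalty weight lam.

    Assumed objectives (illustrative scaling):
      ridge: 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2
      lasso: 0.5 * sum((y - b*x)^2) + lam * |b|
    """
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    ols = sxy / sxx
    ridge = sxy / (sxx + lam)                # shrinks, but never exactly to zero
    lasso = soft_threshold(sxy, lam) / sxx   # can be exactly zero
    return ols, ridge, lasso


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
for lam in (0.0, 14.0, 30.0):
    print(lam, fit_one_feature(x, y, lam))
```

With a large enough penalty the LASSO slope becomes exactly zero, while the ridge slope only shrinks toward zero. This is why LASSO is often used for variable selection.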

2. Electrical Engineering

Data science related concepts in EE primarily came from two sub-fields – discrete time signal processing and statistical signal processing. The language of transforms (Fourier, z, wavelet) and frequency selective filter design come from discrete time signal processing. From statistical signal processing (which includes parameter detection and signal estimation), we get the Cramér-Rao bounds and Kalman/Particle/Wiener filters, in addition to ML (maximum likelihood), MAP (maximum a posteriori) and MMSE (minimum mean squared error) estimators. Another important contribution to data science from EE is the famous Viterbi algorithm.
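As an illustration of the filtering tools mentioned above, here is a minimal sketch of a scalar Kalman filter. It assumes a random-walk state observed directly with additive noise. The function name and the parameters (x0, p0, q, r) are hypothetical and chosen only for this example.

```python
def scalar_kalman(measurements, x0, p0, q, r):
    """Filter noisy measurements of a scalar random-walk state.

    x0, p0: prior mean and variance of the state (assumed known)
    q: process-noise variance, r: measurement-noise variance
    Returns the list of filtered state estimates.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q              # predict: uncertainty grows by the process noise
        k = p / (p + r)        # Kalman gain: how much to trust the new measurement
        x = x + k * (z - x)    # update the estimate with the innovation (z - x)
        p = (1.0 - k) * p      # posterior variance after the update
        estimates.append(x)
    return estimates


print(scalar_kalman([2.0, 2.0, 2.0], x0=0.0, p0=1.0, q=0.0, r=1.0))
```

With q = 0 the state is treated as a constant, and the estimates move steadily from the prior toward the measured value as more observations arrive.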

3. Computer Science

Machine Learning tools have their origins in research in Artificial Intelligence and Artificial Neural Networks. Tools like gradient descent, back propagation techniques, logistic regression, Rosenblatt’s perceptron, linear discriminant analysis, support vector machines, random forests, MCMC, principal component analysis/independent component analysis are today basic tools in the machine learning arsenal. We owe even the notions of reinforcement, supervised and unsupervised learning to the machine learning literature.  

Theoretical insights into data analysis tools come from Vapnik–Chervonenkis theory (example: VC dimension) in computer science.

4. Econometrics

Stationary time series in econometric studies have been modeled using AR (auto-regressive), MA (moving average) and ARMA models. For modeling heteroscedastic time series (with time-varying variance), we have the ARCH and GARCH models, introduced initially by Engle and Bollerslev, respectively. The econometrics literature has also given us Fama-MacBeth regression and regime switching models. These techniques are being used in quantitative asset management.
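The difference between an AR model and an ARCH model can be sketched in a few lines. In the AR(1) recursion the level of the series depends on its own past. In the ARCH(1) recursion the variance of the next shock depends on the size of the previous shock. The function names, the zero starting values and the parameter values below are assumptions made for illustration only.

```python
import math
import random


def simulate_ar1(phi, shocks):
    """AR(1): x_t = phi * x_{t-1} + e_t, started from x_0 = 0."""
    x_prev = 0.0
    series = []
    for e in shocks:
        x_prev = phi * x_prev + e
        series.append(x_prev)
    return series


def simulate_arch1(omega, alpha, z):
    """ARCH(1): sigma2_t = omega + alpha * eps_{t-1}^2, eps_t = sqrt(sigma2_t) * z_t.

    The previous shock is taken to be 0 at the start (an illustrative choice).
    Returns (shocks, conditional_variances).
    """
    eps_prev = 0.0
    shocks, variances = [], []
    for zt in z:
        sigma2 = omega + alpha * eps_prev ** 2
        eps_prev = math.sqrt(sigma2) * zt
        shocks.append(eps_prev)
        variances.append(sigma2)
    return shocks, variances


rng = random.Random(0)  # fixed seed so the run is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]
ar_series = simulate_ar1(0.8, noise)
arch_shocks, arch_var = simulate_arch1(0.2, 0.5, noise)
```

The AR(1) series is stationary when |phi| < 1. In the ARCH(1) series, a large shock raises the conditional variance of the next one, which produces the volatility clustering seen in financial returns.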


The common mathematical infrastructure that ties all these fields together includes probability, stochastic processes, ergodic theory and linear algebra. From the computer science perspective, we draw upon algorithm design, including topics from graph theory and dynamic programming. At a more abstract level, machine learning concepts are often explained by references to information theory.

Acknowledgements: I compiled this post with inputs from Dr. Rajesh T Krishnamachari.


Comment by Michael Clayton on January 1, 2017 at 11:41am

Nested VC models for continuous data are heavily used in semiconductor industry.

JMP Variability Plot has cross or nested options, and automatically picks Anova, EML, or Bayes approach depending on structure of data. What do you recommend to show "% variance explained" by each component of variance (e.g., lot to lot, wafer to wafer, die to die, and perhaps product to product at top)?

Comment by Manoel Gladino on February 4, 2016 at 11:21am

Logistic regression was used first in statistics. MCMC is also from statistics (or maybe physics). I think PCA was created by Pearson (also statistics), and widely used in the psychometrics literature.


1. Origin Logistic Regression: http://papers.tinbergen.nl/02119.pdf

2. Origin MCMC: http://www.stat.ufl.edu/archived/casella/Papers/MCMCHistory.pdf

Comment by Bala R Subramanian on July 31, 2015 at 7:51am

Interesting analysis. It is often difficult to track the way biological thought processes occur and their outcomes in tools, techniques and applications. It may be years before someone might apply a tool or a technique from one field of application to an entirely different one. So, what appears to be true at a given time often changes to not necessarily so, some time later. Of course, that is why Wikipedia is constantly getting updated.

Comment by Roman Gavuliak on July 20, 2015 at 7:59am

Hello Srividya,

I made my comment because you say the article is about the origin of those techniques. There is no doubt these techniques nowadays are all around the place and the border between ML, Data Mining or Statistics is becoming more and more blurred.

Comment by Bellur Srikar on July 20, 2015 at 6:45am

Hi Srividya, I think you should also include Operations Research which has contributed a lot to algorithms etc. that you have attributed to CS and AI.

Comment by Srividya Kannan Ramachandran on July 20, 2015 at 5:51am

Hello Roman, Thank you for your comment. You are correct, logistic regression and LDA have their origins in Statistics, but they have become fundamental techniques in the ML field.

Comment by Roman Gavuliak on July 19, 2015 at 11:11am

Many of the techniques you attribute to Computer Science also come from statistics. This includes Logistic Regression, Discriminant analysis (based on Fisher's linear discriminant) and Principal Components Analysis (invented by Pearson). I think there is little doubt that both Pearson and Fisher were statisticians.

Comment by Dr. Vijay Srinivas Agneeswaran on July 16, 2015 at 10:07pm

Quite interesting and relevant! Thanks for posting this.

Comment by Milton Labanda on July 16, 2015 at 8:42am

Interesting !
