To get a cohesive view of “data science”, it is useful to trace the origins of the tools and techniques used by its current practitioners. These techniques have primarily emerged out of the following four fields:

1. Statistics
2. Electrical Engineering
3. Computer Science
4. Econometrics
Here is a brief introduction to the tools coming from each of these fields.
1. Statistics

Historically, statistics has been divided into two schools of thought – frequentist and Bayesian. From frequentist statistics, we get notions such as the t-statistic and the p-value. Bayesian reasoning is fundamental to many approaches in data analysis, both in model building (for example, Naïve Bayes) and in hypothesis testing.
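To make the Bayesian side concrete, here is a minimal sketch of Bayesian updating with Bayes' rule; all the probabilities (prior, sensitivity, false-positive rate) are invented for illustration:

```python
# Hypothetical diagnostic-test numbers, chosen only to illustrate Bayes' rule.
prior = 0.01           # P(condition)
sensitivity = 0.95     # P(positive | condition)
false_positive = 0.05  # P(positive | no condition)

# Bayes' rule: P(condition | positive) = P(positive | condition) P(condition) / P(positive)
evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(round(posterior, 3))  # → 0.161
```

Note how a seemingly accurate test still yields a modest posterior when the prior is small – the essence of Bayesian reasoning about evidence.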
Modern statistical learning has given us concepts like ridge regression, LASSO and kernel methods. Even the notion of compressed sensing first emerged in the statistical learning literature, before being adopted and enhanced by the statistical signal processing community.
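The shrinkage idea behind ridge regression can be shown in a toy one-dimensional setting (no intercept), where the penalized least-squares problem has a simple closed form; the data points below are made up for illustration:

```python
# Toy 1-D ridge regression, pure Python: minimize sum (y - b*x)^2 + lam * b^2.
# The closed-form solution shows how the penalty lam shrinks the slope.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

def ridge_coef(x, y, lam):
    # Closed form: b = sum(x*y) / (sum(x^2) + lam); lam = 0 recovers OLS.
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

print(ridge_coef(x, y, 0.0))   # OLS slope → 1.99
print(ridge_coef(x, y, 10.0))  # shrunk toward zero
```

Increasing `lam` pulls the coefficient toward zero, trading a little bias for lower variance – the same trade-off LASSO makes with an L1 penalty instead.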
2. Electrical Engineering

Data science related concepts in EE primarily came from two sub-fields – discrete time signal processing and statistical signal processing. The language of transforms (Fourier, z, wavelet) and frequency selective filter design come from discrete time signal processing. From statistical signal processing (which includes signal detection and parameter estimation), we get the Cramér-Rao bounds and Kalman/particle/Wiener filters, in addition to the ML, MAP and MMSE estimators. Another important contribution to data science from EE is the famous Viterbi algorithm.
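As a sketch of the Viterbi algorithm, here is a minimal decoder for a two-state hidden Markov model; the states, observations and probabilities are the standard textbook-style toy numbers, invented here for illustration. It works in log space to avoid numerical underflow:

```python
# Minimal Viterbi decoding for a toy 2-state HMM (all probabilities invented).
import math

states = ["Rain", "Sun"]
start = {"Rain": 0.6, "Sun": 0.4}
trans = {"Rain": {"Rain": 0.7, "Sun": 0.3},
         "Sun":  {"Rain": 0.4, "Sun": 0.6}}
emit  = {"Rain": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
         "Sun":  {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # V[s]: best log-probability of any state path ending in s; path[s]: that path.
    V = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V_new, path_new = {}, {}
        for s in states:
            # Pick the predecessor maximizing probability of reaching s.
            prev = max(states, key=lambda p: V[p] + math.log(trans[p][s]))
            V_new[s] = V[prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            path_new[s] = path[prev] + [s]
        V, path = V_new, path_new
    return path[max(states, key=lambda s: V[s])]

print(viterbi(["walk", "shop", "clean"]))  # → ['Sun', 'Rain', 'Rain']
```

The dynamic-programming recursion keeps only the best path into each state at each step, which is why decoding is linear in the sequence length rather than exponential.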
3. Computer Science
Machine learning tools have their origins in research on artificial intelligence and artificial neural networks. Tools like gradient descent, back propagation, logistic regression, Rosenblatt’s perceptron, linear discriminant analysis, support vector machines, random forests, MCMC, and principal/independent component analysis are today basic tools in the machine learning arsenal. We owe even the notions of reinforcement, supervised and unsupervised learning to the machine learning literature.
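To illustrate the oldest item on that list, here is a sketch of Rosenblatt's perceptron learning rule on a small linearly separable problem (logical AND); the data, learning rate and epoch count are illustrative choices:

```python
# Rosenblatt's perceptron on the logical-AND problem (toy, pure Python).
def train_perceptron(data, epochs=20, lr=1.0):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - pred
            # The perceptron rule: update only on mistakes, nudging the
            # decision boundary toward the misclassified point.
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
predict = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
print([predict(x1, x2) for (x1, x2), _ in AND])  # → [0, 0, 0, 1]
```

The perceptron convergence theorem guarantees that this loop terminates with a separating boundary whenever the data are linearly separable.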
Theoretical insights into data analysis tools come from Vapnik–Chervonenkis theory (for example, the VC dimension) in computer science.
4. Econometrics

Stationary time series in econometric studies have been modeled using AR (auto-regressive), MA (moving average) and ARMA models. For modeling heteroscedastic time series (with varying variance) we have the ARCH and GARCH models, introduced initially by Engle and Bollerslev, respectively. The econometrics literature has also given us Fama-MacBeth regression and regime-switching models. These techniques are being used in quantitative asset management.
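An AR(1) model is simple enough to sketch end to end: simulate x_t = phi * x_{t-1} + eps_t and recover phi by regressing x_t on x_{t-1}. The coefficient, sample size and noise scale below are illustrative choices:

```python
# Simulate an AR(1) process and estimate its coefficient by least squares.
import random

random.seed(0)
phi_true = 0.8
x = [0.0]
for _ in range(5000):
    # AR(1) recursion with standard normal innovations.
    x.append(phi_true * x[-1] + random.gauss(0, 1))

# OLS estimate of phi: regress x_t on x_{t-1} (no intercept).
num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
phi_hat = num / den
print(phi_hat)  # close to 0.8 for a sample this long
```

ARCH and GARCH extend this idea by letting the innovation variance itself follow an autoregressive-style recursion, which is what captures volatility clustering in financial data.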
The common mathematical infrastructure that ties all these fields together includes probability, stochastic processes, ergodic theory and linear algebra. From the computer science perspective, we draw upon algorithm design, including topics from graph theory and dynamic programming. At a more abstract level, machine learning concepts are often explained by reference to information theory.
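As one small taste of that algorithmic toolkit, here is a compact Dijkstra shortest-path routine, combining graph representation with a greedy, priority-queue-driven design; the graph itself is made up for illustration:

```python
# Dijkstra's shortest-path algorithm on a toy weighted directed graph.
import heapq

def dijkstra(graph, source):
    # graph: node -> list of (neighbor, edge_weight); weights must be non-negative.
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 5)], "C": [("D", 1)]}
print(dijkstra(graph, "A"))  # → {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

The same greedy-plus-priority-queue pattern, like the dynamic programming in Viterbi decoding above, recurs throughout the data science toolkit.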
Acknowledgements: I compiled this post with inputs from Dr. Rajesh T Krishnamachari.