You sometimes hear from old-fashioned statisticians that data scientists know nothing about statistics, and that they, the statisticians, know everything. Here we show that it is actually the exact opposite: data science has its own core of statistical science research, in addition to data plumbing, statistical APIs, and business / competitive intelligence research. We highlight 11 major data science contributions to statistical science. I am not aware of any statistical science contribution to data science, but if you know of one, you are welcome to share.
Here's the list:
All this research is available for free.
Comment
Well and diplomatically said, Prof. Hart.
Dear Vincent,
Having served as a Statistician analysing Data for 40 years, I find a touch of arrogance in the statement that Statisticians "... know everything". For the lucid Data Scientists (if you permit, I safely consider myself one), it is indeed a contradiction, since Statisticians (whether Bayesians or non-Bayesians) know very well that the probability of "knowing everything" is zero. So, perhaps I should take the prior of Carlos to inform my predictive probability that you made those statements "tongue in cheek" to spur a discussion. If the conditional is true, then well done: you succeeded. If it is false, then you should seriously revisit your statements and back them up with scientific evidence; perhaps a good starting point is the works of Sir Harold Jeffreys.
Cheers and take care
Dr Vincent Micali
MSc (Warwick), PhD (UFS)
Dear Vincent,
"unaware of any statistical science contribution to data science" -> Tongue in cheek, right :) ?
I do acknowledge that data abundance and (more importantly) the universal availability of computers have posed a tremendous challenge to mathematics and statistics. Anyone can modify an existing algorithm that fails, and "make it work" for his/her particular case - or even invent "new" ones.
But this does not mean this is science... in the sense that showing that an algorithm works, even in many situations, does not explain why it works or how it could fail (i.e. what underlying assumptions are required for it to work).
Yes, it is published as research - but, believe me, it is more an _unsolved problem_ for a mathematician or statistician than well founded finished research.
If you allow me the analogy, it is like ancient Babylon, where it was common knowledge among builders that certain Pythagorean triangles existed, but it required formal geometry to explain what was really going on.
So, you want recent contributions? Here is one: search for "functional data analysis" in Google Scholar ... enjoy :)
Anyway, data scientists and everybody ... yes, keep using computers, your challenges are welcome (but even better if you join the "theoretical" camp and help to explain why...)
Kind regards
Carlos
G'day Vincent:
as always I find your comment both interesting and insightful. I have always regarded myself as a data-analyst – even before the modern idea was invented [I'm 80]. I have also applied statistical reasoning for the past 60 years to data analysis. I do agree that data scientists have contributed significantly to statistics in the sense I understand the field. However, to be 'unaware of any statistical science contribution to data science' must have been ghost-written. It is not you as I have read you over the past few years! I assume you are being provocative simply to get something going – which is my own method of teaching.
Mathematical methods have nothing to do with it. The underlying concepts of statistics are what statistical science has contributed to data analysis.
Start with:
“What is the chance of a random sample taken from a location being simply a variant within the population of interest, or alternatively, that it is from a totally different population?”
Or the application of basic functions:
“mean(), median(), sd(), var(), min(), max(), range(), summary(), sort(), order(), rank(), exp(), log(), sin(), cos(), tan() [radians], length(), rev(), sum(), cumsum(), prod(), cumprod(), round(), ceil(), floor(), signif(), which(), which.max(), any(), all(), and mode()”
Or the basic model:
“Y = (something) + (error of measurement), where Y is said to be the dependent variable that is being measured, and (something) is some relationship among the so-called independent variables that control or predict Y”
Or
“Rejecting the hypothesis vs failing to reject the hypothesis”.
“To judge the reliability of any experimental result it must be compared with an estimate of its error, i.e. a test of significance. The test of significance separates the subjective guess from fact [more correctly, the failure to reject a hypothesis pertaining to a fact].”
Or:
“The innate control of error by multiple replication”. This provides a major advantage to, and is a principal reason for, the success of modern 'big data' analysis. It leads to the idea that in data analysis we are dealing with the total population, not a statistical sample [we both know that is not true, but it suffices to justify what is done].
I could go on, but you and most of your readers know this stuff already. Data analysis has grown, but it still has the underpinnings of statistical analysis. For the future of statistical analysis I advise keeping a close eye on deep learning methods.
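The point about replication can be illustrated with a small simulation. This is only a sketch under stated assumptions: the measurement process (true value 10.0, Gaussian noise with standard deviation 2.0), the sample sizes, and the helper names are hypothetical and chosen for illustration. It uses Python's standard library rather than the R-style functions listed above.

```python
import random
import statistics


def standard_error_of_mean(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(sample) / len(sample) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable


def measure(n):
    """Hypothetical measurement process: true value 10.0, noise sd 2.0."""
    return [rng.gauss(10.0, 2.0) for _ in range(n)]


small = measure(25)    # few replicates
large = measure(2500)  # many replicates

se_small = standard_error_of_mean(small)
se_large = standard_error_of_mean(large)

print(f"n=25:   mean={statistics.mean(small):.3f}, SE={se_small:.3f}")
print(f"n=2500: mean={statistics.mean(large):.3f}, SE={se_large:.3f}")
```

With 100 times as many replicates, the standard error shrinks by roughly a factor of 10, which is the sense in which replication controls error. It does not make the sample the whole population.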
Luv and kisses as always,
George Hart,
Professor emeritus,
LSU.
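As a rough illustration of the basic model and the test of significance described in the comment above, the sketch below simulates Y = (something) + (error of measurement) with a straight line as the "something", fits it by least squares, and then asks whether the fitted slope could plausibly arise with no relationship at all. The permutation test is one choice among many and is not attributed to the commenter; the coefficients, noise level, and function names are assumptions made for the example.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def permutation_p_value(x, y, n_perm=2000, seed=1):
    """Two-sided permutation test of the hypothesis 'slope = 0'.

    Shuffling y breaks any link to x, so the shuffled slopes show
    what pure measurement error would produce.
    """
    rng = random.Random(seed)
    observed = abs(fit_line(x, y)[0])
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(fit_line(x, shuffled)[0]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: Y = 1 + 2 * X + Gaussian error (sd 1).
rng = random.Random(42)
x = [i / 10 for i in range(50)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

slope, intercept = fit_line(x, y)
p_value = permutation_p_value(x, y)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, p={p_value:.4f}")
```

A small p-value leads to rejecting the hypothesis of no relationship; a large one only means failing to reject it, which is the distinction the comment draws.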
There's a good article on random number generation by Prof. Cleve Moler from MathWorks here:
© 2016 Data Science Central