It is interesting to read what some statisticians write about data science on the American Statistical Association (ASA) blog. Most of us don't care much about our job title - there are many breeds of statisticians and data scientists, after all, and the two overlap to some extent. While I was once a statistician, I now call myself a data scientist or business scientist. Anyway, below are some extracts from very lively and interesting discussions taking place on the ASA blog.
Tommy Jones posted The Identity of Statistics in Data Science on the American Statistical Association (ASA) website in December 2015. In his long and very interesting article, he wrote (this is just a tiny extract):
Judging by current statistics curricula, statistics is more closely tied to the mathematics of probability than to fundamentals of data management. [...] As models have become more accurate, they have also become more complex.
Dogling Yan commented:
In that data analyst job, I barely used any statistical models because people don't really care about p-values. Also, with the size of current datasets, p-values are always very small. The models and analysis methods that most people learned at school are not very useful, since simple models and more valid, more complex models tend to give the same conclusion when the sample size is large.
As a data scientist, I work on making models simpler, not more sophisticated - actually, on replacing models with model-free, data-driven systems - fit for black-box processing of big data in production mode. That is, robustness is more important than 100% accuracy, especially if your data is only 70% accurate. I also work on designing a new statistical framework that is free of mathematics, traditional probability theory, random variables, and so on, so that anyone who knows Excel can learn it - even for computing confidence intervals or building more elaborate forecasting systems. It will be published in my upcoming book, Data Science 2.0.
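The framework itself is not described in this post, so the details must wait for the book. But as an illustration of the general idea - confidence intervals obtained without probability theory, using nothing more than resampling and sorting - here is a minimal sketch. The statistic, sample data, and parameter choices below are all hypothetical, not taken from the author's method.

```python
import random

def percentile_ci(data, stat=lambda xs: sum(xs) / len(xs),
                  n_boot=2000, level=0.90, seed=42):
    """Resampling-based confidence interval: resample the data with
    replacement, recompute the statistic each time, sort the results,
    and read off the percentiles. No distributional assumptions."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(stat([rng.choice(data) for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int((1 - level) / 2 * n_boot)]        # 5th percentile
    hi = stats[int((1 + level) / 2 * n_boot) - 1]    # 95th percentile
    return lo, hi

data = [12, 15, 9, 20, 14, 11, 18, 13, 16, 10]  # toy sample
low, high = percentile_ci(data)
```

Every step here - resample, recompute, sort - could be done in a spreadsheet, which is the spirit of the "anyone who knows Excel" claim.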
Jennifer Lewis Priestley also posted on ASA, in January 2016: Data Science: The Evolution or the Extinction of Statistics?
In this article, she wrote:
While data scientists can do a great many things I can’t do—mainly in the areas of coding, API development, web scraping, and machine learning—they would be hard pressed to compete with a PhD student in statistics in supervised modeling techniques or variable reduction methods.
Read my article about a fast, efficient, combinatorial algorithm for feature selection that uses predictive power to jointly select variables. It is the data science approach to variable reduction and variable generation. Likewise, supervised modeling - which also belongs to machine learning - is not foreign to data scientists. Read about my automated indexation/tagging algorithm, used for taxonomy creation/maintenance and cataloguing: it clusters n data points in O(n) and can cluster billions of web pages in very little time. It is also used to turn unstructured data into structured data.
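The indexation/tagging algorithm itself is only referenced here, not specified, so the following is a generic sketch of how clustering can run in O(n) at all: assign each item a signature (a tag set) and group by hashing, touching each item exactly once. The signature function below - the two most frequent words - is a crude hypothetical stand-in for whatever a real tagger would produce.

```python
from collections import defaultdict

def one_pass_cluster(docs):
    """Group documents by a hashed signature in a single pass.
    Total work is O(n): each document is processed once, and each
    dictionary lookup/insert is O(1) on average."""
    clusters = defaultdict(list)
    for doc in docs:
        words = doc.lower().split()
        counts = defaultdict(int)
        for w in words:
            counts[w] += 1
        # Signature = the two most frequent words, canonically ordered.
        top2 = sorted(counts, key=counts.get, reverse=True)[:2]
        sig = tuple(sorted(top2))
        clusters[sig].append(doc)
    return dict(clusters)

docs = ["big data systems", "big data pipelines big",
        "cat videos online", "cat videos cat"]
groups = one_pass_cluster(docs)
```

The point is the cost structure, not the toy signature: because no pairwise distances are computed, the method avoids the O(n^2) bottleneck of classical clustering, which is what makes billions of pages feasible.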
And here is my reply to Peter, who commented on LinkedIn that "the feature selection method mentioned in the blog is still a heuristic method i.e. no guarantee to find the optimal subset of variables."
Peter, data scientists are usually interested in local optima that are easy to detect and provide almost the same yield as the global optimum. The global optimum has two drawbacks: (1) it could be an unstable optimum, and (2) it might take far more time to compute if the data set is immense.
About the author: Vincent Granville worked for Visa, eBay, Microsoft, Wells Fargo, NBC, a few startups, and various other organizations to optimize business problems, boost ROI, or develop ROI attribution models, creating new techniques and systems to leverage modern big data and deliver added value. Vincent holds several patents, has published in top scientific journals, raised VC funding, and founded a few startups. The most recent one - Data Science Central - is growing exponentially and delivers a substantial profit margin. Vincent also manages his own self-funded research lab, focused on simplifying, unifying, modernizing, automating, scaling, and dramatically optimizing statistical techniques. His focus is on producing robust, automatable tools, APIs, and algorithms that can be used and understood by the layman, and at the same time adapted to modern big, fast-flowing, unstructured data. Vincent holds a postgraduate degree from Cambridge University.