
# What statisticians think about data scientists

It is interesting to read what some statisticians write about data science on the American Statistical Association (ASA) blog. Most of us don't care about our job title - there are so many breeds of statisticians and data scientists after all - and they do overlap to some extent. While I was once a statistician, I now call myself a data scientist or business scientist. Anyway, below are some extracts from very lively and interesting discussions taking place on the ASA blog.

Tommy Jones posted "The Identity of Statistics in Data Science" on the American Statistical Association (ASA) website in December 2015. In his long and very interesting article, he wrote (this is just a tiny extract):

Judging by current statistics curricula, statistics is more closely tied to the mathematics of probability than to fundamentals of data management. [...] As models have become more accurate, they have also become more complex.

Dogling Yan commented:

In that data analyst job, I barely used any statistical models because people don’t really care about p-values. Also, with the size of current datasets, p-values are always very small. The models, analysis methods that most people learned at school are not very useful since the simple model and more valid and complex models tend to give the same conclusion when sample size is large.
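To illustrate the point about p-values and sample size, here is a minimal sketch. It assumes a two-sample z-test with a known standard deviation, and it assumes the observed difference in means equals a tiny true effect (0.01 standard deviations). The function name and all numbers are hypothetical, chosen only for illustration. Note that the p-value shrinks with sample size only when some real effect exists, however small; under an exactly true null hypothesis, p-values do not shrink as n grows.

```python
import math


def two_sided_p_value(effect, sigma, n_per_group):
    """Two-sided p-value of a two-sample z-test, assuming the observed
    difference in means equals `effect` (an idealized, noise-free case)."""
    # Standard error of the difference between two group means.
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    z = abs(effect) / standard_error
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


effect, sigma = 0.01, 1.0          # hypothetical: a practically negligible effect
sizes = [10**3, 10**4, 10**5, 10**6]
p_values = [two_sided_p_value(effect, sigma, n) for n in sizes]

for n, p in zip(sizes, p_values):
    print(f"n per group = {n:>9,d}   p-value = {p:.3g}")
```

The same negligible effect is far from significant with a thousand observations per group and overwhelmingly "significant" with a million, which is why a small p-value on a very large data set says little about practical importance.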

My comment:

As a data scientist, I work on making models (actually, absence of models, but instead data-driven systems) simpler, not more sophisticated, and fit for black-box processing of big data in production mode. That is, robustness is more important than 100% accuracy, especially if your data is 70% accurate. I also work on designing a new statistical framework that is free of mathematics, traditional probability theory, random variables, and so on - so that anyone who knows Excel can learn it, even to compute confidence intervals or build more elaborate forecasting systems. It will be published in my upcoming book, Data Science 2.0.

Jennifer Lewis Priestley also posted on ASA, in January 2016: Data Science: The Evolution or the Extinction of Statistics?

While data scientists can do a great many things I can’t do—mainly in the areas of coding, API development, web scraping, and machine learning—they would be hard pressed to compete with a PhD student in statistics in supervised modeling techniques or variable reduction methods.

My comment:

Read my article about a fast, efficient, combinatorial algorithm for feature selection using predictive power to jointly select variables. It is the data science approach to variable reduction and variable generation. Likewise, supervised modeling - which also belongs to machine learning - is not foreign to data scientists. Read about my automated indexation/tagging algorithm, used for taxonomy creation/maintenance or cataloguing: it performs clustering of n data points in O(n), and can cluster billions of web pages in very little time. It is also used to turn unstructured data into structured data.

And my reply to someone (Peter) who commented on LinkedIn, saying that "the feature selection method mentioned in the blog is still a heuristic method i.e. no guarantee to find the optimal subset of variables."

Peter, data scientists are usually interested in local optima that are easy to detect and that provide almost the same yield as the global optimum. The global optimum has two drawbacks: (1) it could be an unstable optimum, and (2) it might take far more time to compute if the data set is immense.
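The trade-off discussed with Peter can be sketched on a toy problem. The code below is not the feature selection algorithm from my article; it swaps in plain greedy forward selection as a generic heuristic and compares it with an exhaustive search over all subsets of the same size. The scoring function (individual gains minus pairwise redundancy penalties) is a hypothetical stand-in for predictive power, and all names and sizes are assumptions made for illustration.

```python
import itertools
import random


def make_score(n_features, seed=0):
    """Build a toy 'predictive power' function (hypothetical stand-in)."""
    rng = random.Random(seed)
    gain = [rng.random() for _ in range(n_features)]
    # Redundancy penalty for every pair of features, keyed by (i, j) with i < j.
    overlap = {
        (i, j): 0.5 * rng.random() * min(gain[i], gain[j])
        for i, j in itertools.combinations(range(n_features), 2)
    }

    def score(subset):
        s = sorted(subset)
        total = sum(gain[i] for i in s)
        total -= sum(overlap[pair] for pair in itertools.combinations(s, 2))
        return total

    return score


def greedy_select(score, n_features, k):
    """Heuristic: add, one at a time, the feature that most improves the score."""
    chosen, evaluations = [], 0
    for _ in range(k):
        best_feature, best_value = None, float("-inf")
        for f in range(n_features):
            if f in chosen:
                continue
            value = score(chosen + [f])
            evaluations += 1
            if value > best_value:
                best_feature, best_value = f, value
        chosen.append(best_feature)
    return sorted(chosen), score(chosen), evaluations


def exhaustive_select(score, n_features, k):
    """Global optimum: evaluate every subset of size k."""
    best_subset, best_value, evaluations = None, float("-inf"), 0
    for subset in itertools.combinations(range(n_features), k):
        value = score(subset)
        evaluations += 1
        if value > best_value:
            best_subset, best_value = subset, value
    return list(best_subset), best_value, evaluations


score = make_score(n_features=15, seed=0)
greedy_subset, greedy_value, greedy_evals = greedy_select(score, 15, 5)
exact_subset, exact_value, exact_evals = exhaustive_select(score, 15, 5)

print(f"greedy:     score = {greedy_value:.4f} after {greedy_evals} evaluations")
print(f"exhaustive: score = {exact_value:.4f} after {exact_evals} evaluations")
```

With 15 features and subsets of size 5, the greedy search needs 65 score evaluations while the exhaustive search needs 3,003, and that gap grows combinatorially with the number of features. The greedy result can never exceed the exhaustive one; how close it comes depends on the scoring function. The sketch illustrates only the computing-time drawback (2), not the stability issue (1).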

About the author: Vincent Granville worked for Visa, eBay, Microsoft, Wells Fargo, NBC, a few startups and various organizations, to optimize business problems, boost ROI or to develop ROI attribution models, developing new techniques and systems to leverage modern big data and deliver added value. Vincent owns several patents, published in top scientific journals, raised VC funding, and founded a few startups. The most recent one - Data Science Central - is growing exponentially, and delivers a substantial profit margin. Vincent also manages his own self-funded research lab, focusing on simplifying, unifying, modernizing, automating, scaling, and dramatically optimizing statistical techniques. Vincent's focus is on producing robust, automatable tools, APIs and algorithms that can be used and understood by the layman, and at the same time adapted to modern big, fast-flowing, unstructured data. Vincent is a post-graduate from Cambridge University.


Comment by Carlos Aya on February 11, 2016 at 9:15am

Carlos, mathematics has been enriched over 2000 years by looking at data from real problems. Topology is just an abstraction that has proved useful. Is it possible to find new concepts in topology (or statistics or maths) by looking at "application data coming from users all around the world"? Absolutely! For example, I came across a paper describing wavelet approximations in network graphs (to name one example), and there are countless others.

To put it simply: mathematics is the true data science... it has been, always. That is why physics, chemistry, building, engineering, mechanics, computer science and everything today is based on it.

It is a huge huge toolset, and is still growing... "Nothing new"? Really? You have to read a little bit more, although I understand it is overwhelming even for full time professionals.

Comment by Carlos Quijano San Martín on February 9, 2016 at 2:01pm

Thanks for the feedback. After trying to write you a good reply, I gave up; it is very difficult to do. I will try anyway, but very briefly: can you imagine yourself discovering new concepts in topology by researching application data coming from users all around the world? That is just the reverse of applying current topological techniques to characterize users' behaviour, which is what mathematicians sell us as Data Science. For me, this is not Data Science. This is statistics applied to Big Data. Nothing new. Hope my comment helps; my wife is claiming I should be in bed already, and that's a fact too.

Comment by Carlos Aya on February 9, 2016 at 10:14am

Carlos, it seems you are not supporting your claims with data. Sad.

Comment by Sione Palu on February 9, 2016 at 10:09am

Quote: "I think that the new self-appointed data scientists will learn a lot from what we did in the past with high success and may be able to apply similar techniques now, but in other fields that require them."

Data science is a multidisciplinary domain that itself includes bioinformatics, and the idea that the bioinformatics field is somehow the epitome of data science is nonsense. Sure, there are new analytical techniques that originated in bioinformatics, but this is no different from the development of new analytical techniques in other domains that contribute to data science as a whole. Each discipline under the umbrella of data science contributes to the advance of data science, and techniques from each discipline tend to be adopted across the board in the field of analytics / data science.

A good example is that an engineer who has specialized in designing DSPs (digital signal processors) but has no prior experience in bioinformatics can pretty much learn to analyze bioinformatics data by using knowledge he already possesses from his domain in electronic engineering (e.g., wavelet techniques from signal processing).

"Wavelet-based detection of transcriptional activity on a novel Staphylococcus aureus tiling microarray"

http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-21...

So, data scientists can learn a lot just from following the articles posted on Data Science Central by its authors, since they point practitioners to where and when new techniques become available in open source, and show how to use those techniques via tutorials. There is no need to dig into bioinformatics unless a data scientist wants to apply his/her knowledge in that domain.

Comment by Carlos Quijano San Martín on February 9, 2016 at 5:29am

I worked in bioinformatics for 10 years before moving to financial systems. I recommend that you get in contact with bioinformaticians, not mathematicians or statisticians, if you want to know what a data scientist should be like. I also recommend that you start reading papers about genomic data analysis; you don't need to understand the biological part, but you do need to understand the data problem. And if you learn a little molecular biology, so much the better. There is no other scientific corner where data science has been applied as well as it has been in molecular biology during the genomic revolution. I think that the new self-appointed data scientists will learn a lot from what we did in the past with high success and may be able to apply similar techniques now, but in other fields that require them.

Comment by Carlos Quijano San Martín on February 9, 2016 at 5:12am

In my opinion, Data Science is the scientific field that researches how to manage and extract new knowledge from big data sources. It is only 10% statistics, 10% mathematics and 10% informatics, with the remaining 70% being new scientific techniques, as yet undiscovered, that we need to start discovering. These new techniques will change the world, affecting statistics, mathematics and informatics in such a way that they will never be the same. That's because it is a new scientific field. Data scientists have existed for approximately 20 years, since big data sources started to be developed with the rise of informatics.

Comment by Carlos Aya on January 19, 2016 at 1:36pm

Sione, wavelet theory grew out of the work of physicists, engineers and mathematicians - mainly those working on PDEs and approximation theory. Since wavelets are within the realm of mathematics, statisticians naturally picked them up :)

The intro to this book has a relatively good history of the developments that led to its current shape. But beware, it is still in the making, as multiple dimensions have their own challenges.

Comment by Sione Palu on January 19, 2016 at 12:50pm

David Donoho and his group are awesome statisticians.

I have used their Matlab wavelet toolkit from Stanford, called Wavelab, many times in the past for certain data-analytics tasks, especially on temporal data.

"Wavelab"

http://statweb.stanford.edu/~wavelab/

Wavelets were predominantly a tool of physicists and engineers (signal processing engineers) for a long time, but they have now been picked up in different disciplines, from NLP, statistics and machine learning to other analytics-based fields.

As I said in my previous comments, the boundary between data science (which is a relatively new term) and statistics has blurred in recent years. Therefore, data scientists should respect statisticians and stop bashing the whole field and profession of statistics.

Comment by Carlos Aya on January 19, 2016 at 12:18pm

You all should read D. Donoho's "50 years of data science" too.

http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

Comment by Sione Palu on January 15, 2016 at 2:34pm

The boundary between data science & statistics is blurred. Each discipline should respect the other's field because they hugely overlap.