Some statisticians have a biased view on data science

Most statisticians are great professionals, working on various data-intensive projects, and they don't care about their job title. You can say the same about data scientists, and me in particular. However, there is a small cluster of statisticians - Andrew Gelman seems to be their leader and their only influencer - who have been challenging us, even publicly insulting us recently.

So my criticism here applies only to this small clique of practitioners, most of them Ph.D. statisticians who have worked for the same non-profit organizations for a long, long time, with limited experience of and exposure to the real world. Some great statisticians such as Diego Kuonen, while strongly and justly defending statistical science, don't belong to that clique.

This incident started a few months ago, when these die-hard statisticians claimed that we, data scientists, know nothing about statistics, and that they know everything. Their science is an arcane mix of thousands of non-unified techniques that are kept almost secret, so that they can continue doing their costly man-made analyses with no regard to ROI. Many consider applied statistics a plague.

They even claimed that I do not know anything about statistics. That was the starting point. I acknowledged that indeed, I did not know anything, because my definition of statistical science is totally different from theirs: it's all about automation, model-free predictions, and big data applications, safe to use by the non-initiated.

Now they've changed their mind, and they claim that actually, what I do is statistical science. Yet their old statistics is a small portion of data science: business hacking, domain expertise, machine learning, data engineering, new statistics and core data science being the main components of what I do. But they even went as far as to say that my model-free confidence intervals were wrong, when they were proved to be equivalent to the old statistical method. Likewise, they claimed that my Jackknife regression was an old technique developed by Bradley Efron. Yet it has nothing to do with Efron or re-sampling. Andrew Gelman himself claimed that I stole ideas from his research (it is a classic syndrome for famous academic statisticians: they believe that they are the only ones having original ideas, and that anything remotely close to what they do is plagiarism).

What a bunch of arrogant, close-minded people! I don't even read Andrew Gelman's publications. They are disseminated to a very small audience in obscure journals that pretty much no mainstream people read, and they are written in convoluted English. I might sometimes re-invent statistics, but it is easier, faster, and better than wasting days looking for old publications and digesting / translating / adapting them to modern data.

What these few statisticians don't understand is that these journals are no longer the outlet for many modern scientists, including me. Just compare my article on data videos with one published by a traditional scientist, in a top traditional journal, independently and at the same time. You will see that mine is far more useful, provides code to make much faster, longer videos, and is, in essence, of superior quality. You may disagree, and you are welcome to say so in the comment section below. It is not that I tried to submit and got rejected; I actually never send my material to these journals anymore. I no longer have time for this, nor to write unpaid book reviews or peer reviews. Their motto is publish or perish (and they must please their grantors, so innovation is dangerous for them), while my motto is bring value, make it simple.

Views: 8711

Comment

Comment by Dalila Benachenhou on January 28, 2016 at 1:53pm

I'm a statistician and a computer scientist (degrees in both fields). Still, my professors in CS had physics backgrounds. What I have learned is that, like in any field, it depends on the specialty: many CS people have little background in data modeling (most of my friends were in IT, while I was in AI), and some statisticians are in clinical trials, where you have a specific problem and a protocol to follow.

I also met many physicists whose concentration is in Machine Learning, and few in dark matter. In fact, many discoveries and new data modeling approaches were developed by physicists. Boltzmann Machine, anyone? So it all depends on who you talk to and what their concentration is.

Here is my one minute degree in statistics:

  • Fewer independent predictors are always better than many dependent predictors.
  • Minimize your bias and your variance. In reality, in many cases you will be able to do only one or the other. Although I saw a proof stating that the Random Forest classifier does both.
  • Zero correlation doesn't imply independence.
  • The median is a better representative of the data when their distribution is not normal.
  • The median is more robust to outliers than the mean.
  • The Cauchy distribution has no mean and no variance.
  • Without the Central Limit Theorem, many parametric theories and approaches would cease to exist.
  • The only distribution you need is the Normal distribution: with a large enough sample, the sample means of all the others (those with finite variance) converge to it. How do I know this? I had to do the proof when taking Math Stat.
  • I left out one or two others, but those are for another time.
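
Several of the points in the list above can be checked numerically. The following is only a minimal sketch using the Python standard library; the function names, sample sizes, and seed are illustrative assumptions, not part of the original comment.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def zero_correlation_example():
    """y is fully determined by x (y = x^2), yet the correlation is zero."""
    xs = [-2, -1, 0, 1, 2]
    ys = [x * x for x in xs]
    return pearson(xs, ys)


def outlier_effect(data, outlier):
    """Return (shift in mean, shift in median) after adding one outlier."""
    contaminated = data + [outlier]
    mean_shift = statistics.fmean(contaminated) - statistics.fmean(data)
    median_shift = statistics.median(contaminated) - statistics.median(data)
    return mean_shift, median_shift


def uniform_sample_means(n, replicates, rng):
    """Means of `replicates` samples of size n drawn from Uniform(0, 1)."""
    return [
        statistics.fmean(rng.random() for _ in range(n))
        for _ in range(replicates)
    ]


def cauchy_sample(n, rng):
    """Standard Cauchy draws via the inverse-CDF transform."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, for repeatability only

    print("corr(x, x^2):", zero_correlation_example())
    print("mean/median shift:", outlier_effect([10, 11, 12, 13, 14], 1000))

    means = uniform_sample_means(n=50, replicates=2000, rng=rng)
    print("CLT: mean of means", statistics.fmean(means),
          "sd of means", statistics.stdev(means),
          "theory sd", math.sqrt(1 / 12 / 50))

    # The Cauchy sample mean does not settle as n grows; the median does.
    for n in (100, 10_000):
        draws = cauchy_sample(n, rng)
        print(n, "Cauchy mean", statistics.fmean(draws),
              "median", statistics.median(draws))
```

The Cauchy part is the useful counterweight to the Central Limit Theorem bullet: the theorem needs a finite variance, which is why the sample mean is unreliable there while the median stays close to the center.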
Comment by Sione Palu on January 3, 2015 at 1:25pm

I don't have much beef with statisticians, but I have come across 2 or 3 statisticians in the past who seemed very arrogant to me. When they asked about my background (my area is physics), they indirectly implied that because I wasn't formally trained in statistics, they were the overlords of statistical data analytics and I was not. It's the sort of meeting you encounter frequently when you attend big data user group seminars or conferences. I reminded those individuals that statistical theories & algorithms are not the sole domain of statisticians, and that they hold no monopoly over them: these span a wide range of disciplines (computing & machine learning, pure & applied math, engineering, physics, etc.). For example, the Monte Carlo method was originally developed by physicists (John von Neumann & colleagues) during the secret Manhattan Project to build the US atom bomb in World War 2. That's just one example out of many techniques available today that originated in disciplines other than statistics.

Here's a fact. The majority of statisticians read only statistical journals and hardly ever branch out to quantitative journals in other fields (physics, pure math, engineering, computing and so forth). This limits their knowledge of the new statistical theories currently being advanced in other disciplines. I know this because some of the statisticians I frequently chat with at conferences or seminars sometimes ask how I solve a certain problem (algorithm-wise), one similar to a problem they have or are currently trying to solve. I talk about certain techniques/algorithms that I have used for specific problems, and most of the time they have never heard of them, even though those techniques have been available in the literature for the last 7 years or so. They were published in other literatures (e.g. data mining, machine learning, signal processing) rather than the statistical literature, so I wasn't surprised that they hadn't come across them. The curious ones usually ask if I could write down the techniques I described on a piece of paper, because they think they are applicable to what they are trying to solve. I write down the names of the algorithms and, if I remember them, the titles of the papers where those algorithms were published; if not, I email them later.

In one of those informal discussions, I suggested that one statistician use the "Hot-SAX" algorithm for his time-series & event sequence similarity analysis, because cross-correlation (which is what he had been using) is inefficient. In fact, he had never heard of Hot-SAX, but I wasn't surprised, because Hot-SAX was published in a data mining journal (KDD), not in a statistics-related journal.

Comment by Carla Gentry on January 1, 2015 at 3:15am

Totally agree, well said Neil and Happy New Year 2015 to all the data lovers out there :o)

Comment by Luis Angel Cajachahua Espinoza on January 1, 2015 at 2:34am

Hi Vincent. I think you could add the word "Some" (or any other) in the Title. It's clear that you are talking of a group of statisticians, not about all of them. Happy New Year!

Comment by Vincent Granville on December 31, 2014 at 2:12pm

Well said Neil!

Comment by Neil Raden on December 31, 2014 at 12:06pm

"it is a classic syndrome for famous academic statisticians, they believe that they are the only ones having original ideas, and that anything remotely close to what they do is plagiarism." Vincent, this problem is endemic among all academics. There is a classic story told by Henry Kissinger. When Nixon tapped him to be National Security Advisor, his colleagues at Harvard threw a going-away party for him. Both his old associates and his new ones from Washington attended. An argument erupted among some of his old pals, and the volume and intensity rose very quickly. A reporter asked Kissinger why arguments among academics are so vicious. Henry replied, "Because the stakes are so low."

Let it roll off your back. I find people plagiarizing me all the time. I consider it a compliment.

-NR

Comment by Vincent Granville on December 30, 2014 at 12:21pm

One more comment about these people. They seem to disdain people making good money leveraging their data science or statistical knowledge. It is as if we are born in wealth, go to Harvard (as Andrew Gelman did), spend $200K in education, and then work for free, publish articles for free in statistical journals that sell them for money, write book reviews for free for publishers that sell them for money, and the list goes on. And if their statistical techniques were that great, they would easily build a network as large as DSC, but so far they haven't succeeded - not by a very long stretch (though since they target the elite, there are only so many people that could be admitted in their circles and old boys clubs).

Comment by Carla Gentry on December 29, 2014 at 4:30pm

My pleasure Vincent, Happy New Year!

Comment by Vincent Granville on December 29, 2014 at 3:21pm

Thanks Carla. I appreciate your support!

Comment by Carla Gentry on December 29, 2014 at 3:46am

Seems you aren't the only one with a problem with him; I found lots of articles complaining about him and his "Holier than thou mentality".

"Why I disagree with Andrew Gelman's critique of my paper about the rate of false discoveries in the medical literature" http://simplystatistics.org/2013/01/24/why-i-disagree-with-andrew-g...

© 2019 Data Science Central ®