Statistics is Dead – Long Live Data Science…

I keep hearing Data Scientists say that ‘Statistics is Dead’, and they even have big debates about it attended by the good and great of Data Science. Interestingly, there seem to be very few actual statisticians at these debates.

So why do Data Scientists think that stats is dead? Where does the notion that there is no longer any need for statistical analysis come from? And are they right?

Is statistics dead or is it just pining for the fjords?

I guess that really we should start at the beginning by asking the question ‘What Is Statistics?’.

Briefly, what makes statistics unique and a distinct branch of mathematics is that statistics is the study of the uncertainty of data.

 

So let’s look at this logically. If Data Scientists are correct (well, at least some of them) and statistics is dead, then either (1) we don’t need to quantify the uncertainty or (2) we have better tools than statistics to measure it.

 

Quantifying the Uncertainty in Data

Why would we no longer have any need to measure and control the uncertainty in our data?

Have we discovered some amazing new way of observing, collecting, collating and analysing our data that we no longer have uncertainty?

I don’t believe so and, as far as I can tell, with the explosion of data that we’re experiencing – the amount of data that currently exists doubles every 18 months – the level of uncertainty in data is on the increase.

 

So we must have better tools than statistics to quantify the uncertainty, then?

Well, no. It may be true that most statistical measures were developed decades ago when ‘Big Data’ just didn’t exist, and that the ‘old’ statistical tests often creak at the hinges when faced with enormous volumes of data, but there simply isn’t a better way of measuring uncertainty than with statistics – at least not yet, anyway.
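
To make the 'creaking' concrete, here is a minimal sketch (not from the original post) of one classical measure of uncertainty, an approximate 95% confidence interval for a mean, applied to samples of increasing size. The data is simulated and hypothetical: a true mean of 0.01 with a standard deviation of 1, and the helper name `mean_with_ci` is made up for illustration. The interval narrows roughly as one over the square root of the sample size, so with enough data even a practically negligible effect can look 'statistically significant'. The uncertainty is still being measured correctly; deciding whether the effect matters is left to the analyst.

```python
import math
import random
import statistics


def mean_with_ci(sample, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% confidence interval.

    Uses the normal approximation: mean +/- z * (sample std dev / sqrt(n)).
    """
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m, m - z * se, m + z * se


random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: a tiny true effect (mean 0.01) buried in noise (sd 1).
for n in (100, 10_000, 1_000_000):
    sample = [random.gauss(0.01, 1.0) for _ in range(n)]
    m, lo, hi = mean_with_ci(sample)
    print(f"n={n:>9,}  mean={m:+.4f}  95% CI=({lo:+.4f}, {hi:+.4f})")
```

The normal approximation is an assumption chosen for brevity; for small samples a t-based interval would be the more usual choice.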

 

So why is it that many Data Scientists are insistent that there is no place for statistics in the 21st Century?

 

Well, I guess if it’s not statistics that’s the problem, there must be something wrong with Data Science.

 

So let’s have a heated debate...

 

What is Data Science?

Nobody seems to be able to come up with a firm definition of what Data Science is.

Some believe that Data Science is just a sexed-up term for statistics, whilst others suggest that it is an alternative name for ‘Business Intelligence’. Some claim that Data Science is all about the creation of data products to be able to analyse the incredible amounts of data that we’re faced with.

 

I don’t disagree with any of these, but suggest that maybe all these definitions are a small part of a much bigger beast.

To get a better understanding of Data Science it might be easier to look at what Data Scientists do rather than what they are.

 

Data Science is all about extracting knowledge from data (I think just about everyone agrees with this very vague description), and it incorporates many diverse skills, such as mathematics, statistics, artificial intelligence, computer programming, visualisation, image analysis, and much more.

It is the last bit, the ‘much more’, that I think defines a Data Scientist more than the previous bits. In my view, if you want to be an expert Data Scientist in Business, Medicine or Engineering then the biggest skill you’ll need will be in Business, Medicine or Engineering. Ally that with a combination of some/all of the other skills and you’ll be well on your way to being in great demand by the top dogs in your field.

 

In other words, if you want to call yourself a Data Scientist you really do need to be an expert in your field as well as having some of the other listed skills.

 

Are Computer Programmers Data Scientists?

On the other hand – as seems to be happening in Universities here in the UK and over the pond in the good old US of A – there are Data Science courses full of computer programmers that are learning how to handle data, use Hadoop and R, program in Python and plug their data into Artificial Neural Networks.

It seems that we’re creating a generation of Computer Programmers that, with the addition of a few extra tools on their CV, claim to be expert Data Scientists.

 

I think we’re in dangerous territory here.

 

It’s easy to learn how to use a few tools, but much, much harder to use those tools intelligently to extract valuable, actionable information in a specialised field.

If you have little/no medical knowledge, how do you know which data outcomes are valuable?

If you’re not an expert in business, then how do you know which insights should be acted upon to make sound business decisions, and which should be ignored?

 

Plug-And-Play Data Analysis

This, to me, is the crux of the problem. Many of the current crop of Data Scientists – talented computer programmers though they may be – see Data Science as an exercise in plug-and-play.

Plug your dataset into tool A and you get some descriptions of your data. Plug it into tool B and you get a visualisation.

Want predictions? Great – just use tool C.

 

Statistics, though, seems to be lagging behind in the Data Science revolution. There aren’t nearly as many automated statistical tools as there are visualisation tools or predictive tools, so the Data Scientists have to actually do the statistics themselves.

And statistics is hard.

So they ask if it’s really, really necessary.

I mean, we’ve already got the answer, so why do we need to waste our time with stats?

Booooring….

 

So statistics gets relegated to such an extent that Data Scientists declare it dead.

 

Talk about the lunatics running the asylum…


What do you think?

Is statistics dead? Is there no place for statistics in data science or is it essential?

Join the debate below and let me know your thoughts...


About the Author

Lee Baker is an award-winning software creator with a passion for turning data into a story.

A proud Yorkshireman, he now lives by the sparkling shores of the East Coast of Scotland. Physicist, statistician and programmer, child of the flower-power psychedelic ‘60s, it’s amazing he turned out so normal!

Turning his back on a promising academic career to do something more satisfying, as the CEO and co-founder of Chi-Squared Innovations he now works double the hours for half the pay and 10 times the stress - but 100 times the fun!

PS - Don't forget to connect with me on Twitter: @eelrekab


This post has been published previously in Innovation Enterprise and LinkedIn Pulse

Tags: data science, statistics

Comment by leonardo auslender on July 27, 2016 at 11:01am

NOOOOOOOO, they promised me that I could perform brain surgery by reading it from the Web!! I want my money back. 

Comment by Lee Baker on July 13, 2016 at 6:37am

@Halle

It is generally accepted (oh dear, did I just generalise?!??) that data science is the interface of computer programming, maths, stats, artificial intelligence and domain expertise. In my view, the domain expertise is the most important, because without it how can you make powerful, actionable insights? If you work in the medical arena, would you be comfortable suggesting an alternative treatment regime that you found in your data without having any understanding of the underlying causes of the illness or what effect the treatment will have on the patient? I certainly wouldn't!

Here are some useful articles, complete with Venn diagrams, to show the various facets of data science:

Hope these help!

Comment by Halle Davis on July 13, 2016 at 5:35am

I'm interested in your subpoint: "In my view, if you want to be an expert Data Scientist in Business, Medicine or Engineering then the biggest skill you’ll need will be in Business, Medicine or Engineering. Ally that with a combination of some/all of the other skills and you’ll be well on your way to being in great demand by the top dogs in your field." Could you please elaborate on the rationale or evidence that led to this conclusion?

Comment by Lee Baker on July 11, 2016 at 12:30am

@James

"When you have a hammer, everything looks like a nail."

This is the danger of plug-and-play data science. If we are to discover real, actionable insights from data, we need to use the correct tool and know how to use it correctly.

Comment by James Kim on July 10, 2016 at 6:54pm

In the world of hammers, some just use the hammers, and some are hammer crafters.

Similarly, Data Scientists are the experts that get from data to insight more efficiently. I use data science as part of my tool box to reach business objectives. In a way, we do not need to be data scientists to make use of data science.

Comment by Lee Baker on July 7, 2016 at 1:55am

@Alain,

I agree with pretty much everything you've said. In fact, I've just published another DSC blog explaining why (IMO) data scientists are worthy of high wages.

You can read it here:

http://www.datasciencecentral.com/profiles/blogs/why-good-data-scie...

Comment by Alain Debecker on July 7, 2016 at 1:35am

It's a long debate, which I have known about since at least the '80s.

Yes, before the terms Data Science, Business Intelligence and Data Mining were actually coined. And when a new name is coined, it is of course for the marketing buzz, but also because we need a word to cover a new thing. Otherwise the neologism disappears, like Data Engineering and many others.

In my opinion:

- Statistics is the art of discovering relevant general facts from a few observations (the sample). It separated from Demographics (the art of evaluating population size, number of potential soldiers and expected tax income without having to proceed to a full census) by applying the techniques to other fields: Biology (Fisher), Beer production (Student),...

- Data Mining is the art of discovering relevant general facts without being overwhelmed by too many observations. The statistical techniques are applied to deal with bad data (sorry, outliers), to summarize (sorry, average), to group and segment (sorry, clusterize), and to analyse semantics and sentiment (sorry, to quantify). As a consequence, from a long distance, data mining looks like the analysis of qualitative data and statistics like the analysis of quantitative data. But this hides the fact that the main concern of statistics is (hypothesis) inference and the main concern of data mining is massive data volume. Of course, data mining was invented after the computer.

- Business Intelligence is the art of handling data relevant for the business, something similar to what observation reports are to physics and astronomy, what "psychological" tests and folders are to sociology, and what medical records are to health science. One of the characteristics of Business data (sales, inventory, accounting, production system,...) is its volume and the fact that it is stored and handled by computer. Hence the appearance of special methods and operations (Data warehouse, ETL, Reporting, OLAP, Predictive Analytics, What-if, Forecasting,...)

- Data Science is (in my opinion) the art related to data for itself. How to store and retrieve it. How to move it on a network or in a pocket on a USB key. How to transform it to produce reports, or graphics, or other computer screens. How to make it send mail or awake robots and softbots,...

So basically, if you had to start a history of Data Science, I would tend to start it with the Sumerian tokens and cuneiform clay tablets, rather than with Wargentin's Tabellverket (the first statistical institute) or with the mag tapes of the UNIVAC.

And of course, if you call it a Science, it must have a sound knowledge of high-level mathematical models and methods. Discrete mathematics of course, but also statistics, Fourier transforms and Information theory.

By the way, how can you claim to be a Data Scientist if you don't know what Shannon negentropy measures? In what kind of unit would you measure the loss from deleting a column or an index in a database? Or the consequences of making a decision based on a badly computed report?

I know I disqualify at least 85%-95% of the would-be data scientists, but the fact is that data science is hard, and boring, and tedious, and demanding. And maybe it's to compensate for that that salaries are high.

Far from being dead, Statistics is a prerequisite for Data Science, and a very small part of the prerequisites.

Comment by Lee Baker on July 6, 2016 at 10:13am

@Srividya

I'm not saying that statistics is dead - I'm arguing exactly the opposite!

There have been lots of articles over the past few years suggesting that stats might be dead. Here's a couple of them:

Comment by Srividya Kannan Ramachandran on July 6, 2016 at 9:15am

Statistics is a core foundation pillar of data science. Could you share references to articles or reports that claim that statistics is dead? I have not come across anyone claiming that statistics is dead.

The industrial applications of a lot of data science techniques are rapidly changing every day, but please help me understand how that translates to "Statistics is dead"? Thanks!

Comment by Lee Baker on June 27, 2016 at 1:27pm

@Rusul

Yes indeed, which is why it's so silly (IMHO) that some data scientists are declaring statistics to be dead. There's never been a more important time for statistics!
