# Statistics is Dead – Long Live Data Science…

I keep hearing Data Scientists say that ‘Statistics is Dead’, and they even have big debates about it attended by the great and the good of Data Science. Interestingly, there seem to be very few actual statisticians at these debates.

So why do Data Scientists think that stats is dead? Where does the notion that there is no longer any need for statistical analysis come from? And are they right?

Is statistics dead or is it just pining for the fjords?

I guess that really we should start at the beginning by asking the question ‘What Is Statistics?’.

Briefly, what makes statistics unique and a distinct branch of mathematics is that statistics is the study of the uncertainty of data.

So let’s look at this logically. If Data Scientists are correct (well, at least some of them) and statistics is dead, then either (1) we don’t need to quantify the uncertainty or (2) we have better tools than statistics to measure it.

## Quantifying the Uncertainty in Data

Why would we no longer have any need to measure and control the uncertainty in our data?

Have we discovered some amazing new way of observing, collecting, collating and analysing our data that we no longer have uncertainty?

I don’t believe so and, as far as I can tell, with the explosion of data that we’re experiencing – the amount of data that currently exists doubles every 18 months – the level of uncertainty in data is on the increase.

So we must have better tools than statistics to quantify the uncertainty, then?

Well, no. It may be true that most statistical measures were developed decades ago when ‘Big Data’ just didn’t exist, and that the ‘old’ statistical tests often creak at the hinges when faced with enormous volumes of data, but there simply isn’t a better way of measuring uncertainty than with statistics – at least not yet, anyway.
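To see what ‘creaking at the hinges’ can look like, here is a minimal sketch in Python with made-up numbers. It assumes a simple two-sample z-test with a known, common standard deviation (the function name and inputs are hypothetical, chosen only for illustration) and shows how a trivially small difference between two groups becomes ‘highly significant’ purely because the sample size grows.

```python
import math


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups, assuming a known common standard deviation (a z-test)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical inputs: a difference of 0.01 standard deviations,
# which would be negligible in most practical settings.
p_values = {
    n: two_sample_z_p_value(mean_diff=0.01, sd=1.0, n_per_group=n)
    for n in (100, 10_000, 1_000_000)
}

for n, p in p_values.items():
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

With these inputs the p-value is large at 100 observations per group and vanishingly small at a million, even though the effect itself never changes. The test isn't broken; it is answering ‘is the difference exactly zero?’, which is rarely the interesting question with enormous datasets, so an effect size or confidence interval is usually worth reporting alongside it.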

So why is it that many Data Scientists are insistent that there is no place for statistics in the 21st Century?

Well, I guess if it’s not statistics that’s the problem, there must be something wrong with Data Science.

So let’s have a heated debate...

## What is Data Science?

Nobody seems to be able to come up with a firm definition of what Data Science is.

Some believe that Data Science is just a sexed-up term for statistics, whilst others suggest that it is an alternative name for ‘Business Intelligence’. Some claim that Data Science is all about the creation of data products to be able to analyse the incredible amounts of data that we’re faced with.

I don’t disagree with any of these, but suggest that maybe all these definitions are a small part of a much bigger beast.

To get a better understanding of Data Science it might be easier to look at what Data Scientists do rather than what they are.

Data Science is all about extracting knowledge from data (I think just about everyone agrees with this very vague description), and it incorporates many diverse skills, such as mathematics, statistics, artificial intelligence, computer programming, visualisation, image analysis, and much more.

It is the last bit, the ‘much more’, that I think defines a Data Scientist more than the previous bits. In my view, if you want to be an expert Data Scientist in Business, Medicine or Engineering then the biggest skill you’ll need will be in Business, Medicine or Engineering. Ally that with a combination of some/all of the other skills and you’ll be well on your way to being in great demand by the top dogs in your field.

In other words, if you want to call yourself a Data Scientist you really do need to be an expert in your field as well as having some of the other listed skills.

## Are Computer Programmers Data Scientists?

On the other hand – as seems to be happening in Universities here in the UK and over the pond in the good old US of A – there are Data Science courses full of computer programmers who are learning how to handle data, use Hadoop and R, program in Python and plug their data into Artificial Neural Networks.

It seems that we’re creating a generation of Computer Programmers who, with the addition of a few extra tools on their CV, claim to be expert Data Scientists.

I think we’re in dangerous territory here.

It’s easy to learn how to use a few tools, but much, much harder to use those tools intelligently to extract valuable, actionable information in a specialised field.

If you have little/no medical knowledge, how do you know which data outcomes are valuable?

If you’re not an expert in business, then how do you know which insights should be acted upon to make sound business decisions, and which should be ignored?

## Plug-And-Play Data Analysis

This, to me, is the crux of the problem. Many of the current crop of Data Scientists – talented computer programmers though they may be – see Data Science as an exercise in plug-and-play.

Plug your dataset into tool A and you get some descriptions of your data. Plug it into tool B and you get a visualisation.

Want predictions? Great – just use tool C.

Statistics, though, seems to be lagging behind in the Data Science revolution. There aren’t nearly as many automated statistical tools as there are visualisation tools or predictive tools, so the Data Scientists have to actually do the statistics themselves.

And statistics is hard.

So they ask if it’s really, really necessary.

I mean, we’ve already got the answer, so why do we need to waste our time with stats?

Booooring….

So statistics gets relegated to such an extent that Data Scientists declare it dead.

Talk about the lunatics running the asylum…

What do you think?

Is statistics dead? Is there no place for statistics in data science or is it essential?

Join the debate below and let me know your thoughts...

Lee Baker is an award-winning software creator with a passion for turning data into a story.

A proud Yorkshireman, he now lives by the sparkling shores of the East Coast of Scotland. Physicist, statistician and programmer, child of the flower-power psychedelic ‘60s, it’s amazing he turned out so normal!

Turning his back on a promising academic career to do something more satisfying, as the CEO and co-founder of Chi-Squared Innovations he now works double the hours for half the pay and 10 times the stress - but 100 times the fun!

PS - Don't forget to connect with me on Twitter: @eelrekab

This post has been published previously in Innovation Enterprise and LinkedIn Pulse

Tags: data science, statistics

Comment by Lee Baker on January 8, 2018 at 1:29am

@Pablo

You are of course correct that all these Data Scientists are human, with typical human frailties. We all have them.

Declaring statistics to be dead says more about their own human frailties than it does about statistics or Data Science. It would serve Data Science better for the 'statistics deniers' to look in the mirror, admit their frailties and learn more about stats.

They would end up as better Data Scientists.

After all, progress isn't made by denying that there's a problem...

Comment by Pablo Bernabeu on December 25, 2017 at 2:56pm

Awesome post. Perhaps those 'lunatics' (affectionately dubbed) are just very human people. Nobody likes uncertainty, of all things, in any degree. Plots and predictions might better suit our natural preference.

Comment by CPA Stephen Omondi Okoth on October 8, 2017 at 12:25am

Good perspective there. Statistics is one of the important data science tools. There are alternative data science tools and the choice of which tool to use depends on the application of data science.

Comment by Todd Lane on September 28, 2017 at 9:10am

I'm just a student data scientist, no expert or professional, but personally, no, I don't think statistics is dead. All of my instructors have actually stressed the need for statistics to one degree or another. Maybe the job of being just a statistician is dying, but perhaps that is because statisticians are becoming data scientists themselves. There is such a debate about what a data scientist is, but to me it is, along with the skills talked about in this article, taking your data and applying the scientific method to a problem you want to answer or understand.

Comment by Tom Ke Tao on March 27, 2017 at 10:48am

A data science tool always has some statistical implication. Otherwise, the tool itself has a problem. To use the tool correctly, a data scientist needs to understand its statistical implications and the business data they are working on.

Comment by Bernard F Siahaan on August 31, 2016 at 4:06pm

Statistics will never die ... because data science without statistics is bullshit, and without an understanding of statistics a data scientist will die. In the future, there will be many people who call themselves data scientists but are basically just data administrators.

The fact is that it was statisticians who invented/created the first computer ... and of course data science (Maybe!!!).

Comment by Boris Shmagin on August 22, 2016 at 12:58pm

This is a very interesting situation; I mean the post and comments. What is statistics? How do we define its death?

Everybody has their own definition of statistics, which is generally related to education and experience. Therefore, I do not understand the discussion of so personal a matter.

Death is a very serious thing. With paper as the medium for some time, and digital as it is now, statistics as a system of logical and mathematical concepts, models, and rules will stay “alive”.

Comment by Lee Baker on July 27, 2016 at 1:50pm

@Jason,

That's what I'm getting at - it's about the skills, knowledge and experience, not the tools.

http://www.datasciencecentral.com/profiles/blogs/why-good-data-scie...

Comment by Lee Baker on July 27, 2016 at 1:48pm

@Leonardo

Nobody said you can't perform brain surgery after reading about it on the web, and nobody said you can't do data science after reading about it.

Of course, if you want your patient and your data to survive the process...

:-)

Comment by Jason Williams on July 27, 2016 at 12:01pm

I agree that domain expertise is the most important quality. The jobs that I have received are not because I'm good at R/Python/Spark but because I know what drives success in my business type and what questions to ask.

When I do sentiment analysis, I have a strong understanding of what my customers want. And while I may be able to write clustering code in Spark or Sparklyr, that's not what is important. What's important is that I know how to extract what matters to my customer base and drive those insights forward. A whole new debate (and you touched on it) should be how many data science and analytics programs emphasize tool knowledge over domain knowledge. I think it misleads upcoming data scientists.