Statistical Significance and Its Part in Science Downfalls

This month's Discover Magazine carries a frightening article about the failure of psychologists to reproduce the results of 100 high-profile psychology experiments published in 2008. Some 350 scientists attempted to replicate these experiments, with the following results:

  • Researchers not involved in the initial studies contacted the original authors to get feedback on their protocols; in most cases, the original researchers helped with study designs and strategies. Despite this thoroughness, while 97 of the original studies reported significant results, only 35 of the replications reported the same. And even then, the effect size (a measurement of how strong a finding is) was smaller — on average, less than half the original size.

With only about 1 in 3 of the replicated experiments showing significant results, and even then with much smaller effect sizes, it’s fair to say that psychology has, as the article states, a ‘reproducibility crisis’.
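One mechanism behind shrinking effect sizes is selection on significance (sometimes called the "winner's curse"): studies that cross the significance threshold tend to have overestimated their effect, so faithful replications regress back toward the true value. A quick simulation sketches this; the true effect, standard error, and study count below are arbitrary illustrative choices, not figures from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_EFFECT = 0.10   # the same small true effect in every study (assumed)
SE = 0.08            # standard error of each study's estimate (assumed)
N_STUDIES = 10_000

# Each study observes the true effect plus sampling noise.
original = rng.normal(TRUE_EFFECT, SE, N_STUDIES)
# An exact replication of each study, with fresh noise.
replication = rng.normal(TRUE_EFFECT, SE, N_STUDIES)

# "Significant" originals: estimate more than 1.96 SEs above zero.
significant = original > 1.96 * SE

mean_orig = original[significant].mean()
mean_rep = replication[significant].mean()

print(f"significant originals: {significant.mean():.0%}")
print(f"mean original effect among them:  {mean_orig:.3f}")
print(f"mean replication effect for them: {mean_rep:.3f}")
```

Nothing differs between original and replication except the noise; merely conditioning on significance inflates the original estimates, and the replications fall back to the true value, roughly halving the apparent effect.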

Without making fun of our social science brethren, we’ll start with the assumption that they were well intentioned and knowledgeable about establishing statistical significance. So what could have gone wrong?

Although it’s now about two years old, there’s an interesting article by Hilda Bastian (November 2013) that shares the title of this post and is worth reading if you missed it the first time around. She reminds us:

  • Testing for statistical significance estimates the probability of getting roughly that result if the study hypothesis is assumed to be true. It can't on its own tell you whether this assumption was right, or whether the results would hold true in different circumstances. It provides a limited picture of probability, taking limited information about the data into account and giving only "yes" or "no" as options.

What's more, the finding of statistical significance itself can be a "fluke," and that becomes more likely in bigger data and when you run the test on multiple comparisons in the same data.
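The multiple-comparisons point is easy to demonstrate. In this sketch (the dataset count, the 20 tests per dataset, and the sample sizes are arbitrary illustrative choices), every variable is pure noise, yet most datasets still produce at least one "significant" result at the usual p < 0.05 cutoff:

```python
import numpy as np

rng = np.random.default_rng(42)

N_DATASETS, TESTS, N = 2000, 20, 30   # arbitrary illustrative sizes
# Pure noise: none of the 20 variables in any dataset has a real effect.
data = rng.normal(0.0, 1.0, size=(N_DATASETS, TESTS, N))

# z statistic for H0: mean = 0 with known sd = 1;
# |z| > 1.96 corresponds to the usual two-sided p < 0.05.
z = data.mean(axis=2) * np.sqrt(N)
significant = np.abs(z) > 1.96

# Fraction of datasets with at least one "significant" comparison.
rate = significant.any(axis=1).mean()
print(f"datasets with a fluke 'finding': {rate:.0%}")
```

With 20 independent tests, the chance of at least one fluke is 1 − 0.95^20, about 64%, which the simulation reproduces. The more comparisons you run on the same data, the more likely a fluke becomes.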

It would be wonderful if there were a simple single statistical measure everybody could use with any set of data and it would reliably separate true from false.

Yet, statistical significance is commonly treated as though it is that magic wand. Take a null hypothesis or look for any association between factors in a data set and abracadabra! Get a "p value" over or under 0.05 and you can be 95% certain it's either a fluke or it isn't. You can eliminate the play of chance! You can separate the signal from the noise!

Except that you can't. That's not really what testing for statistical significance does. And therein lies the rub. So this is a bit of a cautionary tale: we should remember not to be seduced by our own statistics, and consider the possibility that statistical significance is not a synonym for ‘true’. Take a look at the rest of Hilda’s article for some additional nuance on this issue.
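One way to see why significance is not a synonym for ‘true’: the share of significant findings that reflect real effects depends on the base rate of real effects and on statistical power, neither of which the p-value knows anything about. In this illustrative sketch (the 10% base rate, 50% power, and 5% alpha are assumed numbers, not from the article), nearly half of the "significant" findings are flukes:

```python
import numpy as np

rng = np.random.default_rng(1)

N_HYP = 100_000
P_REAL, POWER, ALPHA = 0.10, 0.50, 0.05   # assumed for illustration

# Which hypotheses correspond to a real effect.
real = rng.random(N_HYP) < P_REAL

# A real effect is detected with probability POWER;
# a null effect still comes up "significant" with probability ALPHA.
significant = np.where(real,
                       rng.random(N_HYP) < POWER,
                       rng.random(N_HYP) < ALPHA)

frac_true = real[significant].mean()
print(f"share of significant findings that are real: {frac_true:.0%}")
```

Analytically this is (0.10 × 0.50) / (0.10 × 0.50 + 0.90 × 0.05), about 53%: a "95% confidence" threshold, yet nearly every other significant finding is noise.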


Comment by Michael Bryan on September 3, 2016 at 6:15pm

There is an enormous elephant in the room that has little to do with statistics.

I'd lay hard cash that these "high profile experiments" had a publish motive. Academia has long been plagued by pressure to publish results, and repeatability has become a comedy. Commercial research has been more goal-driven and less apologetic. Ronald Coase famously quipped, "If you torture the data enough, nature will always confess."

It's an issue - if we want to be treated as a science - but we should look beyond the nuances of statistical significance.

Even the American Statistical Association has weighed in.  Required reading:


Comment by Mike Goings on December 28, 2015 at 7:52am

I concur that there can be problems with hypothesis testing and that statistical significance should not be considered a "magic wand." However, I disagree that it is a frightening article, and the fact that the results could not be reproduced should not be considered a flaw of psychology. These findings do not surprise me. I think they show how hard it is to make psychological predictions. That is, we are only starting to understand all the things that influence cognition and behavior. The investigator can control the experiment to the maximum extent possible, but s/he cannot control life. For example, interaction with new technology (something that happens every day) may have influenced the outcome of a psychology study. Even minor differences in the investigator's appearance can influence the outcome of a study. These are factors that even big data has not yet been able to account for because, as organisms that live in an ever-changing environment, we are constantly changing with our environment.

Comment by George Damianov on December 28, 2015 at 2:19am

There is one rule of thumb to remember when testing statistical hypotheses of any kind to get more truthful inference: always try to make the statement of interest you wish to support your alternative hypothesis, not your null hypothesis. Accepting the null hypothesis only states that the data at hand do not contradict it; it does not establish that its statement is true. But when the null hypothesis is rejected and the alternative is accepted, the data provide positive evidence for the latter's statement at the chosen significance level. Just be careful when reversing a testing problem; it is far from trivial.
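As a sketch of this rule of thumb in code (the simulated data are a hypothetical example, and the test shown is scipy's `ttest_1samp`, whose `alternative` argument is available in recent scipy versions): to support the claim "the mean improvement is positive," put that claim in the alternative and test against H0: mean ≤ 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical paired improvements; the claim we want to support
# is "mean improvement > 0", so it goes in the ALTERNATIVE.
diffs = rng.normal(0.5, 1.0, size=100)

# H0: mean <= 0  vs  H1: mean > 0 (one-sided).
res = stats.ttest_1samp(diffs, popmean=0.0, alternative="greater")
print(f"t = {res.statistic:.2f}, one-sided p = {res.pvalue:.4g}")
```

Rejecting H0 here supports the claim; failing to reject would only say the data don't contradict "no improvement," not that the claim is false.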

Comment by Mike Morgan on December 24, 2015 at 11:36am

The replicability findings are indeed worrisome.  But, it seems unclear what kinds of controls and safeguards went into the replication efforts, including whether care was taken to use a correct formula for "the probability of confirming the previous findings."  It's worth pointing out that the scientific method requires refutation of an existing, commonly held assumption, or at least a well-reasoned, well-balanced counterfactual.   Researchers must acknowledge that a finding of statistical significance does not, and never did, claim to represent Truth.  Instead, the significance test is just a little chisel to try to pluck away at existing theories and assumptions and to make a dent in human ignorance.  Without that functionality, knowledge cannot grow or improve.  The study emphasizes how easy it is to get null results, but what do we really learn from null results?  Sorry, but I'm not buying into any claims that we've learned something new from this current study.   

Comment by Harlan A Nelson on December 24, 2015 at 7:44am

This is actually very old news. Look back at the controversies over multiple-comparison corrections, publication bias, and effect size vs. statistical significance to see that this has been debated for years. Also, Bayesians consider parameters to be random variables, while frequentists (mostly) consider them fixed. That also produces confidence intervals that are too narrow when data samples are relatively small with respect to the number of parameters estimated. I think data mining has exacerbated the problem. In my experience, when using larger observational data, the attained significance levels should be zero to at least four or five decimal places, and the effect size should be large enough to take note of. In addition, subpopulations should be scanned to see if one of them is responsible for most of the effect.
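The subpopulation scan suggested above can be sketched with a hypothetical dataset (the group names, sizes, and the single responding group are invented for illustration): a modest overall treatment effect turns out to be driven entirely by one subgroup.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical data: the "effect" exists only in group A.
n = 3000
group = rng.choice(["A", "B", "C"], size=n)
treated = rng.random(n) < 0.5
effect = np.where((group == "A") & treated, 1.0, 0.0)  # only A responds
y = effect + rng.normal(0, 1, n)

df = pd.DataFrame({"group": group, "treated": treated, "y": y})

# The overall treated-vs-control difference looks like a modest effect...
overall = df[df.treated].y.mean() - df[~df.treated].y.mean()

# ...but scanning subpopulations shows group A is responsible for it.
by_group = (df[df.treated].groupby("group").y.mean()
            - df[~df.treated].groupby("group").y.mean())

print(f"overall effect: {overall:.2f}")
print(by_group.round(2))
```

An aggregate significance test on `overall` would happily report an effect; only the per-group breakdown reveals that it doesn't generalize beyond group A.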
