This month's Discover Magazine has a really frightening article about the failure of psychologists to reproduce the results of 100 high-profile psychology experiments published in 2008. In that effort, 350 scientists attempted to reproduce the results of these experiments.
With only 1 in 3 of the reproduced experiments showing significant results, and even then with much smaller effect sizes, it’s fair to say that psychology has, as the article states, a ‘reproducibility crisis’.
Without making fun of our social science brethren, we’ll start with the assumption that they were well intentioned and knew how to establish statistical significance. So what could have gone wrong?
Although it’s now about two years old, there’s an interesting article by Hilda Bastian (November 2013), sharing the title of this blog, that’s worth reading if you missed it the first time around. She reminds us:
What's more, the finding of statistical significance itself can be a "fluke," and that becomes more likely in bigger data and when you run the test on multiple comparisons in the same data.
It would be wonderful if there were a simple single statistical measure everybody could use with any set of data and it would reliably separate true from false.
Yet, statistical significance is commonly treated as though it is that magic wand. Take a null hypothesis or look for any association between factors in a data set and abracadabra! Get a "p value" over or under 0.05 and you can be 95% certain it's either a fluke or it isn't. You can eliminate the play of chance! You can separate the signal from the noise!
Except that you can't. That's not really what testing for statistical significance does. And therein lies the rub. So this is a bit of a cautionary tale: we should not be seduced by our own statistics, and we should keep in mind that statistical significance is not a synonym for ‘true’. Take a look at the rest of Hilda’s article for some additional nuance on this issue.
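To make the multiple-comparisons point concrete, here is a minimal simulation sketch. It is not taken from Hilda's article or from the reproducibility study; the function names, the sample size of 30, the known-variance z-test and the assumption that the tests are independent are all choices made purely for illustration. Every null hypothesis in the simulation is true, so any "significant" result is a fluke by construction.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05  # the conventional significance threshold


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: mean == 0, assuming a known standard deviation."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def familywise_error_rate(num_tests, num_studies=2000, sample_size=30, seed=0):
    """Fraction of simulated studies reporting at least one 'significant' test
    even though every null hypothesis is true (illustrative setup only)."""
    rng = random.Random(seed)
    studies_with_fluke = 0
    for _ in range(num_studies):
        for _ in range(num_tests):
            # Pure noise: the true mean really is zero.
            sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
            if z_test_p_value(sample) < ALPHA:
                studies_with_fluke += 1
                break  # one fluke is enough for this study to "find something"
    return studies_with_fluke / num_studies


def expected_rate(num_tests, alpha=ALPHA):
    """Analytic chance of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(familywise_error_rate(k), 3), round(expected_rate(k), 3))
```

Under these assumptions a single test produces a fluke about 5% of the time, but with 20 independent tests the analytic chance of at least one fluke is 1 - 0.95^20, or roughly 64%. Comparisons run on the same data are usually correlated rather than independent, so the exact figure will differ in practice, but the direction of the effect is the same.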