Statistical Significance and p-Values Take Another Blow

I read an article this morning, about a top Cornell food researcher having 13 studies retracted, see here. It prompted me to write this blog. It is about data science charlatans and unethical researchers in the Academia, destroying the value of p-values again, using a well known trick called p-hacking, to get published in top journals and get grant money or tenure. The issue is widespread, not just in academic circles, and make people question the validity of scientific methods. It fuels the fake "theories" of those who have lost faith in science. 

The trick consists of repeating an experiment sufficiently many times, until the conclusions fit with your agenda. Or by being cherry-picking about the data you use, or even discarding observations deemed to have a negative impact on conclusions. Sometimes, causation and correlations are mixed up on purpose, or misleading charts are displayed. Sometimes, the author lacks statistical acumen. 

Usually, these experiments are not reproducible. Even top journals sometimes accept these articles, due to

  • Poor peer-review process
  • Incentives to publish sensational material 

By contrast, research that is aimed at finding the truth, sometimes does not use p-values nor classical tests of hypotheses. For instance, my recent article comparing whether two types of distributions are identical, does not rely on these techniques. Also the theoretical answer is know, so I would be lying to myself by showing results that fit with my gut feelings or intuition. In some of my tests, I clearly state that my sample size is too small to make a conclusion. And the presentation style is simple so that non-experts can understand it. Finally, I share my data and all the computations. You can read that article here. I hope it will inspire those interested in doing sound analyses.

Below are some extracts from the article I read this morning:

Some of the retracted papers include studies suggesting people who grocery shop hungry buy more calories; that preordering lunch can help you choose healthier food; and that serving people out of large bowls encourage them to serve themselves larger portions. Not that the conclusions are necessarily wrong, but because these studies are based on questionable data and misuse of statistical techniques. Below are some extracts from the article that reported the issue.

P-values of .05 aren’t that hard to find if you sort the data differently or perform a huge number of analyses. In flipping coins, you’d think it would be rare to get 10 heads in a row. You might start to suspect the coin is weighted to favor heads and that the result is statistically significant. But what if you just got 10 heads in a row by chance (it can happen) and then suddenly decided you were done flipping coins? If you kept going, you’d stop believing the coin is weighted.

Stopping an experiment when a p-value of .05 is achieved is an example of p-hacking. But there are other ways to do it — like collecting data on a large number of outcomes but only reporting the outcomes that achieve statistical significance. By running many analyses, you’re bound to find something significant just by chance alone.

There’s a movement of scientists who seek to rectify practices in science like the ones that Wansink is accused of. Together, they basically call for three main fixes that are gaining momentum.

  1. Preregistration of study designs: This is a huge safeguard against p-hacking. Preregistration means that scientists publicly commit to an experiment’s design before they start collecting data. This makes it much harder to cherry-pick results.
  2. Open data sharing: Increasingly, scientists are calling on their colleagues to make all the data from their experiments available for anyone to scrutinize (there are exceptions, of course, for particularly sensitive information). This ensures that shoddy research that makes it through peer review can still be double-checked.
  3. Registered replication reports: Scientists are hungry to see if previously reporting findings in the academic literature hold up under more intense scrutiny. There are many efforts underway to replicate (exactly or conceptually) research findings with rigor.

There are other potential fixes too: There’s a group of scientists calling for a stricter definition of statistically significant. Others argue that arbitrary cutoffs for significance are always going to be gamed. And increasingly, scientists are turning to other forms of mathematical analysis, such as Bayesian statistics, which asks a slightly different question of data. (While p-values ask, “How rare are these numbers?” a Bayesian approach asks, “What’s the probability my hypothesis is the best explanation for the results we’ve found?”)

Related articles

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on on LinkedIn, or visit my old web page here.

DSC Resources

Views: 5600


You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Robert Tullo on September 23, 2018 at 3:59pm

It is definitely true about cherry picking results.  While not ideal, if I have a number of pre-determined hypotheses to test, I would apply the Holm-Bonferroni correction to reduce the alpha.  While that isn't going to fix the problem, at least this method tries to take into account the fact you can always prove something if you look at enough things...see this great cartoon from xkcd to prove a point :)

Even so, a key point Vincent brings up is that the studies are generally not reproducible, nor is the data made available to ensure the non-conforming data isn't dropped using esoteric logic.  At the end of the day, too often the studies are not actual experiments, which is the underlying problem.

Comment by Paul Bremner on September 21, 2018 at 3:04pm

I couldn't agree more with what you say, particularly in highlighting the poor review process at journals and the incentives to publish something "different." I think particularly in the medical field the view amongst doctors is that as long as you have at least 30 patients to analyze, you've got a normal distribution and you're good to go.  There doesn't seem to be much attention paid to research design which may throw off the numbers and findings in one direction or another.  Part of this, I'm sure, is that there's not much funding for smaller studies so you get tons of itty-bitty sized studies with results that are all over the map and there's a reluctance by the journals to do any real peer review (i.e. shoot down things that defy common wisdom.)

There are larger, much better funded studies, however, done by reputable people with good knowledge of statistics and/or science and that's where you need to look for good info. You just need to know who the real producers and experts are, and focus on what they say. 

And there are other ways to goose your publication rate.  One is to head a group of researchers and insist that, as the director, anything produced by your group includes your byline.  Presto, your Web of Science ratings and citation index goes up and you're the go-to person for future grants on a given subject.  

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service