It sounds almost heretical to ask the question: Is there too much scientific research?
The need for more research on — fill in favorite subject — is one of those self-evident truisms. Diverse medical communities seeking government funding, foundation grants and donations in the cause of curing disease certainly concur with it. The political debate over climate change produces highly charged wars of words, but both sides embrace the idea of more research. Universities and nonprofit research organizations plead vociferously for more research dollars. Flying in the face of this great need is a perceived drying up of research funding: the National Institutes of Health and other government research agencies are flatlined as a result of ferocious fiscal battles and the budget stalemate in Congress.
So how could one possibly think there is too much research?
A look at the results of current research is instructive. In August 2015 Brian Nosek and colleagues at the Center for Open Science, which he co-founded, shared an examination of 100 different studies published in 2008, all in the area of psychology. All but three of the studies had reported statistically significant findings. Nosek and company set out to replicate them, consulting with the original authors and using the same methods. Only 36 percent of the original studies were confirmed, and those that were had smaller effect sizes in the replication than in the original study. John Ioannidis, who has written extensively on the problem of scientific findings that evaporate on reexamination, drew the attention of the scientific community to this problem earlier with his 2005 paper, “Why Most Published Research Findings Are False.”
At the heart of this reproducibility problem are the statistical inference methods used to validate research findings, specifically the concept of “statistical significance.” A statistically significant result is one that differs substantially from what you might expect from random chance. This sounds reasonable, if a bit vague, but when the notion is made more concrete it morphs into a “statistical black box” that is beyond the ken or interest of most researchers. Most are interested solely in their data and their findings, and passing the test of statistical significance is simply a necessary procedural step, like getting your passport stamped at the border.
Almost like that. If you are a traveler turned away at one entry point and then try to enter at another, the immigration agency will remember your first attempt and you’ll probably be denied again. Not so with the gatekeepers of research. If your preliminary findings do not pass the bar of statistical significance, you can get other bites of the apple. Suppose you are looking at the effect of vitamin X on health, and you decide to use one of the large epidemiological cohort data sets (for example, the “Framingham Study,” begun in 1948 with residents of Framingham, Mass.).
You might find, to your disappointment, there is no relationship between vitamin X and health in the data. But you can then go back and look just at women; or men; or men over 50. Who’s to know how many subgroups you look at before finding a relationship? The protection that statistical inference offers against being “fooled by chance” disappears when you repeatedly hunt for interesting patterns in large data sets (unless you correctly apply so-called multiple-testing procedures that raise the bar of statistical significance).
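The arithmetic behind this warning can be sketched in a few lines of Python. This is a simplified illustration, not a description of any particular study: it assumes vitamin X truly has no effect, that each subgroup test is independent (overlapping subgroups such as "men" and "men over 50" are not), and that the usual 0.05 threshold applies. The function names are invented for the example, and the Bonferroni correction shown is just one of the multiple-testing procedures available.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across num_tests tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Stricter per-test threshold that keeps the overall rate at or below alpha."""
    return alpha / num_tests


# Hypothetical numbers of subgroups examined before reporting a result.
for k in (1, 5, 20):
    uncorrected = familywise_error_rate(k)
    corrected = familywise_error_rate(k, bonferroni_threshold(k))
    print(f"{k:>2} subgroups: uncorrected {uncorrected:.3f}, corrected {corrected:.3f}")
```

Under these assumptions, a researcher who quietly examines 20 subgroups has roughly a 64 percent chance of finding at least one "significant" relationship where none exists, while the corrected threshold holds that chance below 5 percent.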
Consider these scenarios. How would you interpret them?
Scenario 1: A person claims to be able to toss a coin and “will” it to land heads on each toss. You ask the person to toss a quarter 10 times, and it comes up heads all 10 times.
Scenario 2: The announcer at a Yankees game asks all 20,000 fans in attendance to toss a coin 10 times and report if they got all heads. The fan in section 301, row P, seat 12 announces to an usher that he got all heads.
In the first scenario you have done a single “test” with remarkable results, and you’re sufficiently surprised by those results to think the person has unusual abilities. In the second scenario you have done 20,000 tests; in other words, you’ve created 20,000 opportunities for something unusual to happen. It is not at all surprising, therefore, that some fan would get 10 heads in a row (in fact, it is almost a certainty).
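The two scenarios can be checked with a short calculation. The sketch below assumes fair coins and independent tosses, and its function names are made up for the illustration.

```python
def prob_all_heads(num_tosses: int) -> float:
    """Probability that one person tossing a fair coin gets heads every time."""
    return 0.5 ** num_tosses


def prob_at_least_one_all_heads(num_people: int, num_tosses: int) -> float:
    """Probability that at least one of num_people gets all heads."""
    p = prob_all_heads(num_tosses)
    return 1.0 - (1.0 - p) ** num_people


single = prob_all_heads(10)                        # Scenario 1: one tosser
crowd = prob_at_least_one_all_heads(20_000, 10)    # Scenario 2: the stadium
expected_fans = 20_000 * single                    # average number of all-heads fans

print(f"One person, 10 heads in a row: {single:.6f}")
print(f"At least one of 20,000 fans:   {crowd:.9f}")
print(f"Expected all-heads fans:       {expected_fans:.1f}")
```

A single person has about a 1-in-1,024 chance of tossing 10 straight heads, but in a crowd of 20,000 you would expect around 19 or 20 fans to do so by luck alone.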
The American Statistical Association held a symposium in mid-October on statistical inference. There, John Ioannidis and Steve Goodman laid out the challenge that faces the statistical profession, as its “Good Housekeeping Seal of Approval” on research steadily loses value. Ioannidis said, “We are drowning in a sea of statistical significance”…and…“p-values [a standard method of calculating significance] have become a boring nuisance.” The symposium was a follow-on to the ASA statement on p-values from last year, and the attendees debated possible technical solutions to the problem — for example, switching from p-values to confidence intervals around effect sizes.
But the issue is much more fundamental. Too many researchers, under career pressure to produce publishable results, are chasing too much data with too much analysis in pursuit of significant results. The number of scientific papers indexed by PubMed in 2011 topped 1.2 million, a quadrupling since 1980. Is this anywhere near proportionate to the amount of breakthrough knowledge and innovation waiting to be discovered?
Bruce Alberts et al. alluded to this problem in their article discussing systemic flaws in medical research. As they put it: “…most successful biomedical scientists train far more scientists than are needed to replace him- or herself; in the aggregate, the training pipeline produces more scientists than relevant positions in academia, government and the private sector are capable of absorbing.”
And as more papers get published, more get retracted. The problem is particularly severe in China. The journal Tumor Biology retracted 107 published papers from China earlier this year after finding their peer review process was faked. A survey of Chinese biomedical researchers published this year in Science and Engineering Ethics came up with an estimate that 40 percent of research in China is tainted by misconduct.
Steve Goodman, at the symposium, tended to concur that the reproducibility problem in research is driven by the number of researchers seeking publication, and that replacing the p-value with other criteria would buy only temporary improvement, since other publication criteria would likely be gamed as well.
Consider proposals to reduce the p-value threshold from 0.05 to 0.005. Would this help? It might conceivably make things worse: Raising the statistical significance bar 10-fold will indeed pose a greater obstacle to the publication of research results. But that obstacle falls unevenly. Good research that has a sound design, is honestly conducted and reported, and has the potential to be replicated will be completely stymied, whereas unsound or dishonest research that relies on “p-hacking” will merely need a wider search to locate the magical results that meet the test of statistical significance.
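How much wider would that search need to be? A rough sketch, under the simplifying assumptions that no real effect exists and that each look at the data is an independent test, suggests the answer is about 10 times wider. The function names here are hypothetical, and real p-hacking involves correlated analyses that this toy calculation ignores.

```python
import math


def expected_looks_until_hit(alpha: float) -> float:
    """Average number of independent null tests run before one clears alpha."""
    return 1.0 / alpha


def looks_for_even_odds(alpha: float, target: float = 0.5) -> int:
    """Smallest number of independent null tests giving at least a `target`
    chance that one or more of them falls below alpha."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - alpha))


for alpha in (0.05, 0.005):
    print(
        f"threshold {alpha}: about {expected_looks_until_hit(alpha):.0f} looks on average, "
        f"{looks_for_even_odds(alpha)} looks for even odds of a spurious hit"
    )
```

In this simplified picture, the stricter threshold moves the even-odds point from 14 looks to 139. That is a real cost, but hardly a prohibitive one for anyone mining a large data set with software.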
Is this overstating the problem? Oversimplifying, perhaps.
The problem of reproducibility is particularly acute in research that involves exploration of existing data in search of something interesting (that is, publishable) as opposed to an experiment in which a hypothesis is stated in advance, then data are collected to test it. The latter, if conducted honestly, has a built-in mechanism to limit spurious results. And at the heart lies the question of motivation: Is the research driven by curiosity and the need to answer a pressing question? Or is it driven by career considerations of the researcher?
Galit Shmueli, a noted data analytics author who has published widely on the distinction between using statistics to explain versus predict, disputes the notion of research satiation. She contends that today’s technological landscape will require more good, relevant research in management, social science and the humanities.
Still, it remains the case that there is no natural connection between the supply of researchers (driven largely by government funding and by the growing size of the higher education sector) and the supply of good, relevant research results. It is the large and growing number of researchers striving for publishable results that leads to conclusions that overreach and can’t be replicated. The statistics profession can supply a more holistic and less “gameable” threshold for publication, but this will not reduce the pressure to game the system.
About the author:
Peter Bruce founded The Institute for Statistics Education at Statistics.com in 2002. He is a co-author of Data Mining for Business Analytics, 5 editions with Korean and Chinese translations (Wiley), and Practical Statistics for Data Scientists: 50 Essential Concepts (O’Reilly, 2017), the author of Introductory Statistics and Analytics: A Resampling Perspective (Wiley), and the co-developer of Resampling Stats software.