Gross Underestimates and Overestimates from the Same Data: Covid-19 Death Rates Example

Ten-fold differences have been reported in Covid-19 death rates.
The discrepancies arise from how the rates are calculated.
The same data can yield both gross overestimates and gross underestimates.

What is the actual death rate for Covid-19? After nearly a year of the pandemic, no one can agree on an answer. Depending on which expert you ask, it’s somewhere between 0.53% and 6% for the general population and possibly as high as 35% for older patients with certain pre-existing conditions. These widely varying figures illustrate how difficult it is to take data and make predictions–even with the best data and machine learning tools at your disposal.

Part of the problem is clarity: news sources and blogs in particular cite statistics without clarifying exactly which statistic they mean. For example, one MedPageToday article [1] mentions a “fatality rate” without making it immediately clear whether that rate applies to hospitalized cases, to people who have tested positive, or to the population as a whole (each of these rates would be vastly different).

Despite Covid-19 fueling one of the largest explosions of scientific literature in history [2], we’re not even close to accurately figuring out what percentage of the population the virus actually kills. All of the reported figures amount to nothing more than data-driven guesswork.

The Anatomy of a Death Rate Calculation

A recent episode of The Guardian’s Science Weekly Update [3] addressed the question of why Covid-19 fatality rates vary so much. In the program, Paul Hunter, professor of medicine at the University of East Anglia, explains that figures such as the World Health Organization’s reported death rate of 3.4 percent were calculated by dividing the number of Covid-19 related deaths (as recorded on death certificates) by the number of confirmed cases (based on positive Covid-19 tests). That figure, called the “case fatality rate (CFR),” is a statistic that Robert Pearl, M.D., calls “inaccurate and misleading” [4]. Why? Depending on how you look at it, it’s either a gross underestimate or a gross overestimate.

The estimation issues arise because of how the figures are calculated. The CFR is calculated from a snapshot that includes people at the beginning of their illness as well as those at the end of it; people who are still ill when the data is recorded may go on to die after the figures have been tallied. The 3.4% is therefore an underestimate: the people who are currently sick and go on to die will push that percentage up to around 5 to 6%. Although around 5% might seem like a reasonable estimate (probably one that matches figures you’ve often seen in the news), note how the figure is obtained in the first place: the number of deaths divided by the number of confirmed cases. An unknown number of people with the virus never get tested, and the true number of infections may be up to 10 times higher than the official counts [5]. If we could count all of these cases, most of which are probably mild or asymptomatic, the death rate would be significantly lower, meaning that 3.4% is actually an overestimate. The number of deaths relative to the actual number of infections in the population is called the “true infection fatality rate (IFR)” and may be as low as 0.53% [7].
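The arithmetic behind these competing estimates can be sketched in a few lines of Python. The counts are hypothetical round numbers, chosen only so that the naive CFR comes out at 3.4%; the function names, the number of pending deaths, and the tenfold undercount factor (the upper bound mentioned above) are illustrative assumptions rather than figures from any cited source.

```python
def case_fatality_rate(deaths, confirmed_cases):
    """Deaths divided by confirmed (test-positive) cases."""
    return deaths / confirmed_cases


def infection_fatality_rate(deaths, confirmed_cases, undercount_factor):
    """Deaths divided by estimated total infections.

    undercount_factor is the assumed ratio of true infections
    to confirmed cases (hypothetical, not a measured value).
    """
    return deaths / (confirmed_cases * undercount_factor)


# Hypothetical snapshot taken partway through an outbreak.
confirmed = 100_000
deaths_so_far = 3_400
# Hypothetical: deaths still to come among people ill at snapshot time.
pending_deaths = 1_600

# Naive CFR ignores patients who are still ill: an underestimate.
naive_cfr = case_fatality_rate(deaths_so_far, confirmed)

# CFR once every confirmed case has resolved.
resolved_cfr = case_fatality_rate(deaths_so_far + pending_deaths, confirmed)

# IFR if true infections are 10 times the confirmed count.
ifr = infection_fatality_rate(deaths_so_far + pending_deaths, confirmed, 10)

print(f"Naive CFR:    {naive_cfr:.1%}")
print(f"Resolved CFR: {resolved_cfr:.1%}")
print(f"IFR:          {ifr:.1%}")
```

With these made-up inputs the same death counts produce 3.4%, 5.0%, and 0.5%, depending solely on which denominator and which point in time are used.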

The solution to more accurate reporting seems clear: find more cases. But this isn’t as easy as it sounds. One recent study showed that in France, a paltry 10% of Covid-19 cases were actually detected [8].

Throwing in a Few More Complications

Complicating matters even further, detection rates also vary widely by geography. When you try to compare death rates between countries, there may be more than a 20-fold difference in identified cases [5].
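To see why differing detection rates make cross-country comparisons unreliable, consider a minimal sketch in which two hypothetical countries share exactly the same underlying infection fatality rate but confirm different fractions of their infections. The country labels, the 0.5% rate, the detection rates, and the `observed_cfr` helper are all assumptions made for illustration; the sketch also assumes every death is counted, which real data does not guarantee.

```python
def observed_cfr(true_ifr, detection_rate):
    """Return the CFR that would be reported if all deaths are counted
    but only `detection_rate` of infections are confirmed as cases."""
    if not 0 < detection_rate <= 1:
        raise ValueError("detection_rate must be in (0, 1]")
    return true_ifr / detection_rate


# Hypothetical: both countries share the same underlying rate of 0.5%.
TRUE_IFR = 0.005

# Hypothetical detection rates: A confirms half of its infections,
# B confirms only 1 in 20.
detection_rates = {"Country A": 0.50, "Country B": 0.05}

cfr_by_country = {
    name: observed_cfr(TRUE_IFR, rate)
    for name, rate in detection_rates.items()
}

for name, cfr in cfr_by_country.items():
    print(f"{name}: reported CFR {cfr:.1%}")
```

Under these assumptions the virus is equally deadly in both places, yet Country B reports a CFR ten times higher than Country A purely because it finds fewer of its mild cases.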

Other issues that have led to overestimates include not accounting for an aging population [6] or the presence of pre-existing medical conditions: the fatality rate for younger, healthier individuals is significantly lower than for older individuals with pre-existing conditions. Researchers at Johns Hopkins used machine learning to discover that age is the strongest predictor of who dies from Covid-19, with fatality rates ranging from 1% for the under-50s to a whopping 34% for those over age 85. However, those figures were based on patients who were symptomatic and are therefore also overestimates of death risk.
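The effect of age mix on a single headline figure can be illustrated with a weighted average. In this sketch, the under-50 and over-85 rates are the figures quoted above; the rate for the middle age group and both sets of population shares are hypothetical placeholders, and `crude_fatality_rate` is an illustrative helper rather than a method used by any of the cited researchers.

```python
# Fatality rates among symptomatic patients by age group.
# "under 50" and "over 85" are the figures quoted in the text;
# the "50 to 85" value is a hypothetical placeholder.
AGE_GROUP_RATES = {"under 50": 0.01, "50 to 85": 0.10, "over 85": 0.34}


def crude_fatality_rate(age_shares, rates=AGE_GROUP_RATES):
    """Overall fatality rate as the share-weighted average of group rates."""
    if abs(sum(age_shares.values()) - 1.0) > 1e-9:
        raise ValueError("age shares must sum to 1")
    return sum(share * rates[group] for group, share in age_shares.items())


# Hypothetical shares of symptomatic patients in each age group.
younger_mix = {"under 50": 0.70, "50 to 85": 0.28, "over 85": 0.02}
older_mix = {"under 50": 0.50, "50 to 85": 0.42, "over 85": 0.08}

younger_rate = crude_fatality_rate(younger_mix)
older_rate = crude_fatality_rate(older_mix)

print(f"Younger population: {younger_rate:.1%}")
print(f"Older population:   {older_rate:.1%}")
```

Even though each age group faces identical risk in both scenarios, the overall rate is noticeably higher for the older population, which is why an unadjusted figure can overstate the risk for a younger one.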

Calculating Covid-19 death rates in the population is a challenge, and case counts are unreliable. In general, we can say that the organizations that are better at identifying mild cases will have the most accurate figures. However, identifying which organization is more “accurate” at this task is a challenge in itself.

Data Doesn’t Always Tell the Right Story

The fact is, an analysis is only going to be as good as the data at hand. Collecting and analyzing data opens up a myriad of possible statistical biases, any of which can completely ruin your analysis. And then, assuming you have reliable data, it becomes a matter of clearly communicating your results to the general public, which, as the above example shows, is no easy task.

References

[1] Here’s Why COVID-19 Mortality Has Dropped

[2] Scientists are drowning in COVID-19 papers. Can new tools keep them…

[3] Covid-19: why are there different fatality rates? – Science Weekly …

[4] Three Misleading, Dangerous Coronavirus Statistics

[5] Estimating the Number of SARS-CoV-2 Infections and the Impact of Mi…

[6] Impact of Population Growth and Aging on Estimates of Excess U.S. D…

[7] A systematic review and meta-analysis of published research data on…

[8] COVID research updates: How 90% of French COVID cases evaded detection

Image: CDC (Public Domain)