# Surprising Correlation Found in Covid-19 Data

It took me about 30 minutes to notice a spectacular correlation between two core metrics related to the virus. It allows me to make better predictions about the evolution of this pandemic in the USA, and to provide possibly the best advice on how to reduce your risk of exposure, or at least how to buy some time in the war against this virus.

First, many metrics are useless; you need to pick the most reliable ones. For instance, the number of people who officially tested positive is meaningless: for each person who tested positive, at least 10 were positive at some point but not sick enough to get tested or require medical treatment, and are thus unaccounted for. See my previous article here for details. Based on that metric alone (death rate if testing positive), the projected deaths in the US would be well above 3 million. This is because we have had 5.6 million people officially tested positive and 174,000 deaths so far, that is, a 3% death rate (see here). While the number of people who were once positive is grossly underestimated, the number of deaths is not.
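The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the numbers quoted above (5.6 million positive tests, 174,000 deaths, and the assumed factor of at least 10 actual infections per positive test); the variable names are mine.

```python
# Figures quoted in the article (mid-August 2020).
confirmed_cases = 5_600_000
deaths = 174_000
# Assumption taken from the article: at least 10 actual infections per positive test.
undercount_factor = 10

# Naive death rate: deaths divided by officially confirmed cases.
naive_rate = deaths / confirmed_cases

# Adjusted death rate: same deaths, spread over the estimated true number of infections.
adjusted_rate = deaths / (confirmed_cases * undercount_factor)

print(f"naive: {naive_rate:.1%}, adjusted: {adjusted_rate:.2%}")
```

Since the factor of 10 is a lower bound, the adjusted rate of roughly 0.3% should be read as an upper bound under that assumption.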

My projection is below 600,000, and I will explain shortly how I came up with that much lower upper bound. A much more reliable statistic is the death rate per 100,000 inhabitants, per state, ranked from lowest to highest. If you combine that data with the population density, also broken down per state, you will find a remarkably high correlation. The data sources that I used are as follows:

Let us denote the death rate as R, and the population density as D. The correlation between log(R) and log(D) is 0.75. The figure below illustrates how the two variables are related:

Below is the full table, broken down by state:
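As a rough illustration of how the correlation between log(R) and log(D) can be computed, the sketch below uses only the Python standard library. The five (density, death rate) pairs are made-up placeholders, not the state data used in this article, and the helper name `pearson` is my own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical placeholder values, NOT the actual state data.
density = [10.0, 50.0, 100.0, 400.0, 1200.0]   # D: inhabitants per square mile
death_rate = [5.0, 30.0, 20.0, 60.0, 170.0]    # R: deaths per 100,000 inhabitants

# Take logarithms of both variables before correlating them.
log_d = [math.log(d) for d in density]
log_r = [math.log(r) for r in death_rate]

print(f"corr(log R, log D) = {pearson(log_r, log_d):.2f}")
```

Replacing the placeholder lists with the per-state values should reproduce the correlation reported above, assuming the same data sources are used.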

Of course, even the number of deaths is not a perfect metric. Death attribution may vary from state to state. Demographics and lock-down rules also have a big impact. Yet despite all the noise in the data, a strong pattern emerges: states with lower population density fare better on average, at least for now. So, perhaps even better than wearing a mask or social distancing, moving from a high population density area to a low one may be the safest thing to do. Maybe this remains true even within the same state.

The highest death rate per 100,000 inhabitants is below 200, and it is found in states that are now significantly improving (New York). It is therefore reasonable to assume that, in a worst-case scenario, all states will reach that threshold over time, resulting in roughly 600,000 deaths in the US.
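The worst-case figure follows from simple arithmetic. In the sketch below, the US population of roughly 328 million is my assumption (it is not stated in the article), and the function name is hypothetical.

```python
def projected_deaths(rate_per_100k, population):
    """Total deaths if the whole population reaches the given death rate per 100,000."""
    return rate_per_100k / 100_000 * population

# Assumption: approximate US population (not given in the article).
US_POPULATION = 328_000_000

# Ceiling of 200 deaths per 100,000, as quoted in the article.
ceiling = projected_deaths(200, US_POPULATION)
print(f"{ceiling:,.0f}")
```

Under that population assumption, a ceiling of exactly 200 gives about 656,000 deaths, while a rate of about 183 per 100,000 corresponds to 600,000. Since the highest observed rate is below 200, the rounded 600,000 figure is consistent with this calculation.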

Numbers highlighted in red in the column "deaths per 100,000 (logarithm)" are alarming. They represent states that, given their density, are still far from having felt the full impact and could significantly worsen: Ohio, California, Virginia, North Carolina, Tennessee, Wisconsin, West Virginia, Vermont, Maine, Oregon. By contrast, numbers highlighted in red in the column "deaths in the last 7 days" but not in the column "deaths per 100,000 (logarithm)" appear alarming but are in fact somewhat reassuring. They represent states that are getting closer to winning the war: Florida, Georgia, Texas.

Note: you can do the same analysis per county, for any state. I did it for WA, see my post and conclusions in the comment section below.


Comment by Lance Norskog on August 24, 2020 at 9:08pm

Yup, a joke from March: "the spread of coronavirus is mediated by two things: how dense is the population, and how dense is the population".

The retail drug trade will drive the spread for a long time. Retail drug buys are based around non-threatening displays, and lonely people (buyers & sellers) force each other to feign social interest. This means they happen with masks, and of course inside.

Comment by Mike Handley on August 23, 2020 at 6:49am

Totally agree with your assertion that the number of test-positives is an unreliable indicator of actual cases. And more to your metric - I never understood why there was so little discussion of densities early on, not that it wasn't baked in somehow. The experts in scarfs included the 100K leveling in all the narratives but p-density would be an obvious factor in arriving at an understanding of the propagation. In essence, it does map the propagation to some geo partition like say sq. mile(s), but as well would need to be weighted to account for the 'veins' of density (i.e., mass transit, Cosco Centers, flag burning protests, etc.).

Good stuff Vince - thank you

Comment by Vincent Granville on August 19, 2020 at 3:43pm

Brian, I believe that articles in news outlets (whether left or right on the political spectrum) are written by authors notoriously lacking (1) analytical thinking and (2) neutrality.

Let me put it this way. If a state X is going to reach 20,000 deaths and it is now at 15,000, it is closer to achieving eradication than another state Y that would also reach 20,000 deaths but is now at 5,000. Of course in state X the number of daily deaths may look very bad compared to state Y (precisely because it moves faster to resolution), but in the end, state X can see the end of the tunnel coming, and state Y can't see it yet, or believes that by some miracle it will never hit 20,000 deaths.

Comment by Brian Richmond on August 19, 2020 at 2:57pm

have you tried % population wearing masks and social distancing?

Comment by Brian Richmond on August 19, 2020 at 2:56pm

hmmm - the article concludes "states that are getting closer to winning the war: Florida, Georgia, Texas."

does "winning the war" here mean spreading COVID fastest?

August 19, 2020, CNN:
"Georgia, Texas and Florida lead the country in coronavirus cases per capita"
https://www.cnn.com/2020/08/19/health/us-coronavirus-wednesday/inde...

Comment by Bill Schmarzo on August 19, 2020 at 1:38pm

Comment by Vincent Granville on August 19, 2020 at 10:41am

Here are some more interesting findings. I checked the same data for WA state (where I live). You can do the same analysis for your own state. Data comes from these sources:

Each data point is now a WA county rather than a state. The correlation between log(death rate) and log(population density) is much weaker, about 0.39. The correlation between log(death rate) and log(total population) is a bit higher, about 0.50. The correlation between total population and population density is 0.87, a high value that does not surprise me.

The full data and results are available here. You can use the same data sources to get the data for any state. For counties where the number of deaths is 0, I assigned the value 0 to the logarithm of the death rate.
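The zero-death convention mentioned above can be written as a small helper. This is only a sketch of that rule; the function name and the sample county figures are hypothetical.

```python
import math

def log_death_rate(deaths, population):
    """Log of deaths per 100,000 inhabitants; 0 by convention when there are no deaths."""
    if deaths == 0:
        return 0.0  # log(0) is undefined, so the value 0 is assigned instead
    return math.log(deaths / population * 100_000)

# Hypothetical counties: (deaths, population)
for deaths, population in [(0, 12_000), (25, 250_000)]:
    print(round(log_death_rate(deaths, population), 3))
```

Note that a log value of 0 corresponds to a rate of 1 death per 100,000, so this convention treats zero-death counties as if they had that rate.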

Comment by Stephan Mathys on August 19, 2020 at 9:24am

This is a good analysis. Since many states have widely disparate population densities depending on the geographic location (for example, New York City versus the rest of New York State), what do you think it would take to be able to make a refinement? Would County level data suffice? Or even Zip code? I would bet you could see tighter correlation.

Also, I'm curious why you chose to present the variables as you did on the X- and Y- axes. In my mind, the independent variable would be the Log(Population density), and the dependent variable would be the # Deaths. To a trained observer (you, or I), we can read the chart either way around and know that the correlation exists.

But to the untrained observer (i.e. the casual reader, or maybe if this were to get picked up and inserted into a USA Today story), it appears as if the # of Deaths is independent (X-axis) and the Log(Population density) is dependent. I.e. Having more deaths would cause more population density. I know that's not what you meant, yet it could be easily interpreted that way.

Cheers, thanks for the analysis.

Comment by Vincent Granville on August 19, 2020 at 8:40am

Comment by Bill Schmarzo on August 19, 2020 at 5:36am

Hey Vincent, it looks like your calculations were done in a spreadsheet. Any chance that you can share that spreadsheet? Thanks! Bill