# Covid-19 Modeling: Impact of Missing Data and Ignoring Key Features

Many data scientists and epidemiologists have developed models and predictions, mostly alarming ones. Here I discuss issues that make most of these analyses meaningless, and how to fix these problems. Sadly, these are mistakes that even expert statisticians make in a variety of contexts, especially when dealing with a tiny sample that behaves radically differently from the whole population. You only see what is in the data, not what is missing from it, and you must think outside the box to imagine what lies beyond your narrow data set.

Several flaws are discussed below.

1. Very heavily censored data

Data is censored when important parts of it are missing because they were never captured. For instance, if you measure time-to-crime between the purchase of a gun and a crime committed with that gun, there are plenty of guns that have not been used in a crime yet, and also many that never will be. So you need to take this into consideration when designing models that drive public policy decisions.

We have the same situation here with Covid-19. I was lucky (as a data scientist) to be infected early, as were my family and some friends. Out of 10 people that I know, none were tested and we all recovered. So when reading statistics such as 20% of the people infected end up in a hospital, I knew something was very wrong with that number. My reasoning was this: if only one out of 20 infected people gets tested (it could be one out of 85, read this article), and that person has a 20% chance of going to the hospital while the other 19 individuals don't need to, then the actual rate of hospitalization is (19 x 0% + 1 x 20%) / 20 = 1%. A far cry from 20%. If 30% of those hospitalized die, that means the death rate is 0.3%.

Now some who do not get tested will die from the virus (in a hospital or elsewhere) and be unaccounted for. Thus the actual death rate should be above 0.3%, but not by much.
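The adjustment for untested cases can be sketched in a few lines of Python. This is only an illustration of the arithmetic, using the figures quoted in the text (one in 20 infected people tested, 20% of tested people hospitalized, 30% of hospitalized people dying). The function name `adjusted_rates` and its parameters are hypothetical, and the default of zero hospitalizations among the untested is the simplifying assumption made in the reasoning above.

```python
def adjusted_rates(tested_fraction, hosp_rate_tested,
                   death_rate_hospitalized, hosp_rate_untested=0.0):
    """Return (hospitalization rate, death rate) over ALL infected people.

    tested_fraction:         share of infected people who get tested
    hosp_rate_tested:        hospitalization rate among the tested
    death_rate_hospitalized: death rate among the hospitalized
    hosp_rate_untested:      hospitalization rate among the untested
                             (assumed 0 here, as in the text)
    """
    # Weighted average of the two groups (tested and untested).
    hosp_rate = (tested_fraction * hosp_rate_tested
                 + (1 - tested_fraction) * hosp_rate_untested)
    # Deaths counted here occur only among hospitalized patients.
    death_rate = hosp_rate * death_rate_hospitalized
    return hosp_rate, death_rate


hosp, death = adjusted_rates(tested_fraction=1 / 20,
                             hosp_rate_tested=0.20,
                             death_rate_hospitalized=0.30)
print(f"hospitalization rate: {hosp:.2%}, death rate: {death:.2%}")
# prints: hospitalization rate: 1.00%, death rate: 0.30%
```

Passing `tested_fraction=1 / 85` instead shows how strongly the result depends on the unknown share of infections that are actually tested, and a nonzero `hosp_rate_untested` gives a rough way to account for untested people who still end up hospitalized.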

Then the way death attribution is performed needs to be addressed. How do you decide that someone died from Covid-19? It is obvious in cases where the person was tested, but less so in other cases. If every person who tested positive and then died is reported as having died from the virus no matter what the real cause is (a car accident, for example), this will inflate the numbers, possibly explaining big discrepancies between countries using different counting methods.

Finally, false positives and false negatives need to be addressed, as they make the data (and thus the models and predictions) very messy. Blending various data sources instead of relying on a single one may help with this.

2. Using the wrong metrics

Death rate is not a good metric if the average dying patient is 75 years old. A much better metric is the average reduction in lifespan due to the virus. For the young and healthy who die (a minority), it may be a 60-year reduction in lifespan. For the young and very sick (cancer) or old patients (the vast majority), it could be less than 5 years. Biostatisticians worth their salt should be able to easily make these computations, broken down by population segment. Note that some who do not die may have permanent damage and may die 10 years from now, instead of in 20 years had they not been infected.
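A rough sketch of the lifespan-reduction metric, broken down by population segment, might look like the following. The shares of deaths per segment are purely hypothetical placeholders (the text gives no such figures), and the 5-year values are the upper bound mentioned above rather than measured numbers. Only the 60-year figure for the young and healthy comes directly from the text.

```python
# Each segment: (label, share of all deaths, years of life lost per death).
# The shares are HYPOTHETICAL and only illustrate the computation.
segments = [
    ("young and healthy", 0.05, 60.0),
    ("young and very sick", 0.10, 5.0),   # "less than 5 years": upper bound
    ("old", 0.85, 5.0),                   # "less than 5 years": upper bound
]


def mean_years_lost(segments):
    """Average reduction in lifespan per death, weighted by segment share."""
    total_share = sum(share for _, share, _ in segments)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("segment shares must sum to 1")
    return sum(share * years for _, share, years in segments)


average_loss = mean_years_lost(segments)
print(f"average reduction in lifespan per death: {average_loss:.2f} years")
```

With real data, each segment's share and years lost would come from mortality records and life tables rather than the placeholder values used here.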

Also, if there is a significant decrease in other deaths (heart attacks, flu, etc.), one has to wonder if Covid-19 killed some people who would otherwise have died from such ailments, especially older people in poor health. It could also be that some people dying from the flu are reported as having died from the virus.

Probably the best way to have a good picture of the situation is this: find out how many more deaths are occurring this week versus the same week the year before, in your area, and repeat the comparison week after week. At its peak, the count might be twice as high, clear evidence that the virus is a big killer. But computed over a 12-month period for a large area with little yearly variation (deaths in the US in 2020 vs. 2019), the factor is likely to be much closer to 1. This can be artificially reassuring though, because confinement is driving that factor well below its true value. It is simply spreading the problem over several years (hoping a vaccine will help at some point).
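The comparison of the same week across two years, and the same ratio over a full year, can be sketched as follows. The weekly death counts are hypothetical and chosen only to show how a peak weekly factor of 2 can coexist with a yearly factor close to 1. The function name `death_ratio` is likewise an assumption made for this illustration.

```python
def death_ratio(deaths_current, deaths_baseline):
    """Ratio of deaths in the current period to the baseline period."""
    if deaths_baseline <= 0:
        raise ValueError("baseline deaths must be positive")
    return deaths_current / deaths_baseline


# HYPOTHETICAL weekly death counts for one area: a flat baseline year,
# and a second year identical except for a four-week peak.
weekly_2019 = [1000] * 52
weekly_2020 = list(weekly_2019)
weekly_2020[13:17] = [1500, 2000, 2000, 1500]

# Same week compared across the two years.
weekly_ratios = [death_ratio(cur, base)
                 for cur, base in zip(weekly_2020, weekly_2019)]
# Same comparison over the whole 12-month period.
annual_ratio = death_ratio(sum(weekly_2020), sum(weekly_2019))

print(f"peak weekly factor: {max(weekly_ratios):.2f}")
print(f"12-month factor:    {annual_ratio:.3f}")
```

In this toy example the peak weekly factor is 2 while the 12-month factor stays below 1.06, which is the dilution effect described above.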

3. Ignoring critical metrics

This is where the cure may be worse than the disease. When I saw my favorite restaurant firing all its staff, I knew it would result in more than 10 million unemployed, many permanently, as a number of small businesses will never re-open. There is also a domino effect, with restaurant suppliers, farmers, landlords, and others being hit hard. It was the time to short the stock market heavily. This was highly predictable. Suicide, crime, drug abuse, civil unrest, and despair can also be measured by a "death rate" metric. The downturn also impacts the young more than older people, who for the most part survive on retirement benefits. For more on this, read this article. Another source claims that the economic downturn could kill hundreds of thousands of children worldwide, see here. More here.

How many of the 20+ million newly unemployed will lose their health insurance, worry about the cost of a hospitalization, or worry about contracting the virus in a hospital? I would expect to see a sharp decrease in the number of people going to the doctor or to the hospital. Some might see it as an indication that the pandemic is easing, but the actual cause is different and alarming. People who skip doctor visits or vaccinations because they have no money will be in worse health (expect an increase in measles cases). Ironically, some will become homeless, roaming the streets and spreading the virus, which defeats the purpose of confinement, since they will no longer have a place to confine themselves.

My last word is to be very cautious about what you read in the news, or what you hear from the government. Much of what is said comes from would-be experts or would-be statisticians who lack the big picture and cannot properly interpret data, or even measure it. Also, many politicians on both sides of the spectrum are probably very active in funding their propaganda agenda, including creating Facebook profiles and paying Facebook users to disseminate their fake information. Take with a big grain of salt anything you read on Facebook or in the news!

Comment by Lance Norskog on April 23, 2020 at 10:08pm

Actuaries have been calculating death rates for insurance companies for 150 years, and they're pretty good at it by now. It should be possible to compare the projected and actual death rates in a community or congressional district, and impute the remainder to either direct Covid-19 deaths, people dying of heart attacks in overloaded hospitals, or people avoiding hospitals when they should go in.

Comment by Vincent Granville on April 22, 2020 at 3:31pm