My original intent with this article was to write about how to understand statistics in general. However, with the global pandemic on everyone's minds right now, it seems blithe to write an article on understanding statistics without a nod to current events. If you're uncomfortable or unfamiliar with statistics, you might find the facts and figures surrounding Covid-19 hard to decipher. Let's break down the key statistics into plain English and shed a little light on a few terms that will help you digest the news with a little more knowledge.
What it means in plain English: This is how many people you'll pass the virus on to.
The basic reproductive number (R_{0}) measures how many people, on average, are infected by one infected person. Estimates of R_{0} for the Covid-19 virus are around 2.2. In other words, if you get Covid-19, you're likely to pass the virus along to about 2.2 other people. That R_{0} is relatively low (measles, by comparison, has one of the highest known values at 12-15), but that doesn't mean the virus is less of a problem. In fact, an outbreak can easily get out of hand if each infected person infects just two others: those two infect four, those four infect eight. That exponential growth means that a single infected individual can lead to hundreds of thousands of infections within a few short months.
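To see how quickly that snowballs, here's a back-of-the-envelope sketch (not a real epidemiological model: it ignores recovery, immunity, and every factor discussed below, and simply compounds R_{0} generation after generation):

```python
# Naive compounding of infections, generation by generation.
# R0 = 2.2 is the article's estimate; everything else is illustrative.
R0 = 2.2

cases = 1.0   # patient zero
total = 1.0   # cumulative infections so far
for generation in range(1, 11):
    cases *= R0   # each current case infects R0 new people
    total += cases
    print(f"generation {generation:2d}: {total:8,.0f} cumulative infections")
```

After just ten generations of transmission, one case has grown to several thousand; a few months of generations at this rate reaches the hundreds of thousands.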
A word of caution about R_{0}: the number is affected by many biological, sociobehavioral, and environmental factors. Feeding these factors into the relatively complex modeling process can result in a figure which is "...easily misrepresented, misinterpreted, and misapplied" (Delamater et al., 2019). So, if you're outside of epidemiology, you may want to consider any R_{0} a ballpark figure, rather than one set in stone.
In plain English: We don't know the exact number, but it's probably in between these two.
In the above paragraph, I stated that R_{0} is about 2.2. In statistical terms, we quantify that "about" with a confidence interval.
Confidence intervals are really just stating (in a pretty complicated way) that a figure lies somewhere between two amounts. For Covid-19, R_{0} is actually thought to be anywhere between 1.4 and 3.9 (the confidence interval). In other words, the researcher is confident that the true figure lies somewhere in that interval. But that raises the question: how confident are they? The answer lies in the confidence level.
In plain English: I'm this confident you would get the same results if you repeat my experiment/survey.
For the above statistic (from Dr. Richard Ellison's Transmission of the Novel Coronavirus: Early Findings), the stated confidence level for R_{0} is 95%:
95% CI, 1.4–3.9
What a 95 percent confidence level is saying is that if the study was repeated over and over again, the intervals calculated from those new results would contain the true R_{0} 95% of the time.
While confidence levels can go as high as 99%, this is very, very hard to achieve with Covid-19, when vast numbers of mild cases go unreported. In a perfect world, you could theoretically get a 100% confidence level, but if the world were perfect, we wouldn't need statistics (we'd just have indisputable facts!).
Another example: Dr. Ellison's article also includes a confidence interval and level for the estimated mean incubation period: 5.2 days (95% confidence interval, 4.1–7.0). This can be interpreted as: if the survey/research were repeated, then 95% of the time the calculated interval would contain the true mean incubation period, which is estimated to fall somewhere between 4.1 and 7.0 days.
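You can watch this "95% of the time" behavior happen in a small simulation. The sketch below invents a population with a "true" mean incubation of 5.2 days (borrowing the article's figure purely as a stand-in; the spread and sample size are made up), repeatedly draws samples, and checks how often the standard 95% interval captures the truth:

```python
import random
import statistics

random.seed(42)
TRUE_MEAN, SD, N = 5.2, 2.0, 100   # hypothetical population and sample size

covered = 0
trials = 1000
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se   # normal-approximation 95% CI
    if lo <= TRUE_MEAN <= hi:
        covered += 1

print(f"{covered / trials:.1%} of intervals captured the true mean")
```

The printed coverage comes out close to 95%, which is exactly the promise a 95% confidence level makes.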
In plain English: The bulk of people are in this range of numbers.
This next statistic was reported in a January 2020 article in The Lancet, titled Clinical features of patients infected with 2019 novel coronavirus ...:
"Median age was 49·0 years (IQR 41·0–58·0)".
In case you're not familiar with the median, it's the middle number of an ordered set. The median of 1, 2, 3 is 2: it's right in the center of the ordered list.
The IQR, the interquartile range, is a spread around the median which tells you more about where the bulk of cases lie. You basically take an ordered set of numbers (in this case, a list of age in years from smallest to largest), and chop them up into quarters with the median in the exact middle. Discard the bottom quarter and the top quarter, and what you have left is the IQR: a range from 41 to 58 years old. In other words, most patients are in this age range.
In plain English: Here's how to compare some groups to see if they are different.
The above mentioned Lancet article also refers to continuous variables (compared with the Mann-Whitney U test) and categorical variables (compared by χ^{2} test or Fisher's exact test). In order to perform statistical analysis, you have to identify the type of variable you're working with. There are dozens of different types of variables, some of which are esoteric, but continuous and categorical variables are fairly straightforward.
Continuous variables go on, and on. Technically, you could try to count their possible values, but you'd never stop counting. For example, try counting the number of stars in the universe, or the number of seconds from now until the end of time. The Mann-Whitney U test is a way to compare two such groups (in this case, data from ICU and non-ICU patients) to see if their medians (for example, for time to recovery or time to development of complications) are the same.
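The heart of the Mann-Whitney U test is surprisingly simple. Here's a sketch on invented recovery times in days (the group labels and every number are hypothetical, not the Lancet data): U counts, across every cross-group pair, how often a value from one group beats a value from the other, with ties counting half.

```python
# Hypothetical recovery times (days) for two groups of patients.
non_icu = [4, 7, 9, 11, 14]
icu = [8, 12, 13, 16, 20]

# U: for each (non-ICU, ICU) pair, score 1 if the non-ICU value is
# larger, 0.5 for a tie, 0 otherwise.
u = sum(1.0 for x in non_icu for y in icu if x > y)
u += sum(0.5 for x in non_icu for y in icu if x == y)
print(f"U = {u:.0f} out of {len(non_icu) * len(icu)} pairs")
```

In practice you'd hand the two samples to a library routine such as scipy.stats.mannwhitneyu, which also converts U into a p-value; the counting above is just what's happening under the hood.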
Categorical variables are those data points that fall neatly into categories. For example, preexisting conditions: diabetes, heart disease, lung problems. Unlike continuous variables, categorical variables can easily be expressed as percentages and compared by a chi-square (χ^{2}) test or Fisher's exact test (the latter is more accurate for smaller groups than chi-square). Don't be scared by these exotic sounding names: they just compare groups. In a more general sense, they test to see whether distributions of categorical variables differ from one another.
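To make "Fisher's exact test" less exotic, here's the idea worked out on a hypothetical 2x2 table (a preexisting condition, yes/no, against ICU vs non-ICU status; every count is invented). The test sums the exact probabilities of the observed table and every more extreme one, holding the row and column totals fixed:

```python
from math import comb

# Hypothetical counts: condition present / absent in each group.
a, b = 3, 10   # ICU: condition, no condition
c, d = 1, 20   # non-ICU: condition, no condition
n = a + b + c + d

def table_prob(k):
    # Hypergeometric probability of a table with k condition-positive
    # ICU patients, keeping all row and column totals fixed.
    return comb(a + b, k) * comb(c + d, (a + c) - k) / comb(n, a + c)

# One-sided p-value: the observed table plus every more-extreme table
# (condition-positive patients even more concentrated in the ICU row).
p = sum(table_prob(k) for k in range(a, min(a + b, a + c) + 1))
print(f"one-sided Fisher's exact p = {p:.3f}")
```

In practice you'd use scipy.stats.fisher_exact (or chi2_contingency for larger tables) rather than rolling your own, but this is the "exact" computation the name refers to.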
In plain English: What are the chances I'm completely wrong?
When you run a statistical test, you choose an "alpha level". Alpha is the probability of making the wrong decision when the null hypothesis is true. If you've never heard the term "null hypothesis", you can think of it as the currently accepted situation. In the case of Covid-19, for example, the null hypothesis might be that "Mortality among all infected patients may be in the range of 0.5% to 4%" (Murthy et al., 2020). Your theory (the alternate hypothesis) is that the figure is much higher. With an alpha level of 5% (the standard), you have a 5% chance of concluding that the accepted mortality range is wrong when it's in fact correct.
At this point, you may be wondering why you don't set your alpha level at 0.0001% so that you will almost certainly report the right results. In simple terms, you can't, because statistical tests are a balancing act. If you lower the risk of one thing happening (here, wrongly rejecting a true null hypothesis), you'll increase the probability of making a wrong decision about something else (in this example, you would increase the risk of a "Type II error": failing to detect a difference that really exists).
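A small simulation shows what alpha = 5% means in practice: when the null hypothesis really is true, a two-sided test at the 5% level still rejects it about 5% of the time. All the numbers below are invented for illustration:

```python
import random
import statistics

random.seed(7)
NULL_MEAN, SD, N = 2.0, 1.0, 50   # the null hypothesis is actually true

false_alarms = 0
trials = 2000
for _ in range(trials):
    sample = [random.gauss(NULL_MEAN, SD) for _ in range(N)]
    se = statistics.stdev(sample) / N ** 0.5
    z = (statistics.mean(sample) - NULL_MEAN) / se
    if abs(z) > 1.96:   # reject the (true) null at the 5% level
        false_alarms += 1

print(f"false alarm (Type I error) rate: {false_alarms / trials:.1%}")
```

The printed rate hovers around 5%: that's alpha at work, and shrinking it toward zero is exactly what inflates the Type II error instead.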
In plain English: We did the math and it looks good.
This is a wordy way of saying your results have been tested and found to be sound. Most of the facts and figures you see bandied about on social media haven't been put to the (statistical) test, so to speak. However, if you read a figure in a journal article (like those I've mentioned in this article), someone has done the math, and it all checks out.
In plain English: What are the odds this happened by chance?
This journal article reported that hypertension increases the risk of mortality in the Covid-19 outbreak, with a p-value of .0008. The p-value, or probability value, is easier to understand if you convert it to a percentage: 0.08%. This means that if chance alone were at work, you would see results at least this extreme only 0.08% of the time. To put that another way, it's very unlikely these findings are down to random chance. Larger p-values (over 5%) equate to more uncertainty: you can't be sure about your results one way or another.
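A coin makes the idea concrete (this is an invented example, not the hypertension data). Suppose a coin lands heads 16 times in 20 flips: the p-value is the probability that a fair coin would produce a result at least that lopsided, in either direction.

```python
from math import comb

n, heads = 20, 16

# Exact binomial tail probabilities for a fair coin.
upper = sum(comb(n, i) for i in range(heads, n + 1))      # 16..20 heads
lower = sum(comb(n, i) for i in range(0, n - heads + 1))  # 0..4 heads
p = (upper + lower) / 2 ** n                              # two-sided p-value
print(f"p-value = {p:.4f}")
```

The result is well under the usual 5% cutoff, so you'd conclude the coin is probably not fair; the same logic, with much heavier machinery, sits behind the .0008 in the journal article.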
Do you have another term related to Covid-19 that needs an explanation? Check out these statistics definitions, which cover (with a confidence level of 90%) every statistical topic you're likely to come across.
Delamater et al. (2019). Complexity of the Basic Reproduction Number (R0).
Murthy et al. (2020). Care for Critically Ill Patients With COVID-19.
Transmission of the Novel Coronavirus: Early Findings
Early dynamics of transmission and control of COVID-19: a mathemati...
Manual for the Laboratory-based Surveillance of Measles, Rubella, a...
Comment
Left censored. We are only seeing the worst cases (only the very ill). So the various rates can be misleading until widespread testing (even of "healthy people").
"Generally, humans are immuned after infection by all of the "common cold" Coronovirii (I think there are 4?) for 6 to 9 months. That's why you get it every winter."
That's probably true of some viruses but certainly not all. The adaptive immune system (i.e. lymphocytes, or T-cells/"killer cells") kills off viruses; most of the killer cells die off and a small number of "memory" cells hang out in the body, perhaps all your life, waiting to nab/suppress the germs if they come back. Most people have a variety of viruses that the immune system keeps bottled up all their life (hence the reason that if you destabilize the immune system enough you may allow reactivation of things like TB, shingles or even nastier things).
Thanks, Paul. I'm going to look at modeling a little more in depth. I agree that common sense sometimes goes out of the window when it comes to statistics. Sometimes people just get lost in the data and alienate themselves from the real world.
Good write-up and interesting references to the COVID-19 problem.
It might be worth doing a similar analysis of the use of statistics (specifically forecasting/modeling) as it relates to COVID-19. From what I can see, the UK Imperial college analysis that seemed to finally change minds in the White House last weekend is based upon the assumption that this virus will continue unabated until a vaccine is developed. The result is projections of over a million dead in the US and multiple recurrences of the epidemic which would essentially wreck the US economy (and presumably the world also), pretty much regardless of whatever social distancing measures are used.
For that to happen, you'd have to believe that the human immune system in infected individuals does not develop an immunity to the virus, which seems pretty questionable in my view. We already know, for example, that the virus becomes far less deadly in younger people. There's a clear relationship between mortality and increasing age (this is typical of many viruses such as MERS and SARS and is presumably due to the decreasing abilities of the adaptive immune system as people age), so that roughly 80% of the fatalities are older people. Researchers have also apparently discovered that macaques infected with the virus develop an immunity, which would suggest the same thing might happen in humans.
It’s understandable that people who are modeling COVID-19 can’t make an assumption about facts that haven’t been established. But it would seem to me that they are violating the central rule of modeling in not providing “sensitivity analysis” on the issue of whether immunity might end this pandemic sooner than the 18 months it takes to develop a vaccine.
Unfortunately, the result of all this is horrendous projections about what might happen. You'd think that statistical modelers might be a bit more careful about what they're doing and provide different perspectives.
Anyway, food for thought. Technically, this is not a statistical/data science issue but it probably illustrates how statisticians/data scientists sometimes need to enforce a little common sense amongst the subject matter experts.
Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand
Date: 16 March 2020
Authors: Imperial College COVID-19 Response Team
https://www.imperial.ac.uk/media/imperial-college/medicine/sph/ide/...
© 2020 Data Science Central