Statistical analysis is widely misunderstood, resulting in an array of problems. There are many ways to ruin a data analysis; here are four of the less well-known mistakes to watch out for.

I recently came across an article in eLife (a peer-reviewed journal) called Ten common statistical mistakes to watch out for when writing or reviewing a manuscript [1]. Although the paper was geared toward reviewers of research papers, many of the items apply to data analysis in general. According to the authors, mistakes happen because of a poor choice of experimental design, incorrect analysis tools, or flawed reasoning. Fairly well-known problem areas mentioned in the paper include spurious correlation, misleading graphs, and failure to correct for multiple comparisons. However, there are many less obvious ways to taint your results. The following four categories are surprisingly common analysis pitfalls, even in peer-reviewed, published research.

1. Comparing Groups Indirectly

Comparing two groups isn't as easy as running a t-test and drawing a conclusion about effect sizes. For example, you might find a significant effect in one group and not another. Based on that, you might conclude that the effect in the "significant" group is larger than in the non-significant group. But this is a faulty conclusion: two groups can have near-identical correlations even if one has a statistically significant result and the other does not. Here's an example the authors of Ten Common Statistical Mistakes shared about how this may happen:

The above image shows two groups (blue, red) that share a similar correlation. However, if you were to compare each group's correlation to zero with Pearson's r, it's possible to find that one group has a statistically significant correlation while the other does not. 
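This scenario is easy to reproduce with a quick simulation (a sketch of my own, not the authors' code): draw two groups from the same underlying correlation but with different sample sizes, and only the larger group clears the significance bar.

```python
# Sketch (not the authors' code): two groups drawn from the SAME
# underlying correlation (rho = 0.3) can still disagree on
# "significance" when their sample sizes differ.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def sample_correlated(n, rho):
    """Draw n bivariate-normal points with true correlation rho."""
    cov = [[1, rho], [rho, 1]]
    x, y = rng.multivariate_normal([0, 0], cov, size=n).T
    return x, y

# Group "blue": large sample; group "red": small sample.
for name, n in [("blue", 100), ("red", 15)]:
    x, y = sample_correlated(n, rho=0.3)
    r, p = stats.pearsonr(x, y)
    print(f"{name}: r = {r:.2f}, p = {p:.3f}")
```

With the small group, the p-value routinely lands above 0.05 even though the true correlations are identical, which is exactly why comparing each group to zero separately is misleading.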
If you were to conclude that the groups' correlations differ based on this analysis, you would be wrong.

The solution is to only compare groups directly. Once you break the groups apart and start comparing each one to something other than the other group (like zero or a hypothetical mean), you're going to run into problems. Correlations can be compared directly with Monte Carlo simulations; ANOVA may also work for group comparisons.

2. P-hacking

If you get a large p-value from your analysis but were expecting a small (i.e. significant) one, don't be tempted to use hacks in the hunt for significance. For example, don't be tempted to:

Add covariates,
Remove outliers or other data points post hoc,
Switch outcome parameters,
Vary the analysis pipeline.

All of these make it more likely you'll get significant results, which means it's more likely you'll get a false positive. To ensure your results are solid, follow standard analytic approaches, and don't deviate from the plan once you've started. If you do make changes after you've started, clearly delineate planned from exploratory findings. If possible, repeat the process with a replication study to confirm your results.

3. Circular Analysis

Circular analysis is "any form of analysis that retrospectively selects features of the data to characterise the dependent variables, resulting in a distortion of the resulting statistical test" [1]. In essence, it involves recycling data, or "double dipping," to get what you want: for example, splitting the data set into subgroups, or binning or otherwise reducing the complete data set. This is fine when you are planning your analysis, but never sort, filter, or tinker with your data set during analysis unless you're clearly using exploratory methods. Otherwise you run the risk of distorted results and invalid statistical inferences.

To avoid circular analysis, define the analysis criteria in advance and independently of the data. 
Alternatively, use bootstrapping to specify parameters, and test predictions on a different dataset (or a held-out subset of the dataset).

4. Small Sample Sizes

Small samples are tricky and come with problems like:

Bigger, biased effect sizes for significant effects,
Potentially missed effects (Type II errors),
Problems assessing normality (an assumption for many parametric tests),
P-values with limited practical value.

There's no magic number for how small is "too small," but use caution with parametric tests if you only have a handful of data points. To avoid issues with small samples, consider using Bayesian statistics to determine the power for an effect post hoc. Also consider replication studies to confirm your findings.

References

[1] Ten common statistical mistakes to watch out for when writing or reviewing a manuscript
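The bootstrapping idea mentioned above can be sketched in a few lines (a minimal illustration of a percentile bootstrap, not the paper's specific procedure): estimate a parameter and its uncertainty by resampling, rather than reusing the same data to both select and test a hypothesis.

```python
# Minimal bootstrap sketch (an illustration, not the paper's procedure):
# quantify uncertainty in an estimate by resampling the data with
# replacement, rather than double-dipping into the same observations.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)  # toy sample

def bootstrap_ci(sample, stat=np.mean, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for a statistic."""
    boots = [stat(rng.choice(sample, size=len(sample), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_ci(data)
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```

Parameters estimated this way can then be tested against a separate (held-out) portion of the data, keeping selection and testing independent.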

Discriminative and generative models have distinct differences. Discriminative methods are simpler, but not necessarily better. This one picture outlines a few major differences between the methods, along with a few examples and use cases.

Read any textbook on the difference between generative and discriminative models and you'll find the explanation is usually less than intuitive. One of the better explanations I've read [1] is the following analogy: you are talking to someone who is speaking a language you don't understand. Your task is to figure out what language they are speaking (without the use of Google Translate!). You have two options:

Generative method: Learn each language, then use that knowledge to determine which language is being spoken.
Discriminative method: Learn the linguistic differences without actually learning the languages.

As you may be able to tell from this simple analogy, discriminative methods are much simpler. But there are situations when you might want to use generative methods instead. For example, generative models are much better for finding missing values (i.e., they can generate data) [2]. However, if classification accuracy is your goal, a discriminative method, which "discriminates" (classifies) directly, is likely the better choice. In sum: generative models learn how each class produces its data, and so can do more (like filling in missing values), while discriminative models learn only the boundary between classes, which is simpler and often more accurate for classification.

References

[1] Machine Learning: Generative and Discriminative Models
[2] Reasoning about Missing Data in Machine Learning
[3] Jebara, T. (2019). Machine Learning: Discriminative and Generative. Springer.
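To make the language analogy concrete, here is a toy sketch (my own illustration, not from the references): the generative approach models each "language" fully via letter frequencies, while the discriminative approach only learns a distinguishing feature.

```python
# A toy version of the language analogy (illustrative only):
# generative = model each language's letter frequencies and compare
# likelihoods; discriminative = learn one distinguishing feature.
from collections import Counter
import math

lang_a = "the cat sat on the mat and the dog ran"   # toy "English"
lang_b = "der hund und die katze sind im garten"    # toy "German"

def letter_freqs(text):
    """Generative step: model a language as letter probabilities."""
    letters = [c for c in text if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {c: counts[c] / total for c in counts}

def log_likelihood(text, freqs):
    """Score a sentence under a language model (light smoothing)."""
    return sum(math.log(freqs.get(c, 1e-6)) for c in text if c.isalpha())

def classify_generative(text):
    fa = log_likelihood(text, letter_freqs(lang_a))
    fb = log_likelihood(text, letter_freqs(lang_b))
    return "A" if fa > fb else "B"

def classify_discriminative(text):
    # Only a boundary: "k" and "z" are absent from language A's sample.
    return "B" if ("k" in text or "z" in text) else "A"

sentence = "die katze"
print(classify_generative(sentence), classify_discriminative(sentence))
```

The generative classifier builds a full model of each language and can also generate plausible text from it; the discriminative rule is far cheaper but tells you nothing about the languages themselves.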

An interesting application of Benford's and Zipf's laws to fraudulent Covid-19 data was published recently on the peer-reviewed PLOS ONE website. The study, titled On the authenticity of COVID-19 case figures [1], showed how the two laws could be used to identify fraudulent Covid-19 data.

The "Usual" Applications of Benford's and Zipf's Laws

Benford's law is a probability distribution for the likelihood of the first digit in a set of numbers [2]. The idea is that there are set probabilities for certain first digits (1, 2, 3, …) occurring in certain lists of numbers. The probability of a particular leading digit d is given by the formula P(d) = log10(1 + 1/d). As well as the fairly mundane application of analyzing data in written texts like The Times or Reader's Digest, it can be applied to analyze large sets of naturally occurring numbers like stock market prices, income distributions, or river drainage rates.

Zipf's law, which gives rise to the Zeta probability distribution, states that, given a list of the most frequent words in a random book, the most common word will appear about twice as often as the second most common, three times as often as the third most common, and so on. Practical applications include analysis of data that has large steps down from the top, like the drop from executive salaries to management salaries, or the number of book sales by top authors (like Stephen King or J.K. Rowling) compared to lesser-known authors.

Interestingly, both distributions can also be employed to detect fraud. Benford's law gives the probability of a particular digit appearing first, so any sufficiently large set of numbers should follow those probabilities. If fraudsters are constantly inflating figures (e.g., increasing their average bill from $100 to $900), the data set will upset the natural order of the data, violating Benford's law and producing an anomaly that can be picked up by an algorithm. 
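A Benford check of this kind takes only a few lines (a quick sketch of my own, not the study's code): compare a data set's observed first-digit frequencies with the expected probabilities P(d) = log10(1 + 1/d).

```python
# Quick Benford's-law check (illustrative, not the study's code):
# compare observed first-digit frequencies with the expected
# probabilities P(d) = log10(1 + 1/d).
import math
from collections import Counter

def first_digit(x):
    return int(str(abs(x)).lstrip("0.")[0])

def benford_deviation(values):
    """Mean absolute difference between observed and expected
    first-digit frequencies (larger = more suspicious)."""
    digits = [first_digit(v) for v in values if v != 0]
    obs = Counter(digits)
    n = len(digits)
    return sum(abs(obs.get(d, 0) / n - math.log10(1 + 1 / d))
               for d in range(1, 10)) / 9

# Multiplicative growth (like early-epidemic case counts) tends to
# follow Benford's law; a flat, made-up series does not.
natural = [round(1.5 ** k) for k in range(1, 60)]
suspect = [500 + 3 * k for k in range(60)]  # fabricated-looking figures
print(benford_deviation(natural), benford_deviation(suspect))
```

In practice a formal goodness-of-fit test (e.g. chi-square) would replace the simple deviation score here, but the idea is the same: fabricated series stand out against the Benford baseline.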
The law has a long history of detecting both financial and non-financial fraud, including electoral fraud [3], scientific fraud [4], and fake law enforcement statistics [5]. Kennedy and Yam, the authors of this new study, took an unusual step forward and applied Zipf's law (in addition to Benford's) to finding fraud in published Covid-19 data.

Fraud in Covid-19 Data

In the early days of the pandemic, it was difficult (if not impossible) to ascertain how serious the Covid-19 outbreak actually was. For example, Chinese authorities downplayed the outbreak's threat, allowing countless numbers of carriers to travel globally [6]. But how can we tell whether the case numbers coming out of China (and indeed, every other country) were fraudulent? Were countries deliberately dampening figures in order to keep tourism alive? One way to answer these questions is to apply the two laws above to the data and see if any anomalies crop up. This is exactly what Kennedy and Yam did in their study.

One of the questions the authors tried to address was whether fraudulent data, when found, came from a central government or from local authorities feeding incorrect figures to the top. The authors found that Zipf's law can help answer this question: significant deviations from Zipf's law might indicate fraud in figures reported by a central government, while figures that largely follow Zipf's law with a few deviations at the sub-national level might indicate fraud by regional authorities. The authors also noted that during the early days of an epidemic, when there are few cases per capita, the number of confirmed cases seems to obey Benford's law to a "large degree." 
This gives a very simple and practical way to ascertain whether reported case numbers in any pandemic, especially in the early days, are likely to be fraudulent. Applying these laws and identifying which data is false and which is true could result in a much faster, better response to future pandemics.

It's worth noting, though, that at this stage the study's results are largely theoretical, as they have only been applied to the current Covid-19 pandemic, which has resulted in government action on an unprecedented scale. Further study of past and future pandemics is necessary to confirm the validity of applying these statistical laws to pandemic data.

References

[1] On the authenticity of COVID-19 case figures
[2] Frunza, M. (2015). Solving Modern Crime in Financial Markets: Analytics and Case Studies. Academic Press.
[3] Klimek, P., Jiménez, R., Hidalgo, M., Hinteregger, A., & Thurner, S. (2018). Forensic analysis of Turkish elections in 2017-2018. PLOS ONE, 13(10). pmid:30289899
[4] Diekmann, A. (2007). Not the first digit! Using Benford's law to detect fraudulent scientific data. Journal of Applied Statistics, 34(3), 321–329.
[5] Deckert, J., Myagkov, M., & Ordeshook, P.C. (2011). Benford's law and the detection of election fraud. Political Analysis, 19(3), 245–268.
[6] 8 times world leaders downplayed the coronavirus and put their countries at greater risk for infection

Covid-19 image: CDC (Public Domain).


Ten-fold differences have been reported in Covid-19 death rates. Estimate issues arise because of how the data is calculated. Here's why the same data can yield gross overestimates and underestimates.

What is the actual death rate for Covid-19? After nearly a year of the pandemic, no one can agree on an answer. Depending on which expert you ask, it's somewhere between 0.53% and 6% for the general population, and possibly as high as 35% for older patients with certain pre-existing conditions. These widely varying figures illustrate how difficult it is to take data and make predictions, even with the best data and machine learning tools at your disposal.

Part of the problem is clarity: news sources and blogs in particular cite statistics without clarifying exactly which statistic they are talking about. For example, one MedPageToday article [1] mentions a "fatality rate"; the article doesn't make it immediately clear whether that fatality rate applies to hospitalized cases, cases who have tested positive, or the population as a whole (each of these rates would be vastly different). Despite Covid-19 fueling one of the largest explosions of scientific literature in history [2], we're not even close to accurately figuring out what percentage of the population the virus actually kills. All of the reported figures amount to nothing more than data-driven guesswork.

The Anatomy of a Death Rate Calculation

A recent episode of The Guardian's Science Weekly Update [3] addressed the question of why Covid-19 fatality rates vary so much. In the program, Paul Hunter, professor of medicine at the University of East Anglia, explains that figures such as the World Health Organization's reported death rate of 3.4 percent were calculated as the number of Covid-19 related deaths (as recorded on death certificates) divided by the number of confirmed cases (based on positive Covid-19 tests). That figure, called the case fatality rate (CFR), is a statistic that Dr. Robert Pearl 
calls "inaccurate and misleading" [4]. Why? Depending on how you look at it, it's either a gross underestimate or a gross overestimate.

The estimation issues arise from how the figures are calculated. The CFR records people at the beginning of their illness and at the end; people who are still ill when the data is recorded may go on to die after the figures have been tallied. The 3.4% is therefore an underestimate: the people who are currently sick and go on to die will push that percentage up to around 5 to 6%.

Although around 5% might seem like a reasonable estimate (probably one that matches figures you've often seen in the news), note how the figure is obtained in the first place: the number of deaths divided by the number of confirmed cases. There is an unknown number of people with the virus who never get tested, possibly up to 10 times the official counts [5]. If we could count all of these cases, most of which are probably mild or asymptomatic, the death rate would be significantly lower, meaning that 3.4% is actually an overestimate. The actual number of deaths relative to the actual number of infections in the population is called the true infection fatality rate (IFR), and it may be as low as 0.53% [7].

The solution to more accurate reporting seems clear: find more cases. But this isn't as easy as it sounds. One recent study showed that in France, a paltry 10% of Covid-19 cases were actually detected [8].

Throwing in a Few More Complications

Complicating matters even further, detection rates also vary widely geographically. When you try to compare death rates between countries, there may be more than a 20-fold difference in identified cases [5]. 
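The underestimate/overestimate logic can be made concrete with toy numbers (all hypothetical, chosen only to echo the figures above):

```python
# Toy numbers (hypothetical, for illustration only) showing how the
# same outbreak yields very different "death rates" depending on the
# numerator and the denominator.
deaths_so_far = 34          # deaths recorded at tally time
confirmed_cases = 1000      # positive tests
still_sick = 300            # unresolved cases, some of whom will die
undetected_factor = 10      # assumed mild/untested multiplier [5]

# Naive CFR: deaths / confirmed cases (the WHO-style figure)
naive_cfr = deaths_so_far / confirmed_cases

# Underestimate: some still-sick patients die after the tally
eventual_deaths = deaths_so_far + 0.05 * still_sick  # assumed 5% of open cases
resolved_cfr = eventual_deaths / confirmed_cases

# Overestimate: true infections far exceed confirmed cases
true_infections = confirmed_cases * undetected_factor
ifr = eventual_deaths / true_infections

print(f"naive CFR: {naive_cfr:.1%}, "
      f"resolved CFR: {resolved_cfr:.1%}, IFR: {ifr:.1%}")
```

The same raw counts produce a 3.4% naive CFR, a roughly 5% CFR once open cases resolve, and an IFR near 0.5% once undetected infections enter the denominator, which is the whole spread reported in the news.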
Other issues that have led to overestimates include not accounting for an aging population [6] or for the presence of pre-existing medical conditions; the fatality rate for younger, healthier individuals is significantly lower than for older individuals with pre-existing conditions. Researchers at Johns Hopkins used machine learning to discover that age is the strongest predictor of who dies from Covid-19, with fatality rates ranging from 1% for the under-50s to a whopping 34% for those over age 85. However, those figures were also based on patients who are symptomatic and are therefore also an overestimate of death risk.

Calculating Covid-19 death rates in the population is a challenge, and case counts are unreliable. In general, we can say that the organizations that are better at identifying mild cases will have the most accurate figures. However, identifying which organization is more "accurate" at this task is a challenge in itself.

Data Doesn't Always Paint the Right Picture

The fact is, an analysis is only going to be as good as the data at hand. Collecting and analyzing data opens up a myriad of possible statistical biases, any of which can completely ruin your analysis. And then, assuming you have reliable data, it becomes a matter of clearly communicating your results to the general public: a matter which, as the above example shows, is no easy task.

References

[1] Here's Why COVID-19 Mortality Has Dropped
[2] Scientists are drowning in COVID-19 papers. Can new tools keep them afloat?
[3] Covid-19: why are there different fatality rates? – Science Weekly Podcast
[4] Three Misleading, Dangerous Coronavirus Statistics
[5] Estimating the Number of SARS-CoV-2 Infections and the Impact of Mitigation Policies in the United States
[6] Impact of Population Growth and Aging on Estimates of Excess U.S. 
Deaths During the COVID-19 Pandemic, March to August 2020
[7] A systematic review and meta-analysis of published research data on COVID-19 infection fatality rates
[8] COVID research updates: How 90% of French COVID cases evaded detection

Image: CDC (Public Domain)


Data analysis is fraught with pitfalls, including too-small sample sizes. Bias can creep into the most well-intentioned of studies. Here are some tips to avoid bias and choose the best statistical test.

So you've formed your breakthrough hypothesis, created a bulletproof test procedure, and waited eagerly for the results to come in. To your surprise, the magnificent effect you were sure of just isn't there. What went wrong? Was it the way your hypothesis was worded? An error in the statistical calculations or the way the data was collected? While many errors can easily creep into studies, one of the most likely suspects is simply that your study had too few participants to show an effect. Other common pitfalls include misusing an appropriate statistical test, or choosing the wrong one in the first place. And let's not forget bias: if you are sure that you don't have any bias in your results, then you're probably wrong (and might want to check again!).

Sample Sizes Too Small

Tom Brady, writing for SealedEnvelope.com, notes that "Many studies are too small to detect even large effects" [1]. To ensure your carefully planned (and likely very expensive) study stands a good chance of showing effects, you have to pick the correct sample size. Too large a sample and you'll run out of cash, fast. Too small a sample and you're doomed to failure before you can run that glorious chi-square test. So the question becomes: what is the "ideal" sample size? Unfortunately, there isn't a clear answer; finding the right number is more of an art than a science. A few general tips to get you started [2]:

Conduct a census. That is, if at all possible, ask everyone in your population. This works well if you have 1,000 potential data points or fewer.

Use a sample size from a similar study. It's hard to reinvent the wheel, but you may not have to. The chances are, someone, somewhere has performed a similar study.
Search the literature (Google Scholar might be a good place to start) to see if you can locate another study. If your study is fairly generic, you might also be able to identify an optimum sample size from a published table or an online sample size calculator.

Use a formula like Cochran's sample size formula. These aren't always easy to apply, because you'll usually have to know a little about what you're expecting to find. For example, Cochran's formula requires you to make a guess about the proportion of the population that has the attribute you're interested in.

Bias

Bias is a systematic tendency for your results to overestimate or underestimate the population parameter of interest. It's practically impossible to completely avoid every bias; there are dozens of ways it can creep into every phase of research, from planning to publication [3]. However, you can take steps to avoid it by carefully designing and implementing your study. Some general tips for avoiding bias:

Always use a random selection method for your sample (e.g. simple random sampling).

Use blinding if applicable.

Control for confounding variables. Confounding variables are unwanted extras in your analysis. For example, if you're studying the effect of activity level on weight gain, then age is a confounding variable (e.g. teens are less likely to put on weight than someone of middle age).

Study the different types of bias to help you identify problem areas.

Ensure that any sources you're using are objective. For example, don't refer to a lung cancer study that has been paid for by a tobacco manufacturer.

Use the Appropriate Test

The "go to" tests (e.g. chi-square, t-test) are well understood and widely implemented. Don't be tempted to run a more obscure test unless you can justify your choice and provide objective references from the literature. That said, there are situations where you might want to run unusual tests.
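Before moving on, Cochran's sample size formula from the tips above can be sketched in a few lines. The confidence level, margin of error, and guessed proportion below are illustrative assumptions, not recommendations:

```python
import math

def cochran_n(z: float, p: float, e: float) -> float:
    """Cochran's sample size formula: n0 = z^2 * p * (1 - p) / e^2."""
    return (z ** 2) * p * (1 - p) / (e ** 2)

# Illustrative assumptions: 95% confidence (z = 1.96), a guessed
# proportion of 50% (the most conservative choice, since p(1-p) is
# maximized at p = 0.5), and a 5% margin of error.
n0 = cochran_n(z=1.96, p=0.5, e=0.05)
print(math.ceil(n0))  # 385, the classic "survey of ~385 people" result

# Cochran's finite population correction, for a small population N:
N = 1000
n = n0 / (1 + (n0 - 1) / N)
print(math.ceil(n))   # 278
```

Note how the correction shrinks the required sample when the whole population is only 1,000, which is why a census becomes practical at that scale.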
Vincent Granville, for example, writes [4]: "I used these tests mostly in the context of experimental mathematics, [where] the theoretical answer to a statistical test is sometimes known, making it a great benchmarking tool to assess the power of these tests, and determine the minimum sample size to make them valid." As an example, instead of the chi-square test for normality, you could split your data into two equal sets {X, Y}, then test whether (X + Y) / SQRT(2) has the same distribution as Z. According to Granville, this works as long as you don't have an infinite theoretical variance.

Some general tips for choosing a test:

Consider the type of variables you have. For example, make sure to account for ordered variables, and be careful that your "independent samples" aren't actually paired or dependent.

Don't dichotomize continuous variables in your analysis.

Don't use parametric methods unless you've verified that your residuals or outcome are normally distributed.

Use two-tailed tests instead of one-tailed tests if at all possible.

Avoid p-values unless you're intimately familiar with their pitfalls. For example, a small p-value isn't always "better" than a bigger one [5]. Use confidence intervals instead.

If you have the statistical knowledge to find and analyze those theoretical answers, obscure tests can be a break from the monotony. But if statistics isn't your forte, then you should probably stick to the usual suspects. Not sure where to start? See my previous post on choosing the right statistical test.

References
[1] Reviewer's quick guide to common statistical errors in scientific papers
[2] Sample size in statistics
[3] Identifying and Avoiding Bias in Research
[4] A Plethora of Original, Not Well-Known Statistical Tests
[5] Common pitfalls in statistical analysis: "P" values, statistical significance and confidence intervals
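As a coda, the splitting test Granville describes can be sketched in code. This is a loose interpretation, not his exact procedure: it assumes the data is centered first (the (X + Y)/sqrt(2) trick preserves a zero-mean normal distribution exactly, but would scale a nonzero mean by sqrt(2)), and it uses SciPy's two-sample Kolmogorov-Smirnov test for the distribution comparison:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
raw = rng.normal(loc=5.0, scale=2.0, size=2000)

# Center the data: for zero-mean normal X and Y, (X + Y)/sqrt(2) has
# the same distribution as the original (sum of normals, rescaled
# back to the original variance).
z = raw - raw.mean()

# Split into two equal halves and form the rescaled sum.
x, y = z[:1000], z[1000:]
combined = (x + y) / np.sqrt(2)

# Compare distributions with a two-sample Kolmogorov-Smirnov test.
# A small p-value suggests the data is not normal; recall Granville's
# caveat that the trick fails for infinite theoretical variance.
stat, p = stats.ks_2samp(combined, z)
print(f"KS statistic = {stat:.3f}, p-value = {p:.3f}")
```

With genuinely non-normal data (e.g. heavy-tailed draws), the same comparison would tend to produce a small p-value, which is the behavior the benchmark exploits.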
