# The Death of the Statistical Tests of Hypotheses

Some foundations of statistical science have been questioned recently, especially the use and abuse of p-values. See also this article published in FiveThirtyEight.com. Statistical tests of hypotheses rely on p-values and other mysterious parameters and concepts that only the initiated can understand: power, type I error, type II error, or UMP tests, just to name a few. Pretty much all of us have had to learn this old stuff (pre-dating the existence of computers) in some college classes.

Sometimes results from a statistical test will be published in a mainstream journal - for instance about whether or not global warming is accelerating - using the same jargon that few understand, and accompanied by misinterpretations and flaws in the use of the test itself. This is especially true when tests are repeated over and over (or the data is adulterated or wrongly collected to start with) until they deliver the answer that we want.

While statistical tests of hypotheses will continue to be used in some circles for a long time (medical research for instance) due to a long tradition and maybe compliance with regulations (clinical trials) that kill innovation, I propose here a different solution that eliminates most of the drawbacks, and is especially suited for black-box or automated testing, where decisions are neither made nor even checked by human beings. An example is automated A/B testing, where tests are run daily to optimize conversions on a website, or automated piloting. My solution - nothing new, indeed - does not come with statistical distributions, and is easy to compute, even SQL-friendly and Excel-friendly. No need to know math, not even basic probabilities, not even random variables, to understand how it works. You only need to understand percentages.

This methodology is robust by design. It is part of a data science framework (see section 2 in this article), in which many statistical procedures have been revisited to make them simple, scalable, accurate enough without aiming for perfection but instead for speed, and usable by engineers, machine learning practitioners, computer scientists, software engineers, AI and IoT experts, big data practitioners, business analysts, lawyers, doctors, journalists, even in some cases by the layman, and even by machines and API's (as in machine-to-machine communications).

1. Statistical tests of hypotheses revisited

The method works better with big and medium-size data than with small data, as it assumes that the data can be binned into a number (at least 100) of random bins, each with enough observations. It works as follows:

Step #1 - Compute a Confidence Interval (CI)

For the parameter you are interested in (say the mean value), compute a confidence interval of the desired level (say 95%) using my elementary data-driven, model-free confidence intervals (you will even find an implementation in Excel when clicking on the link), which rely mostly on sorting and percentiles.
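As a rough illustration of Step #1, here is a minimal Python sketch of one plausible reading of the description above: shuffle the observations into random bins, compute the mean in each bin, sort the bin means, and read the interval off the percentiles. The function names (`percentile`, `binned_ci`), the round-robin binning, and the synthetic temperatures are assumptions made for this example, not the implementation referenced by the link. Note that the resulting interval describes the spread of bin-level means.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def binned_ci(data, n_bins=100, level=95.0, seed=0):
    """Percentile interval of the per-bin means over n_bins random bins."""
    if len(data) < n_bins:
        raise ValueError("need at least one observation per bin")
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    # Deal observations round-robin so bin sizes differ by at most one.
    bins = [shuffled[i::n_bins] for i in range(n_bins)]
    bin_means = sorted(statistics.fmean(b) for b in bins)
    tail = (100.0 - level) / 2.0
    return percentile(bin_means, tail), percentile(bin_means, 100.0 - tail)


# Synthetic, made-up temperature readings (not real Seattle data).
rng = random.Random(42)
temps = [rng.gauss(52.0, 3.0) for _ in range(20000)]
low, high = binned_ci(temps)
print(f"95% interval for bin means: [{low:.2f}, {high:.2f}]")
```

Everything here is sorting, averaging, and picking positions in a sorted list, which is why the same steps translate readily to SQL or a spreadsheet.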

Step #2 - Is your tested value inside or outside the CI?

Let me illustrate this with a simple example. You want to see if the average temperature in Seattle, for years 2011-2015, is above the average for the years 2006-2010, in a way that is statistically significant. The mean temperature computed during 2006-2010 is (say) 51.95°F. The mean temperature computed during 2011-2015 is (say) 52.27°F. The 95% confidence interval (CI) for 2006-2010 is (say) [51.80°F, 52.20°F]. Now since 52.27°F is outside the CI, indeed above the upper bound 52.20°F, you can say there is at most a 2.5% chance (the whole CI covers 95% of the possibilities, leaving 2.5% above the upper bound and 2.5% below the lower bound) that a measurement as high as 52.27°F occurred just by chance. In short, it is unlikely to have occurred by chance, thus the temperature increase is likely real. It is not very statistically significant though; it would have been if 52.27°F had also fallen outside a CI computed at the 99.9% level, rather than 95% - but this requires collecting more data - maybe more than what is available.
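The decision in Step #2 is a plain comparison, which can be sketched in a few lines of Python using the made-up Seattle numbers from the example. The function name `compare_to_ci` is hypothetical and chosen only for this illustration.

```python
def compare_to_ci(value, ci_low, ci_high):
    """Return 'below', 'inside', or 'above' relative to the closed interval."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if value < ci_low:
        return "below"
    if value > ci_high:
        return "above"
    return "inside"


# 2011-2015 mean versus the 95% CI computed for 2006-2010 (illustrative values).
verdict = compare_to_ci(52.27, 51.80, 52.20)
print(verdict)  # above
```

A result of "inside" would mean the observed difference might be explained by chance at the chosen level.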

Remarks

This CI concept is very simple: I did not even have to introduce the concept of one-sided versus two-sided tests. It is something you can easily explain to your boss or to a client.

Evidently, there are better ways to assess if the temperatures are rising. Time series analysis to detect a trend after eliminating periodicity or unusual events (El Niño), if done properly, is the solution. This made-up example was provided for illustration purposes only. It also assumes that you have hundreds of data points: calibrated measurements from various weather stations at various times, that are somewhat consistent over time and locales.

Even if you want to stick with a test of hypotheses, a better solution is to compute the temperature differences (called temperature deltas) observed at all locations, 2006-2010 versus 2011-2015, and use a CI for the deltas. You would then check whether the value 0 falls in your CI for the delta. If yes, the difference in temperatures might be explained by luck; if no, there is probably a real change between the two time periods. By proceeding this way, you take into account local temperature variations not only for the 2006-2010 time period, but also for 2011-2015.
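The delta-based check can be sketched as follows, assuming one paired reading per location for each period. The names `delta_ci` and `zero_is_plausible`, the nearest-rank percentile cut, and the synthetic data are all assumptions made for this illustration; they are one simple way to apply the binning idea to deltas, not a definitive implementation.

```python
import random
import statistics


def delta_ci(before, after, n_bins=100, level=95.0, seed=0):
    """Percentile interval for the mean per-location delta (after - before)."""
    if len(before) != len(after):
        raise ValueError("before and after must be paired by location")
    deltas = [a - b for b, a in zip(before, after)]
    if len(deltas) < n_bins:
        raise ValueError("need at least one delta per bin")
    random.Random(seed).shuffle(deltas)
    bin_means = sorted(statistics.fmean(deltas[i::n_bins]) for i in range(n_bins))
    # Nearest-rank cut: drop an equal share of bin means from each tail.
    drop = int(n_bins * (100.0 - level) / 200.0)
    return bin_means[drop], bin_means[n_bins - 1 - drop]


def zero_is_plausible(ci):
    """True when 0 lies inside the interval, i.e. the change may be luck."""
    low, high = ci
    return low <= 0.0 <= high


# Synthetic paired readings: a made-up 0.5°F shift plus local noise.
rng = random.Random(7)
before = [rng.gauss(52.0, 3.0) for _ in range(5000)]
after = [b + 0.5 + rng.gauss(0.0, 0.5) for b in before]
ci = delta_ci(before, after)
print(ci, zero_is_plausible(ci))
```

Working on paired deltas removes the location-to-location variation that both periods share, so the interval reflects only how much the change itself varies.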

Finally, you can use this methodology to run a traditional test of hypotheses: it will yield very similar results. But there's no math in it besides simple high school algebra. And this brings an interesting idea: my methodology can be taught to high school students, to get them interested in data science, without any pain.

Determining the optimum CI level

Typically the level of the underlying CI (95% in the above example) is decided beforehand. However, if the tests are part of a process to optimize some yield metrics, for instance when optimizing conversions on websites day after day, you can choose the level, within a specific range (say 65% to 99.9%), that produces the best performance. This is done by splitting your data into multiple independent buckets, using a different level for each bucket, and choosing which level works best overall, after following results for some time. The level itself could be one of the parameters in your A/B testing.
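A minimal sketch of this selection loop is shown below, assuming each bucket is assigned one candidate level and reports an observed yield (for example a conversion rate) over time. The function names `assign_levels` and `pick_best_level`, the round-robin assignment, and the yield figures are hypothetical and only meant to make the idea concrete.

```python
import statistics


def assign_levels(bucket_ids, levels):
    """Assign candidate CI levels to buckets in round-robin order."""
    if not levels:
        raise ValueError("need at least one candidate level")
    return {bucket: levels[i % len(levels)] for i, bucket in enumerate(bucket_ids)}


def pick_best_level(yields_by_level):
    """Return the level with the highest average observed yield, plus all averages."""
    averages = {
        level: statistics.fmean(values)
        for level, values in yields_by_level.items()
        if values
    }
    if not averages:
        raise ValueError("no observed yields to compare")
    best = max(averages, key=averages.get)
    return best, averages


# Made-up conversion rates observed in buckets run at each CI level.
observed = {
    65.0: [0.031, 0.029],
    90.0: [0.034, 0.036],
    99.9: [0.030, 0.031],
}
best, averages = pick_best_level(observed)
print(best)  # 90.0
```

In practice the comparison between levels is itself noisy, so results should be followed for some time before settling on a level.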

2. The new statistical framework

Over the years, I have designed a new, unified statistical framework for big data, data science, machine learning, and related disciplines. The tests of hypotheses described above fit in this framework. So far, here are the components of this framework. In parentheses, you will find the equivalent in traditional (Bayesian or not) statistical science. Some of these techniques may be included in an upcoming course on automated data science, or added to our data science apprenticeship (for self-learners).

I have also written quite a bit on time series (detection of accidental high correlations in big data, change point detection, multiple periodicities), correlation and causation, clustering for big data, random numbers, simulation, ridge regression (approximate solutions) and synthetic metrics (new variances, bumpiness coefficient, robust correlation metric and robust R-squared non sensitive to outliers.) I also explained how to make video from data (using R), even sound files. My next related article will be Variance, Clustering, and Density Estimation Revisited.

All of this will be available in my upcoming book, Data Science 2.0.

Conclusion

A book such as Handbook of Parametric and Nonparametric Statistical Procedures lists dozens of statistical tests of hypotheses. Likewise, dozens of regression techniques are available. Which one to choose? It is as if statistical science has become a field with a bunch of scattered methods, just like accounting, which has hundreds of rules, yet most of them won't save you much money: they are designed to preserve job security for the practitioners. Even experts sometimes get confused, and this artificial complexity easily leads to abuse and misuse.

As a former statistician confronted with this issue - working mostly with business people and on big data - I decided to simplify much of the theory, and make this field more unified and approachable. Also, in designing this new framework, an important factor was to offer solutions that are easy to automate, understand, scale, and interpret, including by the people who either use or pay for the statistical services.

About the author: Vincent Granville worked for Visa, eBay, Microsoft, Wells Fargo, NBC, a few startups and various organizations, to optimize business problems, boost ROI or to develop ROI attribution models, developing new techniques and systems to leverage modern big data and deliver added value. Vincent owns several patents, published in top scientific journals, raised VC funding, and founded a few startups. Vincent also manages his own self-funded research lab, focusing on simplifying, unifying, modernizing, automating, scaling, and dramatically optimizing statistical techniques. Vincent's focus is on producing robust, automatable tools, API's and algorithms that can be used and understood by the layman, and at the same time adapted to modern big, fast-flowing, unstructured data. Vincent is a post-graduate from Cambridge University.

Comment

Comment by Christian Cantos on August 16, 2017 at 6:31am

Hi, it's late, sorry

For some reason I hit the problem of p-value aka significance in Statistical testing. I wrote this on my blog.

Yet another contribution to the P-value discussion : Probabilities are maths, not logic.

Please would you kindly have a look at

https://chcantos.blogspot.fr/

Thanks

Comment by Christie Haskell on July 7, 2016 at 11:43am

Another simple approach that I'm using to communicate the results of A/B tests to Designers with no statistics background is to calculate the Cohen's d between variants and then convert this to the probability that the winning variant is the superior variant. If the data isn't normally distributed I'm using a Box-Cox transformation first. Probability is easy for people to understand and it has thus far simplified communicating the results of A/B tests.

Comment by Michael Clayton on May 24, 2016 at 6:10pm

Thanks...again. The key seems to be visualization first, then structured statements that meet legal requirements in court appearances, and academic communications standards. For example, I was once told by a legal eagle that any statistical statement should start with "It seems to me" and then "that there is sufficient statistical evidence to assume" and then "that A is > B with 99% certainty." I am sure it's worse nowadays, but GRAPHICS makes all the difference.

Comment by Robert Lemay on March 17, 2016 at 10:45am

Hello Vincent. Your article is 100% top level and you are 100% right. The problem is not only with traditional statisticians but with traditional mathematicians. The difficult point is that a mathematician is only interested in talking to another mathematician. Explaining a simple "thumb" rule to a non-specialist is losing time for him. But on the other side, we utterly need them...we need them a week per year. The 51 weeks left, we need a mathematician that builds "thumb" rules from complicated theories. But in my 28 years on the job, I never met such a profile. The next Fields Medal should be given to such work. Another point of this problem is that a Fortune 500 company would never pay a mathematician for "thumb" rules production...

Myself, I am not a mathematician nor a statistician but a simple engineer using mathematics as a tool. So, not finding any such profile, I built myself a very simple toolbox (but very useful) using "truth tables", Venn diagrams (up to 8 fields) and basic linear algebra. My mathematician friends (believe me, I have some!) call it "the calculation cooking book".

As a consultant I work in Service Desk, user/Customer relations and IT resource management and I am using numbers, relations, categorisation, proof... every day. Where I lose ground is where I have 3 or 4 simple relations about one event and need to transform them into one unique relation. If you have ideas about this, I am ready to talk. Regards. Robert

Comment by Vincent Granville on March 11, 2016 at 11:10am

Dalila, this is not point estimation, unless you consider computing the lower and upper bounds of a confidence interval to be point estimation - in which case every number is a point estimation - including the power of a test, or the type I error.

Comment by Dalila Benachenhou on March 11, 2016 at 5:36am

By the way, when analyzing a confusion matrix, you may need to talk about Type I and Type II errors, which are called False Positive and False Negative by data scientists. By the way, False Negative is type II error, and False Positive is type I error.

Comment by Dalila Benachenhou on March 11, 2016 at 5:14am

There are 2 categories for parameters inference: estimation, and statistical test.  What you presented in "Statistical tests of hypotheses revisited" is called point estimation.

Comment by akindaini bolarinwa on March 10, 2016 at 8:58am

Very insightful, enough of this pessimistic approach to hypothesis testing called 'P-value'. This term is only understood by academicians and hard to explain to professionals who actually use the result of data analysis.

Comment by Haardik Sharma on March 8, 2016 at 9:44pm

This was indeed a much-required change from the ASA. These changes should have been made almost a decade ago; still, it is better late than never. p-values are often misleading, especially when we are working with huge datasets. With the onset of Big Data Analytics, there is actually no need to worry about analyzing a sample when we have the computing power to analyze the entire population! Great job by Dr. Vincent in proposing this alternative strategy/framework to fill the tech gap.

p-values are at best underinformative and often misleading