Example of Bad Data Science: Test of Hypothesis

This is becoming a bigger issue every month: authors publish articles about some statistical technique in a data science blog, even though the technique not only fails in many contexts but also cannot be understood or interpreted by the layman or your client. It makes data science look bad.

In this case, it's about classical tests of hypotheses, which implicitly (and unfortunately) assume that the underlying distribution is normal. The drawbacks are as follows:

  1. I have rarely seen a normal (Gaussian) distribution in practice, and even after transformation, most distributions associated with modern problems are not normal. 
  2. This test is not robust; a few outliers will easily invalidate your conclusions.
  3. This test is subject to p-hacking, a technique consisting of replicating your test dozens of times until it provides the conclusion that you like.
  4. This test relies on p-values, an arcane concept that nobody but an elite club of initiated professionals (charging a lot of money) understands. The term p-hacking comes from abusing p-values to lie with statistics. Remember: there are lies, damn lies, and statistics (and Amazon reviews). Frankly, do you know what a p-value means?
  5. It is very hard for the average person to understand these concepts. We have developed a math-free, stats-free methodology to perform a test of hypothesis: basically, you compute a (model-free, data-driven) confidence interval, and if the parameter that you measure is outside the bounds of your confidence interval, your assumption must be rejected. It can easily be performed even in Excel, as shown in my article. This framework is much easier to understand, even for the non-initiated, and in addition my confidence intervals are robust and distribution-free, unlike the standard version.
  6. My version of "hypothesis testing" is easy to implement even in SQL. It is also universal, in the sense that it applies to any kind of data, even data with outliers, data that is not well-behaved, or data with a special, unusual distribution.
  7. Teaching the classic statistical version of this test is just like teaching assembly language in a programming class: this stuff should be automated and used in contexts where it works. There is no need to teach this material in data science classes; it is a waste of time, especially since most textbooks spend about 100 pages on prerequisites (random variables, probability theory, and so on) before the concept can be introduced.
  8. My approach is bottom-up (from data to modeling), that is, applied; the traditional test is top-down (from modeling to data), that is, theoretical.
  9. The only advantage of the classical test is that it has been published thousands of times in textbooks for over 150 years, making it some kind of (bad) standard. It was invented well before computers existed, at a time when mathematical elegance would prevail over lengthy computations. But tradition guarantees neither efficiency nor robustness.
  10. Because there are so many theoretical statistical distributions and so many ways to test hypotheses, classical statistics offers more than 100 different types of tests (just like it offers dozens of confusing regression techniques): Normal test (the one criticized here), Student test, F-test, Chi-Square for independence, Chi-Square for model fitting, Kolmogorov-Smirnov, Wilcoxon, just to name a few popular ones. Each one can be one-sided or two-sided, and when testing multiple parameters, it gets even more complicated. Some require numerical algorithms to find the critical values. By contrast, my approach, being distribution-independent, offers one simple universal test. And because it is based on confidence intervals, you don't even need to know what one-sided or two-sided means. And jargon such as Type I or Type II error is replaced by plain English words: false positives and false negatives.
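
To make points 5 and 6 concrete, here is a minimal sketch of a test based on a data-driven confidence interval. It is not the method from the article mentioned in point 5, which is not reproduced here; it swaps in a percentile bootstrap, one common distribution-free way to build such an interval. The function names, the sample data, and the defaults (median, 95% level, 2,000 resamples) are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.median, level=0.95,
                 n_resamples=2000, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Recompute the statistic on many resamples drawn with replacement.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lo_idx = max(0, int(tail * n_resamples))
    hi_idx = min(n_resamples - 1, int((1.0 - tail) * n_resamples))
    return estimates[lo_idx], estimates[hi_idx]


def reject(data, hypothesized, **kwargs):
    """Reject the hypothesized value if it lies outside the interval."""
    low, high = bootstrap_ci(data, **kwargs)
    return not (low <= hypothesized <= high)


# Hypothetical measurements with one gross outlier (250.0).
sample = [9.8, 10.1, 10.4, 9.9, 10.2, 10.0, 10.3, 9.7, 10.1, 250.0]

low, high = bootstrap_ci(sample)
print("95% interval for the median:", (low, high))
print("reject median = 10.0?", reject(sample, 10.0))
print("reject median = 12.0?", reject(sample, 12.0))
```

Using the median means the single outlier barely moves the interval; with the mean as the statistic, the same data would give a far wider interval.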





Comment by Isabella Ghement on October 27, 2016 at 4:53am
This article tries to discredit hypothesis testing by invoking a universal panacea - a confidence interval applicable to a very specific situation. Can the author be transparent about the limited settings where this interval will work (e.g., where you have data collected for a single variable)?

Among other things, the article fails to mention the equivalency between hypothesis testing and confidence intervals (if both are used to test statistical hypotheses) and the fact that confidence intervals can also be one-sided or two-sided.
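
The equivalence mentioned here can be checked numerically. The sketch below assumes the simplest textbook setting, a two-sided z-test for a mean with known standard deviation; the data, the value of `sigma`, and the function names are illustrative assumptions. For every hypothesized mean tried, the test rejects at the 5% level exactly when that value falls outside the 95% interval.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_p_value(data, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, with sigma assumed known."""
    z = (mean(data) - mu0) / (sigma / sqrt(len(data)))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def z_interval(data, sigma, level=0.95):
    """Two-sided confidence interval for the mean, sigma assumed known."""
    half_width = NormalDist().inv_cdf(0.5 + level / 2.0) * sigma / sqrt(len(data))
    center = mean(data)
    return center - half_width, center + half_width


# Hypothetical measurements; sigma = 0.5 is an assumed known value.
data = [10.2, 9.6, 10.9, 10.4, 9.9, 10.7, 10.1, 10.5]
sigma = 0.5
low, high = z_interval(data, sigma)

# Compare the two decisions for a range of hypothesized means.
results = []
for mu0 in [9.5, 9.9, 10.0, 10.3, 10.6, 10.7, 11.0]:
    rejected_by_test = z_test_p_value(data, mu0, sigma) < 0.05
    outside_interval = not (low <= mu0 <= high)
    results.append((mu0, rejected_by_test, outside_interval))
    print(mu0, rejected_by_test, outside_interval)
```

A one-sided version works the same way: a one-sided interval at level 95% matches a one-sided test at the 5% level.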

The fact that statistical concepts involved in hypothesis testing are misapplied or misunderstood does not make these concepts less valuable/useful. It also does not make statistics as a discipline less valuable/useful.

I can guarantee that there will be people out there who will be equally confused about the intricacies and nuances of the confidence interval touted in this article - does that mean that the interval itself is without merit?

I would find the article more credible if the author toned down the "everything I do or propose is better than what the entire field of statistics has been doing for hundreds of years" flavour. There are better ways to get one's point across, which don't rely on putting down others and their work.

Comment by Louis Giokas on November 17, 2015 at 5:12pm

Roger, I am in a nonparametric statistics class as I write this. We are going over tests to see if our sample matches a particular distribution. Sometimes this distribution is normal, but there are many others as well. We study a number of distributions and hypothesis testing for all.

Have to get back to class.

Comment by Vincent Granville on November 15, 2015 at 9:16pm

Louis, I wrote many articles on advertising optimization. I also recommend a simple method: create a taxonomy for your ads or landing pages (if you are a publisher like Google), and a taxonomy for all the web pages where an ad could potentially be displayed. Show an ad belonging to category x on a page belonging to category y, with x = y, or at least x close to y. The creation of a taxonomy can be done easily using what I describe as indexation or tagging algorithms. These are rudimentary algorithms that nevertheless can perform clustering on massive amounts of unstructured text data: assigning a category to a page is the clustering step, and you could cluster n pages in O(n) time, that is, incredibly fast. The core element of this technique is to produce (semi-manually) a list of (say) 500 categories, with, for each category, dozens or hundreds of keywords attached to it.
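
A rough sketch of the tagging step described above, assuming a tiny hypothetical taxonomy (three categories with a handful of keywords each, instead of about 500 categories with dozens or hundreds of keywords). Each page is scanned once and each word is looked up in a hash map, so for a fixed taxonomy the cost grows linearly with the number of pages. The names `TAXONOMY` and `tag_page` and the sample pages are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical taxonomy: category -> keywords attached to it.
TAXONOMY = {
    "travel": {"flight", "flights", "hotel", "vacation", "beach", "airline"},
    "finance": {"stock", "stocks", "loan", "mortgage", "bank", "interest"},
    "sports": {"football", "soccer", "league", "goal", "tennis", "match"},
}

# Inverted index: keyword -> category, so each word costs one lookup.
KEYWORD_TO_CATEGORY = {
    keyword: category
    for category, keywords in TAXONOMY.items()
    for keyword in keywords
}


def tag_page(text):
    """Return the category with the most keyword hits, or None if no hits."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = Counter(
        KEYWORD_TO_CATEGORY[word] for word in words if word in KEYWORD_TO_CATEGORY
    )
    if not hits:
        return None
    return hits.most_common(1)[0][0]


pages = [
    "Cheap flights and hotel deals for your next beach vacation",
    "Bank raises mortgage interest rates as stocks fall",
    "Nothing relevant here",
]
tags = [tag_page(page) for page in pages]
print(tags)
```

Matching an ad to a page then amounts to comparing the ad's category with the page's category.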

The poor performance (irrelevant ads) that we all see can be explained by the following:

1. The advertiser pays very little per click and does not care about targeting

2. Too many ads and not enough inventory: the publisher wants to make money and over-delivers

Comment by Louis Giokas on November 15, 2015 at 7:19am

So, you are saying that all statistics can be reduced down to this single concept and that it will work for everything.  Because a concept is complex or difficult does not mean it is not correct or important.  Quantum mechanics is incredibly complex and difficult, but it works.  The goal of science is not to produce concepts that people can understand simply.  The goal is to find answers and actually manipulate the world. 

We still have a long way to go in many areas of data science that are being applied to existing activities. One good one is the use of data science in marketing on the Internet. The recommendation engines from different vendors (including Amazon, Netflix) are woefully inadequate and inaccurate. I find this myself, hear it from other users and even from people who advise companies on this. A related activity is ad placement. This is so inaccurate that it approaches print advertising in its level of targeting. The old adage is that 50% of advertising is wasted, one just doesn't know which 50%. Data science was supposed to change that and make Internet advertising more targeted. Not so. And what is the difference? Data science was supposed to provide the answers so that ads could be aimed like lasers. Are they using this technique?

Comment by fabio mainardi on November 15, 2015 at 7:07am

Point 4: yes, I know what a p-value means... It seems that probabilistic concepts are hard to grasp for many people, as D. Kahneman has eloquently shown in his best-seller "Thinking, Fast and Slow".

Point 7: I think that removing probability theory and classical statistics would mean removing the word 'science' from 'data science'. Perhaps this is the point: has 'data science' anything to do with science?

Point 10: non-parametric tests are more robust, but they have less power. 

Comment by Dr. Z on November 13, 2015 at 12:24pm

I agree completely. All these assumption-infested approaches that many "scientists" evangelize don't hold water when it comes to the real world. An approach like the one you described is much more robust and sustainable. Still, I find that some stats tests are not so bad (i.e. the non-parametric ones).

Comment by Homer Roderick on November 13, 2015 at 5:36am

Point 9 is so true.

Comment by David Cleere on November 13, 2015 at 1:38am

Excellent read!

Regarding point 7, I do think it is important to teach random variables, probability theory etc. in data science classes as they are the basis and grounding of many statistical models and aid in interpretation (and debugging!) in my opinion.

Comment by Douglas McLean on November 12, 2015 at 11:45am
I understand what you are saying, but p-values are perfectly fine when used in the right context. That is, when a specific question is being answered with a well-designed experiment. Think of A/B testing or clinical trials, where we need Type I and Type II errors (1 minus the Type II error rate is the power, which helps define sample sizes). Of course p-values are nonsense in the context of machine learning: who cares if the coefficient of the term x1^2:x2:x3 is significant or not when the BIC is reduced upon its addition to a multiple polynomial linear predictor that already has 50 terms?! Multiple comparisons negate their use completely! Context matters for hypothesis testing.
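
To put a number on the multiple-comparisons point (and on the p-hacking in point 3 of the article): assuming m independent tests of true null hypotheses, each run at level alpha, the chance of at least one false positive is 1 - (1 - alpha)^m. The sketch below computes that and checks it by simulation, using the fact that p-values are uniform on (0, 1) under a true null when the test statistic is continuous. The function names and the choice of 20 tests are illustrative assumptions; the Bonferroni threshold alpha / m is shown as one standard correction.

```python
import random


def family_wise_error_rate(alpha, m):
    """P(at least one false positive) over m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def simulated_fwer(alpha, m, n_experiments=20000, seed=0):
    """Monte Carlo estimate: null p-values are drawn as Uniform(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # One "experiment" = m tests; count it if any test looks significant.
        if any(rng.random() < alpha for _ in range(m)):
            hits += 1
    return hits / n_experiments


alpha, m = 0.05, 20
exact = family_wise_error_rate(alpha, m)
simulated = simulated_fwer(alpha, m)
corrected = simulated_fwer(alpha / m, m)  # Bonferroni: test each at alpha / m

print("exact FWER:", round(exact, 3))  # 0.642
print("simulated FWER:", simulated)
print("simulated FWER with Bonferroni:", corrected)
```

In other words, repeating a test 20 times on pure noise produces a "significant" result about two times out of three, unless the threshold is corrected.
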
Comment by Valerii Podymov on November 12, 2015 at 9:16am

Thanks for the article!

I would add that

1a. Many people tend to misinterpret the Central Limit Theorem when they assume normality :)
