This is becoming a bigger issue every month: authors publish articles about some statistical technique on data science blogs, even though these techniques not only work poorly in many contexts, but also cannot be understood or interpreted by the layman or by your client. It makes data science look bad.
In this case, it's about classical hypothesis tests that implicitly (and unfortunately) assume the underlying distribution is normal. The drawbacks are as follows:
Roger, I am in a nonparametric statistics class as I write this. We are going over tests to see if our sample matches a particular distribution. Sometimes this distribution is normal, but there are many others as well. We study a number of distributions and hypothesis testing for all of them.
Have to get back to class.
Louis, I enjoy many of your posts, but with regard to 1, one of the things the Central Limit Theorem (the basis for most classical hypothesis testing) says is that given ANY data distribution, the sampling distribution of the mean will be approximately normal. Hypothesis testing isn't performed against the original data distribution, but rather against the sampling distribution, and the normality of that distribution (or at least its symmetry) is very good above a sample size of 30-40, which I think we would all agree is quite low by "big data" standards. I think this may be the same point Mr. Polymov was making. So if your initial statement is off-target, I'm even more concerned about statement 7, since you seem to be illustrating the point opposite to the one you state, which is to say, the dangers of not having a fundamental understanding of statistics.
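The commenter's point can be checked with a minimal simulation (my own sketch, not from the thread): draw from a heavily skewed exponential population, then look at the distribution of sample means at sample size 40. The skew largely washes out, and the sample means sit right on the population mean.

```python
import random
import statistics

random.seed(42)

# A clearly non-normal (right-skewed exponential) population.
population = [random.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.mean(population)  # close to 1.0 for this distribution

# Sampling distribution: means of 2,000 samples of size 40 each.
sample_means = [
    statistics.mean(random.sample(population, 40)) for _ in range(2_000)
]
mean_of_means = statistics.mean(sample_means)

# The raw population is skewed (median well below mean), but the sample
# means are nearly symmetric: their median sits close to their mean.
print(round(pop_mean, 3), round(mean_of_means, 3))
print(round(statistics.median(population), 3),
      round(statistics.median(sample_means), 3))
```

Note that this shows the sampling distribution of the *mean* becoming symmetric; the population itself stays just as skewed as before, which is exactly why the test is run against the sampling distribution.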
Louis, I wrote many articles on advertising optimization. I also recommend a simple method: create a taxonomy for your ads or landing pages (if you are a publisher like Google), and a taxonomy for all the web pages where an ad could potentially be displayed. Show an ad belonging to category x on a page belonging to category y, with x = y, or at least x close to y. The creation of a taxonomy can be done easily using what I describe as indexation or tagging algorithms. These are rudimentary algorithms that can nevertheless perform clustering on massive amounts of unstructured text data: assigning a category to a page is the clustering step, and you can cluster n pages in O(n) time, that is, incredibly fast. The core element of this technique is to produce (semi-manually) a list of (say) 500 categories, with dozens or hundreds of keywords attached to each category.
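A minimal sketch of such a tagging algorithm, with invented category names and keyword lists (the real system would use the ~500 semi-manually curated categories described above): each page is scored against every category's keyword set, and tagging a page costs time proportional to its length, so n pages take O(n) overall for a fixed taxonomy.

```python
# Hypothetical taxonomy: categories and keywords are illustrative only.
CATEGORY_KEYWORDS = {
    "travel": {"flight", "hotel", "vacation", "airline"},
    "finance": {"loan", "credit", "mortgage", "invest"},
    "sports": {"football", "score", "league", "match"},
}

def tag_page(text):
    """Assign the category whose keyword list best matches the page text.

    This is the 'clustering step': every page that lands in the same
    category forms one cluster, with no pairwise page comparisons needed.
    """
    words = set(text.lower().split())
    best_category, best_hits = None, 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_category, best_hits = category, hits
    return best_category

print(tag_page("Book a cheap flight and hotel for your vacation"))
```

An ad tagged "travel" would then be shown only on pages tagged "travel" (x = y), or on a nearby category if the taxonomy defines a similarity between categories.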
The poor performance (non-relevant ads) that we all see can be explained by the following:
1. Advertiser pays very little per click, does not care about targeting
2. Too many ads, not enough inventory, publisher wants to make money and over-deliver
So, you are saying that all statistics can be reduced to this single concept and that it will work for everything. The fact that a concept is complex or difficult does not mean it is not correct or important. Quantum mechanics is incredibly complex and difficult, but it works. The goal of science is not to produce concepts that people can understand simply. The goal is to find answers and actually manipulate the world.
We still have a long way to go in many areas where data science is being applied to existing activities. A good example is the use of data science in marketing on the Internet. The recommendation engines from different vendors (including Amazon and Netflix) are woefully inadequate and inaccurate. I find this myself, hear it from other users, and even from people who advise companies on this. A related activity is ad placement. This is so inaccurate that it approaches print advertising in its level of targeting. The old adage is that 50% of advertising is wasted; one just doesn't know which 50%. Data science was supposed to change that and make Internet advertising more targeted. Not so. And what is the difference? Data science was supposed to provide the answers so that ads could be aimed like lasers. Are they using this technique?
Point 4: yes, I know what a p-value means... It seems that probabilistic concepts are hard to grasp for many people, as D. Kahneman has eloquently shown in his best-seller "Thinking, Fast and Slow".
Point 7: I think that removing probability theory and classical statistics would mean removing the word 'science' from 'data science'. Perhaps this is the point: does 'data science' have anything to do with science?
Point 10: non-parametric tests are more robust, but they have less power.
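The robustness-vs-power trade-off in point 10 can be seen in a quick pure-Python simulation (my own sketch, with arbitrary parameters): on normally distributed data with a true mean shift of half a standard deviation, a one-sample t-test rejects the false null more often than a sign test at the same nominal 5% level.

```python
import math
import random
import statistics

random.seed(7)

N, TRIALS, SHIFT = 30, 1000, 0.5
T_CRIT = 2.045  # two-sided 5% critical value for Student's t, 29 df

def t_rejects(sample):
    # One-sample t-test of the null hypothesis mean = 0.
    t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(len(sample)))
    return abs(t) > T_CRIT

def sign_rejects(sample):
    # Two-sided exact sign test of the null hypothesis median = 0.
    pos = sum(x > 0 for x in sample)
    k = min(pos, len(sample) - pos)
    tail = sum(math.comb(len(sample), i) for i in range(k + 1)) / 2 ** len(sample)
    return 2 * tail <= 0.05

# Normal samples whose true mean is shifted by 0.5 standard deviations.
samples = [[random.gauss(SHIFT, 1.0) for _ in range(N)] for _ in range(TRIALS)]
t_power = sum(t_rejects(s) for s in samples) / TRIALS
sign_power = sum(sign_rejects(s) for s in samples) / TRIALS
print(t_power, sign_power)  # the t-test detects the shift more often
```

On heavy-tailed or contaminated data the ordering can reverse, which is the "more robust" half of the trade-off.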
I agree completely. All these assumption-infested approaches that many "scientists" evangelize don't hold water in the real world. An approach like the one you described is much more robust and sustainable. Still, I find that some statistical tests are not so bad (e.g. the non-parametric ones).
Point 9 is so true.
Excellent read!
Regarding point 7, I do think it is important to teach random variables, probability theory etc. in data science classes as they are the basis and grounding of many statistical models and aid in interpretation (and debugging!) in my opinion.