Anyone who has taken college-level statistics has been taught four rules to always abide by when developing statistical models and predictions:
As a data scientist and ex-statistician, I violate these rules (especially #1 - #3) almost daily. Indeed, that's part of what makes data science different from statistical science.
The reasons for violating these rules are:
Some theoretical research is still needed on the maximum yield obtainable with non-kosher estimates.
Performance of some non-kosher estimates
This article compares model-free confidence intervals with classic ones. The difference is very small even when the number of observations is as low as 50. In this case, rule #4 is not violated.
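To give a feel for this kind of comparison, here is a minimal sketch. It is not the method from the article referenced above: as a stand-in for a "model-free" interval it uses a percentile bootstrap, and the simulated Gaussian sample, the seed, and the function names `classic_ci` and `bootstrap_ci` are all assumptions made for illustration.

```python
import random
import statistics


def classic_ci(sample, z=1.96):
    """Normal-theory 95% confidence interval for the mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / n ** 0.5
    return mean - z * std_err, mean + z * std_err


def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, rng=None):
    """Percentile bootstrap interval for the mean (no distributional model)."""
    rng = rng or random.Random(0)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical data: 50 observations, matching the sample size discussed above.
rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(50)]

classic = classic_ci(sample)
model_free = bootstrap_ci(sample, rng=rng)

print("classic   :", classic)
print("bootstrap :", model_free)
```

Printing both intervals side by side lets you judge how far apart they are on your own data; the sketch makes no claim about how close they will be in general.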
In my article on Jackknife regression, you can check that approximate, biased, but robust and easy-to-interpret parameter estimates yield pretty much the same results as classic estimators, even though they violate all four rules.
Computations comparing the unbiased with the bias-corrected version of the traditional variance estimate show very little difference, even for small samples of 100 observations.
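A quick way to see the size of the gap is sketched below. It assumes the two versions being compared are the sample variance with divisor n and the one with divisor n - 1; the simulated data and the seed are illustrative only. With those two divisors, the relative gap is exactly 1/n, which is 1% for 100 observations.

```python
import random
import statistics

# Hypothetical sample of 100 observations (the size mentioned above).
rng = random.Random(7)
n = 100
sample = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Assumption: the two estimates differ only in their divisor.
var_n = statistics.pvariance(sample)         # divisor n
var_n_minus_1 = statistics.variance(sample)  # divisor n - 1

# Relative gap between the two estimates; algebraically equal to 1 / n.
rel_diff = (var_n_minus_1 - var_n) / var_n_minus_1

print("divisor n     :", var_n)
print("divisor n - 1 :", var_n_minus_1)
print("relative gap  :", rel_diff)
```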
Finally, in the context of local density estimation or clustering based on nearest neighbors, I used a robust estimate that combines several nearest neighbors, rather than relying on the k-th nearest neighbor (k-NN) alone. My estimate does not achieve minimum variance (rule #2), though it comes close, and it is far more robust: I introduced noise into the data to see how my estimate and the classic one each react. The details can be found here.
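The exact combination used is not spelled out here, so the following is only a sketch under a stated assumption: the "combination of several nearest neighbors" is taken to be the plain average of the distances to the 1st through k-th neighbors. The one-dimensional data, the noise level, and the function names are hypothetical.

```python
import random
import statistics


def sorted_neighbor_distances(x, points):
    """Distances from x to every point, smallest first (1-D for simplicity)."""
    return sorted(abs(x - p) for p in points)


def kth_nn_distance(x, points, k):
    """Classic estimate: distance to the k-th nearest neighbor only."""
    return sorted_neighbor_distances(x, points)[k - 1]


def averaged_nn_distance(x, points, k):
    """Assumed combination: mean distance to the k nearest neighbors."""
    nearest = sorted_neighbor_distances(x, points)[:k]
    return sum(nearest) / k


def spread_under_noise(estimator, x, points, k, noise_sd, trials, rng):
    """Standard deviation of an estimate when the data is repeatedly perturbed."""
    values = []
    for _ in range(trials):
        noisy = [p + rng.gauss(0.0, noise_sd) for p in points]
        values.append(estimator(x, noisy, k))
    return statistics.stdev(values)


rng = random.Random(1)
points = [rng.gauss(0.0, 1.0) for _ in range(200)]
query, k = 0.0, 10

d_kth = kth_nn_distance(query, points, k)
d_avg = averaged_nn_distance(query, points, k)
spread_kth = spread_under_noise(kth_nn_distance, query, points, k, 0.05, 200, rng)
spread_avg = spread_under_noise(averaged_nn_distance, query, points, k, 0.05, 200, rng)

print("k-th NN distance :", d_kth, "spread under noise:", spread_kth)
print("averaged distance:", d_avg, "spread under noise:", spread_avg)
```

The two spreads give a rough sense of how each estimate reacts to injected noise. Which one is steadier depends on the data and on the combination chosen, so treat the output as exploratory.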
About the Author
Vincent Granville worked for Visa, eBay, Microsoft, Wells Fargo, NBC, a few startups and various organizations, to solve business optimization problems, boost ROI, and develop ROI attribution models, developing new techniques and systems to leverage modern big data and deliver added value. Vincent owns several patents, has published in top scientific journals, raised VC funding, and founded a few startups. Vincent also manages his own self-funded research lab, focusing on simplifying, unifying, modernizing, automating, scaling, and dramatically optimizing statistical techniques. Vincent's focus is on producing robust, automatable tools, APIs and algorithms that can be used and understood by the layman, and at the same time adapted to modern big, fast-flowing, unstructured data. Vincent is a post-graduate from Cambridge University.
Comment
For the record: No one familiar with the theory of statistics says that you should only use unbiased estimates, nor even that an unbiased estimate, when available, is better than a biased estimate.
Thank you very much.
There are heaps of recent "feature selection" methods that have been published and made available in the literature in the past 6 years or so. I use a few of them, like Laplacian Score, MLFS (maximum-likelihood feature selection in supervised regression/classification), HSIC-LASSO (Hilbert-Schmidt independence criterion + least absolute shrinkage and selection operator for high-dimensional feature selection in supervised regression/classification), and more... Those authors have made their MATLAB code available on the net.
So, anyone who thinks that feature-selection research is somehow not advancing enough simply hasn't followed the literature to find out what's new and what's being improved over previously existing methods.
Posted 12 April 2021