**The Fallacies of Data Science** Adnan Masood, PhD. & David Lazar

- Correlation = Causation, and Big Data = Information and Insights because Data Context Doesn't Matter.
- The random nature of the event drives the distribution; therefore, the likely distribution also drives the events.
- Base Rate Fallacy only applies to small data-sets.
- Data dredging is negatively correlated with data size, i.e. the number of spurious correlations decreases with the number of dimensions of a data-set.
- In Data Science, past performance implies future results! Modeling assumptions can be held as absolute truths after experiments, and variables are normally distributed unless otherwise specified.
- Random sampling in experiment design and hypothesis testing is optional. Of course, real-world data sets don’t have cross-validation "leakage".
- Extrapolating beyond the range of training data, especially in the case of time series data, is fine provided the data-set is large enough.
- Strong evidence is the same as proof! Prediction intervals and confidence intervals are the same thing, just like statistical significance and practical significance.
- Measurement Doesn't Change the System. Increasing the number of features increases the model's significance and accuracy.
- Over- and under-fitting of a model can be addressed irrespective of the bias-variance trade-off.
**Bonus:** Renaming your Analytics dept. to Data Science dept. gives you a data science discipline & specialty overnight.
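
To see why the data-dredging item in the list above is a fallacy, here is a minimal simulation sketch. It correlates pure-noise features against a pure-noise target and counts how many clear a cutoff. The function names (`pearson`, `count_spurious`), the 100-row sample size, and the 0.2 cutoff are illustrative assumptions, not part of the original post.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_rows, n_features, threshold, seed=0):
    """Count noise features whose |r| with a noise target exceeds threshold.

    Both the target and every feature are independent Gaussian noise,
    so any correlation that clears the threshold is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson(feature, target)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    # With 100 rows, |r| > 0.2 is roughly the conventional p < 0.05 cutoff
    # (an assumption chosen for illustration).
    for n_features in (10, 100, 1000):
        hits = count_spurious(n_rows=100, n_features=n_features, threshold=0.2)
        print(f"{n_features:5d} noise features -> {hits} spurious correlations")
```

Because roughly 5% of the noise features should clear the cutoff by chance alone, the count of spurious correlations tends to grow with the number of dimensions rather than shrink.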

*Thanks to Dr. Jim Java for reading the earlier draft and providing comments.*

*Original: http://blog.adnanmasood.com/2016/05/25/the-fallacies-of-data-science/*

© 2019 Data Science Central ®
