Probably the worst error is believing there is a correlation when that correlation is purely artificial. Take a data set with 100,000 variables and, say, 10 observations. Compute all the (99,999 * 100,000) / 2 cross-correlations. You are almost guaranteed to find some above 0.99 by chance alone, even if all the variables are independent noise. This is best illustrated in my article How to Lie with P-values, which also discusses how to handle and fix the problem.
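To make this concrete, here is a minimal simulation sketch. It assumes independent standard normal noise and uses only 1,000 variables (not 100,000) so that it runs quickly; the function name and default values are illustrative choices, not taken from the article mentioned above.

```python
import numpy as np

def max_spurious_correlation(n_vars=1000, n_obs=10, seed=0):
    """Largest absolute correlation among independent noise variables."""
    rng = np.random.default_rng(seed)
    # Each row is one variable, each column one observation (assumed layout).
    data = rng.standard_normal((n_vars, n_obs))
    corr = np.corrcoef(data)                    # n_vars x n_vars matrix
    upper = np.triu_indices(n_vars, k=1)        # each pair counted once
    pair_corr = corr[upper]
    return float(np.abs(pair_corr).max()), pair_corr.size

best, n_pairs = max_spurious_correlation()
print(f"{n_pairs} pairs, largest |correlation| = {best:.3f}")
```

None of these variables is related to any other, yet the largest correlation found is very high. Increasing `n_vars` pushes it even closer to 1.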

This is being done on such a large scale that I think it is probably the main cause of fake news, and the impact is disastrous on people who take for granted what they read in the news or hear from the government. Some people are sent to jail based on evidence tainted with major statistical flaws. Government money is spent, propaganda is generated, wars are started, and laws are created based on false evidence. Sometimes the data scientist has no choice but to knowingly cook the numbers to keep her job. Usually, these “bad stats” end up featured in beautiful but faulty visualizations: axes are truncated, charts are distorted, and observations and variables are carefully chosen just to make a (wrong) point.

Trusting data is another big source of errors. What’s the point of building a 99% accurate model if your data is 20% faulty or, worse, if you failed to gather the right kind of data or the right predictors to start with? Also, models with no sound cross-validation are bound to fail. In Fintech, you can do back-testing to check a model, but it is useless: what you need is called *walk-forward* testing. Split your past data into two sets, older data and the most recent data, then train the model on the older set and test it on the most recent one. Walk-forward is akin to testing your model on future data that is already in your possession; in machine learning lingo, it is a form of cross-validation. And then, you need to do it right: if the two sets are too similar, you may end up with overfitting issues that go undetected.
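One common way to organize walk-forward testing is with an expanding training window, sketched below. This is an assumption about the setup, not a prescribed procedure: the function name `walk_forward_splits` and its parameters are hypothetical, and observations are assumed to be sorted from oldest to most recent.

```python
from typing import Iterator, Tuple

def walk_forward_splits(
    n_samples: int, initial_train: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges; the test block always follows the training block in time."""
    start = initial_train
    while start + test_size <= n_samples:
        train_idx = range(0, start)                  # all data older than the test block
        test_idx = range(start, start + test_size)   # the next, more recent block
        yield train_idx, test_idx
        start += test_size                           # move the window forward in time

for train_idx, test_idx in walk_forward_splits(10, initial_train=4, test_size=2):
    print(list(train_idx), "->", list(test_idx))
```

The model is never evaluated on data older than what it was trained on, which is the property that distinguishes this from a random split.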

Trusting the R-squared is another source of potential problems. It depends on your sample size, so you can’t compare results for two data sets of different sizes, and it is sensitive to outliers. Search for alternatives to R-squared to find a solution. Also, using the normal distribution as a panacea leads to many problems when dealing with data that has a different tail, or that is not unimodal or not symmetric. Sometimes a simple transformation, such as a logistic or logarithmic transform, will fix the issue.
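The sensitivity of R-squared to outliers is easy to see on synthetic data. The sketch below assumes a simple least-squares line fit, with made-up data: 30 points of pure noise, then the same points plus a single extreme observation at (100, 100).

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a least-squares straight-line fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.standard_normal(30)
y = rng.standard_normal(30)          # generated independently of x

# Same data plus one extreme point (hypothetical outlier).
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)

print(f"R-squared without outlier: {r_squared(x, y):.3f}")
print(f"R-squared with outlier:    {r_squared(x_out, y_out):.3f}")
```

A single observation is enough to move the R-squared from a low value to nearly 1, even though the other 30 points carry no relationship at all.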

Even the choice of metrics can have huge consequences and lead to different conclusions based on the same data set. If your conclusions should be the same regardless of whether you use miles or yards, then choose scale-invariant modeling techniques.
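As an illustration of the miles-versus-yards issue, the sketch below finds the nearest neighbor of a query point with plain Euclidean distance, then again after dividing each feature by its standard deviation. The data, the second feature, and the function name `nearest_index` are all made up for the example; standardizing is only one of several ways to get scale invariance.

```python
import numpy as np

YARDS_PER_MILE = 1760.0

def nearest_index(points, query, standardize=False):
    """Index of the point closest to query (Euclidean distance)."""
    points = np.asarray(points, dtype=float)
    query = np.asarray(query, dtype=float)
    if standardize:
        scale = points.std(axis=0)   # per-feature spread; cancels any unit change
        points = points / scale
        query = query / scale
    return int(np.argmin(np.linalg.norm(points - query, axis=1)))

# Column 0: a distance in miles. Column 1: some other feature (hypothetical).
points_miles = np.array([[1.0, 10.0], [0.2, 600.0], [3.0, 300.0]])
query_miles = np.array([0.0, 0.0])

# The same data with column 0 expressed in yards.
unit_change = np.array([YARDS_PER_MILE, 1.0])
points_yards = points_miles * unit_change
query_yards = query_miles * unit_change

print("raw:", nearest_index(points_miles, query_miles),
      nearest_index(points_yards, query_yards))
print("standardized:", nearest_index(points_miles, query_miles, True),
      nearest_index(points_yards, query_yards, True))
```

With raw distances, the nearest neighbor changes when the unit changes. After standardizing, the answer is the same in miles and in yards.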

Missing data is often handled inappropriately, for instance by replacing it with averages computed on the available observations, even though better *imputation* techniques exist. But what if that data is missing precisely because it behaves differently from your average? Think about surveys or Amazon reviews. Who writes reviews and who does not? Of course the two categories of people are very different, and what’s more, the vast majority of people never write reviews, so reviews are based on a tiny, skewed sample of the users. The fix here is to blend a few professional reviews with those from regular users, and to score the users correctly to give the reader a better picture. If you fail to do it, soon enough all readers will know that your reviews are not trustworthy, and you might as well remove all reviews from your website, get rid of the data scientists working on the project, save a lot of money, and improve your business brand.
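The review example can be simulated in a few lines. The sketch below rests on invented assumptions: ratings from 1 to 5 are equally likely, and users who would give 1 star write a review 30% of the time while everyone else does so 2% of the time. These numbers are purely illustrative.

```python
import numpy as np

def simulate_review_bias(n_users=100_000, seed=0):
    """Compare the true mean rating with the mean seen in written reviews."""
    rng = np.random.default_rng(seed)
    ratings = rng.integers(1, 6, size=n_users).astype(float)   # true opinions, 1 to 5

    # Assumed behavior: unhappy users are far more likely to write a review.
    p_review = np.where(ratings == 1, 0.30, 0.02)
    wrote_review = rng.random(n_users) < p_review

    observed = np.where(wrote_review, ratings, np.nan)          # NaN = no review written
    observed_mean = np.nanmean(observed)

    # Mean imputation: fill every missing rating with the observed average.
    imputed = np.where(np.isnan(observed), observed_mean, observed)
    return float(ratings.mean()), float(observed_mean), float(imputed.mean())

true_mean, observed_mean, imputed_mean = simulate_review_bias()
print(f"true mean: {true_mean:.2f}")
print(f"mean of written reviews: {observed_mean:.2f}")
print(f"mean after mean imputation: {imputed_mean:.2f}")
```

Filling the gaps with the observed average simply reproduces the bias of the people who chose to respond. It cannot recover the opinions of those who stayed silent.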

Much of this is discussed (with fixes) in my recent book *Statistics: new foundations, toolbox, and machine learning recipes*, available (for free) here.

© 2020 Data Science Central ®
