Summary: Simpson’s Paradox. A source of risk for real-time analytics and for the citizen data scientist.
Most of us practicing the predictive arts know to look for sources of bias in our data. There are seven that are common, the first six of which are:
The 7th, and the one for today’s thoughts, is Simpson’s Paradox, sometimes described as a special case of confounding variables. And while all of these sources of bias can lead to answers that are probably directionally correct but wrong in the specifics of the forecast, Simpson’s Paradox can give you an answer that is directionally wrong altogether.
There are many famous examples of this, but for the sake of simplicity let’s use this very short one for review. Two drugs, A and B, are being tested and compared for efficacy. There are two different observations, call them Test 1 and Test 2. The results of those tests are:

It is evident to the observers of Tests 1 and 2 that Drug B is more effective, having cured a higher percentage of patients in both instances.
But a sharp data scientist might question this, wondering if Simpson’s Paradox could be at work. He examines the combined results and finds:
When the data are combined, the result is exactly the opposite of the first conclusion. In real life this could have been disastrous.
And this is exactly what Simpson’s Paradox is: a trend that appears in different groups of data disappears or reverses when those groups are combined. This is not meant to be a comprehensive review, so I leave it to you to explore some of the other ways Simpson’s Paradox can confuse and misdirect.
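Since the original tables are not reproduced in this post, here is a minimal sketch of how such a reversal can arise. The cure counts below are purely hypothetical, invented only to illustrate the mechanism: Drug B wins within each test, yet Drug A wins when the tests are pooled, because the two drugs were tried on very different numbers of patients in each test.

```python
# Hypothetical (cured, total) counts -- NOT the figures from the post's tables.
trials = {
    "Test 1": {"A": (60, 100), "B": (9, 10)},
    "Test 2": {"A": (1, 10), "B": (30, 100)},
}

# Within each test, Drug B has the higher cure rate.
for test, results in trials.items():
    for drug, (cured, total) in results.items():
        print(f"{test} Drug {drug}: {cured}/{total} = {cured / total:.0%}")

# But pooling the two tests reverses the conclusion:
# Drug A's large group in Test 1 dominates its combined rate.
for drug in ("A", "B"):
    cured = sum(trials[t][drug][0] for t in trials)
    total = sum(trials[t][drug][1] for t in trials)
    print(f"Combined Drug {drug}: {cured}/{total} = {cured / total:.0%}")
```

The reversal comes entirely from the unequal sample sizes: each drug’s combined rate is a sample-size-weighted average of its per-test rates, so the subgroup with more patients dominates.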
So why the focus on Simpson’s Paradox, and why now? Two reasons.
So let’s not forget the basics of questioning data for its hidden biases, especially as data speeds up and intervals of analysis become shorter and shorter. And if you’re interacting with the citizen data scientists in your organization who are getting carried away frolicking through the tea leaves, let’s (gently) instruct them in these inherent dangers.
For more on Simpson’s Paradox try these links:
http://plato.stanford.edu/entries/paradox-simpson/
https://en.wikipedia.org/wiki/Simpson%27s_paradox
September 1, 2015
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. Bill is also Editorial Director for Data Science Central. He can be reached at:
[email protected] or [email protected]
The original blog can be seen here.
Comment
I've been experimenting with weighted ranking lately, and if you applied some basic weighting to the example above, Drug B still comes out on top in both tests individually, but the difference is less significant, to the point that I personally would investigate a little further before making a decision.
Just out of curiosity, anyone else weighting the results before comparing? (especially when sample sizes are different)
I would add data sparsity to the list.
Prediction is fuzzy but necessary work.
Models of the 2016 election promises' likely tax and revenue results should be really fuzzy... and scary.
But we will see lots of predictions based on models used by agents with high-sounding names, restudied by media "big data" teams to re-spin the results. Then there is Nate Silver, who surprised the Karl Roves of the world when Romney lost, contrary to what Karl's team had been telling him. All with merged survey data, as I understand it.
Thanks for the full year and a half of great articles to get the full flavor of why "all models are wrong, but some are useful." (G. E. P. Box)
Great post. I particularly like point 2. More software abstracting analysis methodologies coupled with "well-intentioned", but untrained users is a recipe for disaster.