Summary: Simpson’s Paradox. A source of risk for real-time analytics and for the citizen data scientist.
Most of us practicing the predictive arts know to look for sources of bias in our data. There are seven that are common, the first six of which are:
The 7th and the one for today’s thoughts is Simpson’s Paradox, sometimes described as a special case of Confounding Variables. And while all these sources of bias can lead to answers that are probably directionally correct but wrong in the specifics of the forecast, the impact of Simpson’s Paradox can be to give you totally and directionally the wrong answer.
There are many famous examples of this, but for the sake of simplicity let’s use this very short one for review. Two drugs, A and B, are being tested and compared for efficacy. There are two different observations, call them Test 1 and Test 2. The results of those tests are:

It is evident to the observers of Test 1 and 2 that Drug B is more effective, having cured a higher percentage of patients in both instances.
But a sharp data scientist might question this, wondering if Simpson’s Paradox could be at work. He examines the combined results and finds:
When the data are combined the result is exactly the opposite of the first conclusion. In real life this could have been disastrous.
And this is exactly what Simpson’s Paradox is: a trend that appears in different groups of data disappears or reverses when these groups are combined. This is not meant to be a comprehensive review, so I leave it to you to explore some of the other ways Simpson’s Paradox can confuse and misdirect.
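To make the reversal concrete, here is a minimal Python sketch of the two-drug, two-test setup. The counts are hypothetical, invented purely for illustration, and are not the figures from the tests described above; the helper names (`cure_rate`, `better_drug`) are likewise just assumptions for this sketch. They are chosen so that Drug B has the higher cure rate within each test while Drug A has the higher cure rate once the tests are pooled.

```python
from fractions import Fraction

# Hypothetical (cured, treated) counts, invented for illustration only.
# They are NOT the numbers from the original tests.
results = {
    "Test 1": {"A": (63, 90), "B": (8, 10)},
    "Test 2": {"A": (4, 10), "B": (45, 90)},
}


def cure_rate(cured, treated):
    """Exact cure rate as a fraction, avoiding floating-point ties."""
    return Fraction(cured, treated)


def better_drug(rates):
    """Return the drug name with the highest cure rate."""
    return max(rates, key=rates.get)


# Winner within each test, looked at separately.
per_test_winner = {
    test: better_drug({drug: cure_rate(*counts) for drug, counts in arms.items()})
    for test, arms in results.items()
}

# Pool the raw counts across tests, then recompute the rates.
combined = {}
for arms in results.values():
    for drug, (cured, treated) in arms.items():
        c, t = combined.get(drug, (0, 0))
        combined[drug] = (c + cured, t + treated)

combined_rates = {drug: cure_rate(*counts) for drug, counts in combined.items()}
combined_winner = better_drug(combined_rates)

print("Per-test winner:", per_test_winner)
print("Combined rates:", {d: float(r) for d, r in combined_rates.items()})
print("Combined winner:", combined_winner)
```

With these made-up counts the reversal comes from unequal group sizes: most of Drug A’s patients sit in the test where cure rates are high for both drugs, while most of Drug B’s patients sit in the test where they are low, so pooling weights the two drugs very differently.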
So why the focus on Simpson’s Paradox, and why now? Two reasons.
So let’s not forget the basics of questioning data for its hidden biases, especially as data speeds up and intervals of analysis become shorter and shorter. And if you’re interacting with the Citizen Data Scientists in your organization who are getting carried away frolicking through the tea leaves, let’s (gently) instruct them in these inherent dangers.
For more on Simpson’s Paradox try these links:
September 1, 2015
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. Bill is also Editorial Director for Data Science Central. He can be reached at: