Simpson’s Paradox in the Age of Real Time Analytics

Summary: Simpson's Paradox is a source of risk for real time analytics and for the citizen data scientist.

Most of us practicing the predictive arts know to look for sources of bias in our data.  There are seven that are common, the first six of which are:

  1. Confirmation Bias
  2. Selection Bias
  3. Outliers
  4. Overfitting and Underfitting
  5. Confounding Variables
  6. Non-normality

The 7th, and the one for today's thoughts, is Simpson's Paradox, sometimes described as a special case of confounding variables.  While all of these sources of bias can lead to answers that are probably directionally correct but wrong in the specifics of the forecast, Simpson's Paradox can give you an answer that is directionally wrong altogether.

There are many famous examples of this, but for the sake of simplicity let's use this very short one for review.  Two drugs, A and B, are being tested and compared for efficacy.  There are two different observations, call them Test 1 and Test 2.  From the results of those tests it is evident to the observers of Test 1 and Test 2 that Drug B is more effective, having cured a higher percentage of patients in both instances.

But a sharp data scientist might question this, wondering whether Simpson's Paradox could be at work.  He examines the combined results and finds:

When the data are combined the result is exactly the opposite of the first conclusion.  In real life this could have been disastrous.
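Since the tables themselves don't reproduce well here, a minimal sketch in Python with purely hypothetical counts (the numbers below are mine, chosen only to reproduce the reversal, not taken from the original tables) shows the mechanics:

```python
# Hypothetical counts illustrating Simpson's Paradox:
# Drug B wins in each test, yet Drug A wins when the tests are combined.
tests = {
    "Test 1": {"A": (63, 90), "B": (8, 10)},   # (cured, treated)
    "Test 2": {"A": (4, 10),  "B": (45, 90)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for name, groups in tests.items():
    for drug, (cured, treated) in groups.items():
        totals[drug][0] += cured
        totals[drug][1] += treated
        print(f"{name} Drug {drug}: {cured}/{treated} = {cured/treated:.0%}")

for drug, (cured, treated) in totals.items():
    print(f"Combined Drug {drug}: {cured}/{treated} = {cured/treated:.0%}")
```

Drug B cures 80% and 50% against Drug A's 70% and 40% in the individual tests, yet combined, Drug A cures 67% against Drug B's 53%.  The reversal happens because each drug was tested on very different numbers of patients in each group.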

And this is exactly what Simpson's Paradox is: a trend that appears in different groups of data but disappears or reverses when those groups are combined.  This is not meant to be a comprehensive review, so I leave it to you to explore some of the other ways Simpson's Paradox can confuse and misdirect.

So why the focus on Simpson's Paradox, and why now?  Two reasons.

  1. Real Time Analytics:  The entire thrust of real time analytics is to be able to spot a pattern and take action in shorter and shorter time periods.  The shorter the time periods, the more likely it is that the true overall trend is masked by short-term misdirections.  If you're involved in real time analytics, I suggest you do a serious risk analysis of what the consequences would be for your employer or client if you were misdirected by Simpson's Paradox and took exactly the wrong action.
  2. Citizen Data Scientists:  This is the term that Gartner has given to well-intentioned managers who are given access to data and even some predictive analytic tools with the intent that they discover insights for themselves.  A very significant portion of software development in predictive analytics is attempting to automate data science to the point that Citizen Data Scientists can achieve this goal.  You see software offerings in automated data prep and cleansing, heavily templated and simplified tools for regression and decision trees, and of course lots and lots of data viz tools which are supposed to create the ability to “see” the answer in complex correlations just by looking.  All of this is exacerbated by the shortage of data scientists and the inexperience of most organizations in how to implement predictive analytics.  If you are relying on heavily templated and packaged software and have no awareness of what's going on under the hood, what's the likelihood that you would spot this bias, or for that matter any of the other six?  Pretty much none.
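The real-time risk can be sketched with a toy data stream (all numbers invented for illustration): a metric that rises within every short analysis window while the overall trend is falling, because each window starts from a lower level than the last.

```python
# Hypothetical metric sampled over three short analysis windows.
# Within each window the metric trends up; across the full stream it trends down.
windows = [
    [100, 102, 104, 106],   # window 1: rising
    [80, 82, 84, 86],       # window 2: rising
    [60, 62, 64, 66],       # window 3: rising
]

def slope(series):
    """Average step-to-step change across a series."""
    return (series[-1] - series[0]) / (len(series) - 1)

for i, w in enumerate(windows, 1):
    print(f"Window {i} slope: {slope(w):+.2f}")   # positive in every window

full_stream = [x for w in windows for x in w]
print(f"Full-stream slope: {slope(full_stream):+.2f}")   # negative overall
```

A real-time system acting on any single window would conclude the metric is improving; pooling the stream shows the opposite.  This is the temporal face of the same paradox.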

So let’s not forget the basics of questioning data for its hidden biases especially as data speeds up and intervals of analysis become shorter and shorter.  And if you’re interacting with the Citizen Data Scientists in your organization who are getting carried away frolicking through the tea leaves let’s (gently) instruct them in these inherent dangers.

September 1, 2015

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.


About the author:  Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  Bill is also Editorial Director for Data Science Central.  He can be reached at:

[email protected] or [email protected]



Comment by Jason Bowles on June 7, 2016 at 5:02am

I've been experimenting with weighted ranking lately and if you applied some basic weighting to the example above drug B still comes out on top in both tests individually, but the difference is less significant, to the point that I personally would investigate a little further before making a decision.

Just out of curiosity, anyone else weighting the results before comparing?  (especially when sample sizes are different)

Comment by Sione Palu on June 6, 2016 at 1:29pm

I would add data sparsity to the list.

Comment by Michael Clayton on June 2, 2016 at 2:15pm

Prediction is fuzzy but necessary work.  

2016 election promises modeled for tax and revenue likely results should be really fuzzy..and scary.  

But we will see lots of predictions based on models used by agents with high-sounding names, restudied by media "big data" teams to re-spin the results.  Then there is Nate Silver, who surprised the Karl Roves of the world when Romney lost, earlier and more decisively than Karl's team was telling him.  All with merged survey data, as I understand it.

Thanks for the full year and a half of great articles to get the full flavor of why "all models are wrong, but some are useful."  (GEP Box) 

Comment by Dane Palmer-Illingsworth on September 2, 2015 at 11:11pm

Great post. I particularly like point 2. More software abstracting analysis methodologies coupled with "well-intentioned", but untrained users is a recipe for disaster. 
