I started a series on causal inference (CI) for data science (DS) a few weeks back. I think CI methodologies offer great potential for the DS discipline, given that much of our data is observational, i.e., outside experimental control.

As I noted then, "The platinum design for causal inference is the experiment where subjects are randomly assigned to the different treatment groups. With randomization, the effects of uncontrolled or confounding factors (the Z's) should, within sampling limitations, be 'equal' or 'balanced' across treatments or values of X. In such settings of 'controlled' Z's, the analyst is much more confident that a correlation between X and Y actually indicates causality.

But what about the in-situ data gathering schemes generally seen in the DS world, where data are observational, and confounders are free to roam? What is one to do? The answer: consider causal inference techniques that attempt to statistically mimic the randomized experiment."

In that blog, I introduced data from the American Community Survey. Details of the data set construction can be found there.

The question I set out to address with that data is what income difference, if any, it makes for individuals to hold a terminal master's degree versus a terminal bachelor's. Since we can't conduct an experiment in which people are assigned at random to either master's or bachelor's degree "treatments", it made sense to consider a CI technique such as matching to see if we could untangle the effects of the education "treatment" from uncontrolled covariates/confounders such as age, sex, marital status, and race that might differ between the education groups from the outset.

The technique I deployed was nearest neighbor matching using the results of a propensity model detailing if/how the "treatment" covaried with the confounders. The results indicated that, if all impactful confounders had been included -- a critical assumption -- there was indeed a meaningful difference in income between the two education levels. Moreover, when the matching adjustments were applied, the income difference was smaller, but still off-the-charts significant. This reduction made sense given that master's degreed cases were older and more likely to be married -- characteristics that are positively related to income on their own.
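To make the matching step concrete, here is a minimal sketch of greedy 1:1 nearest neighbor matching on propensity scores. It is plain Python for illustration only, not the MatchIt code used in the earlier post. The unit ids and scores are hypothetical, the scores are assumed to come from an already-fitted propensity model, and processing treated units from highest score to lowest is one common convention assumed here.

```python
def nearest_neighbor_match(treated, controls):
    """Greedy 1:1 matching without replacement on propensity score.

    treated, controls: dicts mapping unit id -> estimated propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs = []
    # Visit treated units from highest to lowest score (assumed ordering).
    for t_id, t_score in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break  # ran out of controls
        # Pick the remaining control whose score is closest to this unit's.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        pairs.append((t_id, c_id))
        del available[c_id]
    return pairs


# Hypothetical scores for illustration.
treated = {"t1": 0.80, "t2": 0.55}
controls = {"c1": 0.30, "c2": 0.52, "c3": 0.75, "c4": 0.60}

pairs = nearest_neighbor_match(treated, controls)
print(pairs)  # [('t1', 'c3'), ('t2', 'c2')]
```

Note that this naive version scans every remaining control for each treated unit, so its cost grows with the product of the two group sizes.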

Though I was pretty happy with the results, I was less enthused about the computational intensity of the chosen technique. The calculations took over 70 minutes to complete against a random subset of 250,000 of the more than 0.5M suitable records. With that kind of performance, such models would be less than ideal for data science work.

I also discovered critiques of propensity model-driven matching by Harvard professor Gary King et al., who are trailblazers in causal inference and authors of the popular R CI package, MatchIt.

As a result, I decided for this analysis to try "exact matching" (em) on the entire 0.5M+ data file. Exact matching is a much simpler and computationally more benign technique that only involves basic SQL-like wrangling. It turns out that em worked quite well with this data, completing calculations against the full file in under 30 seconds. The code and results are detailed below.
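As a rough illustration of why exact matching amounts to group-by style wrangling, the sketch below groups records by their full covariate combination, keeps only the strata that contain both education groups, and averages the within-stratum income differences, weighted by the number of treated cases. It is plain Python with hypothetical toy records, not the MatchIt/data.table code from the analysis, and the choice of the treated-weighted estimate (ATT) is an assumption made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (treated, age_group, sex, married, income).
# treated = 1 for a terminal master's, 0 for a terminal bachelor's.
records = [
    (1, "30-39", "F", True, 82000),
    (0, "30-39", "F", True, 70000),
    (0, "30-39", "F", True, 74000),
    (1, "40-49", "M", True, 98000),
    (0, "40-49", "M", True, 90000),
    (1, "50-59", "M", False, 91000),  # no bachelor's counterpart: dropped
    (0, "20-29", "F", False, 48000),  # no master's counterpart: dropped
]


def exact_match_att(rows):
    """Estimate the effect on the treated (ATT) via exact matching."""
    # Group incomes by the exact covariate combination and treatment status.
    strata = defaultdict(lambda: {0: [], 1: []})
    for treated, *covariates, income in rows:
        strata[tuple(covariates)][treated].append(income)

    # Keep only strata that contain both treated and control cases.
    matched = {k: g for k, g in strata.items() if g[0] and g[1]}
    n_treated = sum(len(g[1]) for g in matched.values())
    if n_treated == 0:
        raise ValueError("no stratum contains both groups")

    # Weight each stratum's mean difference by its number of treated cases.
    att = sum(
        len(g[1]) * (mean(g[1]) - mean(g[0])) for g in matched.values()
    ) / n_treated
    return att, matched


att, matched = exact_match_att(records)
naive = mean(r[-1] for r in records if r[0] == 1) - mean(
    r[-1] for r in records if r[0] == 0
)
print(round(naive), att)  # 19833 9000.0
```

In this toy data the matched estimate is smaller than the raw difference in means because the unmatched cases, which differ most on the covariates, are discarded.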

The technology used in the analysis is JupyterLab with Microsoft R Open 3.4.4. For the matching work, the MatchIt, tableone, and data.table packages are deployed.

Next time I'll consider coarsened exact matching, an extension to em that promotes a higher matching rate, thus potentially lowering estimate variance.
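A small sketch of the coarsening idea, under stated assumptions: the units are hypothetical, and ten-year age bands are an arbitrary binning chosen for illustration rather than what any particular CEM implementation would pick. Matching on exact age leaves most units without a counterpart, while matching on the coarsened age retains all of them.

```python
from collections import defaultdict

# Hypothetical units: (treated, age, sex).
units = [
    (1, 31, "F"), (0, 34, "F"),
    (1, 42, "M"), (0, 47, "M"),
    (1, 55, "F"), (0, 55, "F"),
]


def matched_count(rows, coarsen=None):
    """Count units falling in strata that hold both treated and controls."""
    strata = defaultdict(lambda: [0, 0])  # key -> [n_control, n_treated]
    for treated, age, sex in rows:
        key = (coarsen(age) if coarsen else age, sex)
        strata[key][treated] += 1
    return sum(n0 + n1 for n0, n1 in strata.values() if n0 and n1)


def decade(age):
    """Coarsen age to the start of its ten-year band (assumed binning)."""
    return age // 10 * 10


exact_n = matched_count(units)           # match on exact age
coarse_n = matched_count(units, decade)  # match on age band
print(exact_n, coarse_n)  # 2 6
```

The trade-off is that wider bins match more units but allow more covariate imbalance to remain within each stratum.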

Find the remainder of the blog here.

© 2019 Data Science Central
