I started a series on causal inference for data science a few weeks back. I think causal inference (CI) methodologies offer great potential for the DS discipline, given that much of our data is observational, i.e., outside experimental control.
As I noted then, "The platinum design for causal inference is the experiment where subjects are randomly assigned to the different treatment groups. With randomization, the effects of uncontrolled or confounding factors (the Z's) should, within sampling limitations, be 'equal' or 'balanced' across treatments or values of X. In such settings of 'controlled' Z's, the analyst is much more confident that a correlation between X and Y actually indicates causality.
"But what about the in-situ data gathering schemes generally seen in the DS world, where data are observational, and confounders are free to roam? What is one to do? The answer: consider causal inference techniques that attempt to statistically mimic the randomized experiment."
The question I set out to address in that analysis was what income difference, if any, it makes for individuals to hold a terminal master's degree versus a terminal bachelor's. Since we can't conduct an experiment in which people are assigned at random to either master's or bachelor's degree "treatments", it made sense to consider a CI technique such as matching to see if we could untangle the effects of the education "treatment" from uncontrolled covariates/confounders such as age, sex, marital status, and race that might differ between the education groups out of the gate.
The technique I deployed was nearest neighbor matching using the results of a propensity model detailing if/how the "treatment" covaried with the confounders. The results indicated that, if all impactful confounders had been included -- a critical assumption -- there was indeed a meaningful difference in income between the two education levels. When the matching adjustments were applied, the income difference was smaller, but still off-the-charts significant. This reduction made sense given that master's degree holders were older and more likely to be married -- characteristics that are positively related to income on their own.
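To make the mechanics of that earlier approach concrete, here is a minimal sketch of greedy 1:1 nearest neighbor matching on propensity scores. It is not the code from that analysis, which was done in R; it is plain Python for illustration only. The function name `nearest_neighbor_match`, the case ids, and the scores are all hypothetical, and the scores are assumed to have been estimated already by some propensity model.

```python
def nearest_neighbor_match(treated, controls):
    """Greedy 1:1 matching without replacement on propensity score.

    treated, controls: dicts mapping a case id to its (precomputed) score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # copy so the caller's dict is untouched
    pairs = []
    # Process treated cases from highest to lowest score (one common ordering).
    for t_id, t_score in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        # Linear scan for the closest remaining control.
        # This naive search costs O(n_treated * n_control) overall.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        pairs.append((t_id, c_id))
        del available[c_id]
    return pairs


# Hypothetical scores: estimated probability of holding a master's degree.
toy_treated = {"t1": 0.80, "t2": 0.55}
toy_controls = {"c1": 0.50, "c2": 0.75, "c3": 0.20}
toy_pairs = nearest_neighbor_match(toy_treated, toy_controls)
```

The pairwise distance search in the inner step is what makes this family of methods expensive on large files, at least in a naive implementation like this one.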
Though I was pretty happy with the results, I was less enthused about the computational intensity of the chosen technique. The calculations took over 70 minutes to complete against a random subset of 250,000 of the more than 0.5M suitable records. With that kind of performance, such models would be less than ideal for data science work.
I also discovered critiques of propensity-model-driven matching by Harvard professor Gary King et al., who are trailblazers in causal inference and authors of the popular R CI package MatchIt.
As a result, I decided for this analysis to try "exact matching" on the entire 0.5M+ record data file. Exact matching is a much simpler and computationally lighter technique that involves only basic SQL-like wrangling. It turns out that exact matching worked quite well with this data, completing the calculations against the full file in under 30 seconds. The code and results are detailed below.
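The "SQL-like wrangling" amounts to a group-by on the confounders. The sketch below is a minimal illustration of that idea in plain Python, not the R/MatchIt code used in this analysis. The function `exact_match_att`, the column names, and the toy records are all hypothetical. It groups cases into strata that share identical covariate values, drops strata lacking either a treated or a control case, and averages the within-stratum income differences weighted by the number of treated cases.

```python
from collections import defaultdict


def exact_match_att(rows, covariates, treatment, outcome):
    """Estimate the average treatment effect on the treated by exact matching.

    Returns (estimate, number of matched strata, number of matched treated).
    """
    # Group outcomes by the exact combination of covariate values (a stratum).
    strata = defaultdict(lambda: {True: [], False: []})
    for row in rows:
        key = tuple(row[c] for c in covariates)
        strata[key][bool(row[treatment])].append(row[outcome])

    # Keep only strata that contain both treated and control cases.
    matched = {k: g for k, g in strata.items() if g[True] and g[False]}
    n_treated = sum(len(g[True]) for g in matched.values())
    if n_treated == 0:
        raise ValueError("no stratum contains both treated and control cases")

    # Weight each within-stratum mean difference by its treated count.
    total = 0.0
    for g in matched.values():
        mean_t = sum(g[True]) / len(g[True])
        mean_c = sum(g[False]) / len(g[False])
        total += len(g[True]) * (mean_t - mean_c)
    return total / n_treated, len(matched), n_treated


# Made-up records (income in thousands); not data from the analysis.
toy_rows = [
    {"age": "30-39", "married": 1, "masters": 1, "income": 90},
    {"age": "30-39", "married": 1, "masters": 0, "income": 80},
    {"age": "30-39", "married": 1, "masters": 0, "income": 70},
    {"age": "20-29", "married": 0, "masters": 1, "income": 60},
    {"age": "20-29", "married": 0, "masters": 0, "income": 50},
    {"age": "40-49", "married": 1, "masters": 1, "income": 120},  # unmatched
]
att, n_strata, n_matched = exact_match_att(
    toy_rows, ["age", "married"], "masters", "income"
)
```

Because the work is a single pass to build the groups plus a pass over the strata, the cost grows roughly linearly with the number of records. The trade-off is that cases with no exact counterpart, like the last toy record, are discarded.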
The technology used in the analysis is JupyterLab with Microsoft R Open 3.4.4. The matching work uses the MatchIt, tableone, and data.table packages.
Next time I'll consider coarsened exact matching, an extension of exact matching that promotes a higher matching rate, thus potentially lowering the variance of the estimate.
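As a rough preview of why coarsening raises the matching rate, the sketch below counts how many treated cases find at least one control when matching on exact age versus a 10-year age band. It is an illustrative plain-Python toy, not the R implementation; the helper names, the band width, and the records are all hypothetical.

```python
from collections import defaultdict


def matched_treated_count(rows, key_fn):
    """Count treated cases whose stratum also contains at least one control."""
    strata = defaultdict(lambda: [0, 0])  # [n_control, n_treated]
    for row in rows:
        strata[key_fn(row)][row["treated"]] += 1
    return sum(n_t for n_c, n_t in strata.values() if n_c and n_t)


def exact_key(row):
    # Match on the exact age in years.
    return (row["age"], row["sex"])


def coarsened_key(row, width=10):
    # Replace exact age with the start of its band, e.g. 34 -> 30.
    return (row["age"] // width * width, row["sex"])


# Made-up cases; "treated" is 1 for treated and 0 for control.
toy_cases = [
    {"age": 34, "sex": "F", "treated": 1},
    {"age": 36, "sex": "F", "treated": 0},
    {"age": 41, "sex": "M", "treated": 1},
    {"age": 41, "sex": "M", "treated": 0},
    {"age": 52, "sex": "M", "treated": 1},
]
exact_n = matched_treated_count(toy_cases, exact_key)
coarse_n = matched_treated_count(toy_cases, coarsened_key)
```

In this toy, the 34-year-old treated case only finds a match once ages are banded, at the price of treating a 34-year-old and a 36-year-old as equivalent.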