My last DSC blog left me a bit disappointed. While the loads of the beefy household and population files for the American Community Survey worked well, the data, almost entirely integer, represents categorical attributes whose metadata is not included. The data dictionary detailing those meanings is available, but the challenge of connecting the dictionary to the data is left to the analyst.
I decided to see if it'd be productive to wrangle the data dictionary in the hopes of generating code that would add the metadata to R structures -- in R parlance, transforming integer attributes to factors with levels and labels. I was determined not to spend a lot of time on the challenge, and to accept a quick and dirty solution that might give me 75% of what was needed.
It turns out that without a ton of work, I did have a modicum of success that, alas, whetted my appetite to go beyond quick and dirty. The data dictionary file was quite cooperative, with "regularities" that simplify munging.
For this V1 effort, I used Python 3.5 in Jupyter Notebook to parse the dictionary text and ultimately generate R statements that create factors from the existing numeric data. At this point, the output R code needs to be cut and pasted from the Python notebook and executed in R. This is an ugly but temporary solution. Ultimately, the hope is that parsing will generate code that'll run seamlessly in a single notebook. I already have plans for subsequent iterations.
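To give a flavor of the approach described above, here is a minimal sketch of parsing dictionary text and emitting R factor statements. It is not the notebook's actual code. The dictionary excerpt, its layout (a variable name and field width on an unindented line, followed by indented "code .label" lines), the helper names parse_dictionary and r_factor_statement, and the data frame name hh are all assumptions made for illustration; the real file may well be formatted differently.

```python
import re
from collections import OrderedDict

# Hypothetical dictionary excerpt; the real file's layout may differ.
SAMPLE = """\
ACCESS 1
    Access to the Internet
        1 .Yes, with subscription
        2 .Yes, without subscription
        3 .No Internet access

TEN 1
    Tenure
        1 .Owned with mortgage or loan
        2 .Owned free and clear
        3 .Rented
"""

# Unindented line: variable name followed by a field width.
HEADER = re.compile(r"^([A-Z][A-Z0-9]*)\s+\d+\s*$")
# Indented line: integer code, whitespace, a dot, then the label.
CODE = re.compile(r"^\s+(\d+)\s+\.(.+?)\s*$")


def parse_dictionary(text):
    """Return an ordered mapping of variable name -> [(code, label), ...]."""
    variables = OrderedDict()
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            variables[current] = []
            continue
        code = CODE.match(line)
        if code and current is not None:
            variables[current].append((int(code.group(1)), code.group(2)))
    return variables


def r_factor_statement(frame, name, pairs):
    """Build one R assignment that converts an integer column to a factor."""
    levels = ", ".join(str(code) for code, _ in pairs)
    labels = ", ".join(
        '"{}"'.format(label.replace("\\", "\\\\").replace('"', '\\"'))
        for _, label in pairs
    )
    return "{f}${n} <- factor({f}${n}, levels = c({lv}), labels = c({lb}))".format(
        f=frame, n=name, lv=levels, lb=labels
    )


parsed = parse_dictionary(SAMPLE)
statements = [r_factor_statement("hh", name, pairs) for name, pairs in parsed.items()]
print("\n".join(statements))
```

The printed lines are plain R assignments, which is what makes the cut-and-paste step into an R session possible. Description lines and blank lines match neither pattern and are simply skipped.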
Read the entire blog here.