Data Dictionary to Meta Data II — Simple Text Wrangling and Factor Creation in R


My blog last week took a first shot at automating the creation of meta data for the American Community Survey 2012-2016 household data set, using its published data dictionary (DD). I used Python to wrangle the DD, ultimately generating the R syntax to convert many of the data.table’s integer columns to R factors with levels and labels. While this “worked” and the Python code was simple, moving between the Python wrangling notebook and the R data creation notebook in Jupyter was clunky and error-prone.

This notebook takes that work to what I hope is the next level of sophistication. The code is now entirely R: it munges the data dictionary file to produce R syntax, then applies the generated code to the household data table to convert many of its attributes from integers to factors.

The work stream consists of several steps. Task one is to parse/wrangle the data dictionary file. The two-part wrangling strategy I adopted is akin to my dental hygienist’s teeth cleaning approach: first use the “ultrasonic tools”, grep and gsub, to clear the heavy deposits. The ultrasonic pass progressively eliminates unhelpful lines from further scrutiny, then makes several global replacements on those that remain. Then switch to the “hand tools”, R named lists and loops, for the more refined cleaning (programming) work.
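To make the “ultrasonic” pass concrete, here is a minimal sketch of grep and gsub at work. The `dd` lines are a made-up excerpt, and the layout (attribute header, description, then indented `code .label` rows) is my assumption for illustration; the real data dictionary file may well be laid out differently, and these are not the notebook's actual patterns.

```r
# Hypothetical excerpt; the real data dictionary layout may differ.
dd <- c(
  "HOUSING RECORD",
  "",
  "TEN         1",
  "    Tenure",
  "           b .N/A (GQ/vacant)",
  "           1 .Owned with mortgage or loan",
  "           2 .Owned free and clear",
  "           3 .Rented",
  "",
  "VEH         1",
  "    Vehicles available",
  "           0 .No vehicles",
  "           1 .1 vehicle"
)

# Pass 1 (grep): drop blank lines and the section banner.
keep <- grep("^\\s*$|^HOUSING RECORD$", dd, invert = TRUE, value = TRUE)

# Pass 2 (gsub): global replacements on the surviving lines.
clean <- gsub("^\\s+", "", keep)       # strip leading whitespace
clean <- gsub("\\s+\\.", "|", clean)   # "1 .Owned" becomes "1|Owned"
clean <- gsub("\\s{2,}", " ", clean)   # collapse runs of spaces

print(clean)
```

Each pass shrinks or normalizes the text a little more, so the later, finer-grained code only has to deal with header lines and `code|label` lines.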

Those tools initially create an R named list “dictionary” to hold the attribute names along with their codes and labels from the DD file. The next step is to populate a second dictionary that contains the syntax that’ll drive the factor creation, and then to execute that syntax dynamically.
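A rough sketch of the “hand tools” step follows. It starts from already-cleaned lines in an assumed `code|label` form; the names `clean`, `dict`, `syntax` and `hh` are mine, chosen for illustration, and are not taken from the notebook. Non-numeric codes (such as a blank or `b` code) are left out of this toy input, since they would need separate handling.

```r
# Hypothetical cleaned lines: attribute headers, descriptions, code|label rows.
clean <- c(
  "TEN 1", "Tenure",
  "1|Owned with mortgage or loan", "2|Owned free and clear", "3|Rented",
  "VEH 1", "Vehicles available",
  "0|No vehicles", "1|1 vehicle"
)

# First dictionary: attribute name -> codes and labels.
dict <- list()
current <- NULL
for (line in clean) {
  if (grepl("^[A-Z0-9]+ [0-9]+$", line)) {
    # Header line such as "TEN 1": start a new entry.
    current <- sub(" .*$", "", line)
    dict[[current]] <- list(codes = character(0), labels = character(0))
  } else if (!is.null(current) && grepl("|", line, fixed = TRUE)) {
    # code|label line: append to the current attribute.
    parts <- strsplit(line, "|", fixed = TRUE)[[1]]
    dict[[current]]$codes  <- c(dict[[current]]$codes, parts[1])
    dict[[current]]$labels <- c(dict[[current]]$labels, parts[2])
  }
  # Anything else (descriptions) is ignored in this sketch.
}

# Second dictionary: attribute name -> R syntax that builds the factor.
# deparse() writes the vectors as valid R source, quoting labels safely.
syntax <- list()
for (nm in names(dict)) {
  syntax[[nm]] <- sprintf(
    "hh[['%s']] <- factor(hh[['%s']], levels = %s, labels = %s)",
    nm, nm,
    paste(deparse(as.integer(dict[[nm]]$codes)), collapse = " "),
    paste(deparse(dict[[nm]]$labels), collapse = " ")
  )
}

cat(syntax$TEN, "\n")
```

Each generated string can then be run with `eval(parse(text = ...))` against a table named `hh`, which is what “executing the syntax dynamically” amounts to in this sketch.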

The final step loops through all household attributes that are “qualified”, converting integers to factors using the structures created above. As a preliminary check on the work, I then compute frequencies on several of the variables before and after conversion.
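The conversion loop and the before/after check might look something like the sketch below. The toy `hh` data.frame stands in for the household data.table, the `syntax` strings are hypothetical generated code, and my definition of “qualified” (the attribute exists in the table and is stored as integer) is an assumption, not necessarily the notebook's rule.

```r
# Toy stand-in for the household table (the post uses a data.table).
hh <- data.frame(
  TEN = c(1L, 3L, 2L, 1L, NA),
  VEH = c(0L, 1L, 1L, 0L, 1L),
  NP  = c(2L, 1L, 4L, 3L, 1L)
)

# Hypothetical generated syntax, keyed by attribute name.
syntax <- list(
  TEN = "hh[['TEN']] <- factor(hh[['TEN']], levels = 1:3, labels = c('Owned with mortgage or loan', 'Owned free and clear', 'Rented'))",
  VEH = "hh[['VEH']] <- factor(hh[['VEH']], levels = 0:1, labels = c('No vehicles', '1 vehicle'))",
  ZZZ = "hh[['ZZZ']] <- factor(hh[['ZZZ']], levels = 1, labels = 'n/a')"
)

before <- hh  # keep the integer version for comparison

# Assumed rule: an attribute qualifies if it is in the table and is integer.
qualified <- Filter(
  function(nm) nm %in% names(hh) && is.integer(hh[[nm]]),
  names(syntax)
)

# Execute the generated syntax for each qualified attribute.
for (nm in qualified) eval(parse(text = syntax[[nm]]))

# Frequencies before and after; the counts should line up code by label.
for (nm in qualified) {
  print(table(before[[nm]], useNA = "ifany"))
  print(table(hh[[nm]], useNA = "ifany"))
}
```

If a code in the data had no matching level, it would turn into NA after conversion, so a jump in the NA count between the two tables is the thing to watch for.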

What follows is the code for the entire “journey”. The technologies used are Microsoft R Open 3.4.3 and Jupyter Notebook.
