I met up with an old grad school friend a few weeks back. He's an accomplished professor of demography, basically the statistical study of populations, particularly surrounding births, deaths, and fertility. Where he once traveled extensively to conduct demographic surveys in remote parts of the world, my friend's now content to work with available U.S. census data to develop models and test hypotheses. Like me, he's a stats/techie guy accomplished in statistical packages such as SPSS, SAS, and Stata. He notes he's played around with R but was warned early on it couldn't scale for his demanding needs. The conversation then morphed into an engaging discussion about open source and the evolution of statistical software. My sense was that he was wary of the R naysayers, but found his statistical comfort zone hard to abandon.
Alas, though perhaps not appreciated in many statistical settings, the R of 2018 is quite a step up from the R of 2008, driven significantly by contributions such as tidyverse and data.table from its extensive open source ecosystem. Indeed, the R code I write today would be unrecognizable to a core R stickler of even 5 years ago. And as for R's very real RAM limitation, the 64 GB RAM, 2 TB SSD computer I purchased 2 years ago for $2500 has handled 95% of the challenges I've thrown at it with aplomb. The belligerent 5% I take to the cloud and Spark.
As our discussion progressed, we opted to "challenge" R with the type of work he now does in his research. We settled on the 5-year (2012-2016) ACS Public Use Microdata Sample (PUMS) files as our data source. The PUMS "are a sample of the actual responses to the American Community Survey and include most population and housing characteristics. These files provide users with the flexibility to prepare customized tabulations and can be used for detailed research and analysis. Files have been edited to protect the confidentiality of all individuals and of all individual households. The smallest geographic unit that is identified within the PUMS is the Public Use Microdata Area (PUMA)." Helpfully, there are freely available R packages, demography and censusapi, that make life a lot easier for R demographers.
The data we used consists of a series of household and population files. Each household has one or more persons (population), while each person belongs to one and only one household. It turns out that there are about 7.4M households consisting of 15.7M persons represented in this sample. A common key, "serialno", connects the data.tables. Our task was to load both household and population data, with the mandate of joining the two to build topic-oriented data sets on demand. The raw household and population data are downloaded from zip files consisting of four CSVs each.
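The household-to-population join on "serialno" can be sketched with data.table. This is not the blog's actual implementation; it's a minimal illustration using small synthetic tables, with made-up column names (tenure, agep) standing in for real PUMS variables. In practice the CSVs would first be extracted from the downloaded zips (e.g. unzip() followed by fread() and rbindlist()).

```r
library(data.table)

# Synthetic stand-ins for the PUMS household and population tables;
# only the common key "serialno" reflects the real data layout.
household <- data.table(
  serialno = c("H001", "H002", "H003"),
  tenure   = c(1L, 2L, 1L)                    # hypothetical housing attribute
)
population <- data.table(
  serialno = c("H001", "H001", "H002", "H003", "H003", "H003"),
  agep     = c(34L, 36L, 71L, 29L, 5L, 2L)    # hypothetical person ages
)

# Keyed join: each person row picks up its household's attributes.
setkey(household, serialno)
setkey(population, serialno)
joined <- household[population]

# Since every person belongs to exactly one household, the join
# preserves the person count.
nrow(joined)   # 6
```

The same pattern scales to the full 7.4M-household, 15.7M-person sample: keyed data.table joins are what make building topic-oriented data sets on demand practical at that size.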
The development environment consists of Jupyter Notebook with a Microsoft R kernel. R's tidyverse, data.table, and fst packages drive the data build. What follows is the implementation code.
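Since the full implementation code isn't reproduced here, a hedged sketch of the fst piece of the build may be useful: fst provides a fast, compressed on-disk format, so a table loaded once from CSV can be cached and reloaded quickly in later sessions. The table below is synthetic and the file name is illustrative, not from the original post.

```r
library(data.table)
library(fst)

# Hypothetical cache step: persist a table once, reload it cheaply later.
dt <- data.table(
  serialno = sprintf("H%04d", 1:1000),
  np       = sample(1:8, 1000, replace = TRUE)   # illustrative column
)

path <- file.path(tempdir(), "household.fst")
write_fst(dt, path, compress = 50)               # compressed binary store

dt2 <- read_fst(path, as.data.table = TRUE)      # near-instant reload
identical(dt$serialno, dt2$serialno)             # TRUE
```

For multi-gigabyte PUMS extracts, this CSV-once/fst-thereafter pattern is a common way to keep iterative notebook work responsive.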