I met up with an old grad school friend a few weeks back. He's an accomplished professor of demography, basically the statistical study of populations, particularly surrounding births, deaths, and fertility. Where he once traveled extensively to conduct demographic surveys in remote parts of the world, my friend's now content to work with available U.S. census data to develop models and test hypotheses. Like me, he's a stats/techie guy accomplished in statistical packages such as SPSS, SAS, and Stata. He notes he's played around with R but was warned early on it couldn't scale for his demanding needs. We then segued into an engaging discussion about open source and the evolution of statistical software. My sense was that he was wary of the R naysayers, but found his statistical comfort zone hard to abandon.

Alas, though perhaps not appreciated in many statistical settings, the R of 2018 is quite a step up over the R of 2008, driven significantly by contributions such as tidyverse and data.table from its extensive open source ecosystem. Indeed, the R code I write today would be unrecognizable to a core R stickler of even 5 years ago. And as for R's very real RAM limitation, the 64 GB RAM, 2 TB SSD computer I purchased 2 years ago for $2500 has handled 95% of the challenges I've thrown at it with aplomb. The belligerent 5% I take to the Cloud and Spark.

As our discussion progressed, we opted to "challenge" R with the type of work he now does in his research. We settled on the 5-year (2012-2016) ACS Public Use Microdata Sample (PUMS) files as our data source. The PUMS "are a sample of the actual responses to the American Community Survey and include most population and housing characteristics. These files provide users with the flexibility to prepare customized tabulations and can be used for detailed research and analysis. Files have been edited to protect the confidentiality of all individuals and of all individual households. The smallest geographic unit that is identified within the PUMS is the Public Use Microdata Area (PUMA)." Tellingly, there are freely available R packages, demography and censusapi, that make life a lot easier for R demographers.

The data we used consist of a series of household and population files. Each household has one or more persons (population), while each person belongs to one and only one household. It turns out that there are about 7.4M households consisting of 15.7M persons represented in this sample. A common key, "serialno", connects the data.tables. Our task was to load both household and population data, with the mandate of joining the two to build topic-oriented data sets on demand. The raw household and population data are downloaded from zip files consisting of 4 CSVs each.
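
To make the household-to-person relationship concrete, here is a minimal sketch of the join on toy data. It uses base R's merge rather than the data.table build described in this post, and every column other than the "serialno" key is a hypothetical stand-in, not an actual PUMS field.

```r
# Toy stand-ins for the PUMS household and population tables.
# Only the "serialno" key comes from the post; other columns are hypothetical.
households <- data.frame(
  serialno  = c(1001, 1002, 1003),
  hh_income = c(52000, 87000, 31000)
)
population <- data.frame(
  serialno = c(1001, 1001, 1002, 1003, 1003, 1003),
  age      = c(34, 36, 61, 29, 4, 2)
)

# Each person belongs to exactly one household, so an inner join on the key
# yields one row per person with that household's attributes attached.
persons_hh <- merge(population, households, by = "serialno")

# Sanity check: the join should neither drop nor duplicate persons.
rows_preserved <- nrow(persons_hh) == nrow(population)
```

The same merge(..., by = "serialno") call form also works on data.table objects, which is closer to how tables of this size would be handled in practice.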

The development environment consists of Jupyter Notebook with a Microsoft R kernel. R's tidyverse, data.table, and fst packages drive the data build. What follows is the implementation code.

Read the entire blog here.


Comment by steve miller on April 8, 2018 at 7:44am

Eric --

Thanks for the comment.

There have been quite a few performance enhancements in R over the last ten years, including the ascent of Rcpp, tidyverse, and data.table, as well as seamless multi-core processing. Have you used h2o for modeling? It's at least half an order of magnitude faster than most R modeling packages and can handle sizes that those packages can't even dream of.

Five years ago, I did much more C/Rcpp than I do now. I've also loaded up on RAM and SSDs with my current notebook. That seems a no-brainer.

I'm quite enthralled with functional programming, as demonstrated by this cell from the Medicare blog: 

storage <- map_df(names(medicarephysicians), function(nm, dts=medicarephysicians, dtf=medicarephysicianf)
    if (class(dts[[nm]])=="character") list(nm=nm, oss=object.size(dts[[nm]]), osf=object.size(dtf[[nm]]))) %>%


One reason you see less functional stuff and tidyverse "%>%" in these blogs is that I generally choose data.table calcs over tidyverse for performance. In the "smaller" stuff I do, tidyverse is more prevalent.

Like you, I've grown fonder of ggplot2 over time and it's now my go-to for R graphics, replacing lattice.

I'm also keen on the Python/R integration packages rpy2 and reticulate. It won't be long before you'll be able to seamlessly mix and match Python/R code in both R-markdown and Jupyter notebook.

I'm quite bullish on the future of Python/R for data science.

Comment by Eric Milgram on April 7, 2018 at 11:57am

Hi Steve,

I really enjoyed this post. This comment resonated especially strongly with me:

Alas, though perhaps not appreciated in many statistical settings, the R of 2018 is quite a step up over the R of 2008, driven significantly by contributions such as tidyverse and data.table from its extensive open source ecosystem. Indeed, the R code I write today would be unrecognizable to a core R stickler of even 5 years ago.

In your post, you demonstrate some excellent techniques for optimizing the performance of R for 'biggish' data by using a combination of techniques such as selecting appropriate packages and performing manual caching. I've been coding in R for over a decade, now.

During that time, my experience is that getting started with R is fairly easy, but optimizing R to have good performance with bigger datasets is not as easy as my intuition gained from my experience with other software platforms, such as C or C++, would lead me to believe.

The Rcpp package would have been very valuable to me a decade ago. Before it was available, when I needed to squeeze out every bit of performance my computer could muster, I spent a lot of tedious hours coding my own custom versions of R functions (usually in C). Usually, the tedium didn't arise from writing the C or C++ code. Often, that part was relatively trivial. The frustration was most commonly the result of trying to get my tool-chain working. The addition of Rtools for Windows coupled with Rcpp has made writing custom R C/C++ functions for performance enhancement a much less frustrating experience.

I watched as the Tidyverse came into being, and admittedly, I resisted it at first. I didn't really begin to embrace the newer paradigms enabled by the Tidyverse until about two years ago. Perhaps due to coincidence, I have not coded a custom function for performance in at least two years.

Your post was interesting to me because your detailed code example takes excellent advantage of newer packages for managing larger datasets. However, your code is very reminiscent of pre-Tidyverse code.

I'm curious what you think about the newer functional programming paradigm, such as the use of %>% operator.

I learned C before learning C++, and as a result, I believe that my transition to "thinking in C++" was delayed because I was so comfortable with C constructs, such as pointers and classic file I/O, which were readily available in C++. As a result, I didn't embrace operator overloading because I felt as though it created more obscurity than clarity. For example, I did not like the concept of modifying C++ I/O using stream manipulators as follows.

std::cout << "The number 0.01 in fixed: " << std::fixed << 0.01 << '\n';

Similarly, I didn't originally like how ggplot2 visualizations were created using a similar technique, where successive layers of a plot were built by adding objects that manipulated plot elements. I saw these motifs as more syntactic sugar than substance. However, in both C++ and R, I believe my code has improved once I embraced the newer paradigms. The reason for the improvement was not really performance related. C is very hard to beat for raw performance. Rather, using the newer paradigms, my code is much easier to read and debug now.

Just a few days ago, I was helping someone debug a visualization they'd built with a single statement in ggplot2. By breaking it down into multiple statements, I was able to identify quickly where they were having a problem and fix it.

Although I still catch myself occasionally reverting to base graphics or the more classic procedural programming paradigm when I'm in a hurry and doing something relatively simple, this happens less frequently as more time passes.

Thanks again for the excellent write-up!

Comment by steve miller on April 5, 2018 at 4:43am

Paul --

Thanks for the comment.

As you've probably surmised, I'm a big R fan. I'm also into Python and will write some Python blogs in the future.

I worked extensively with SAS from 1980 to 2000. At the time, it was the best statistical computing software available. I even used the SAS language in the early 90's for relational data mart ETL. The programming environment of data steps, procs, and macros now seems clunky and dated. I much prefer the more modern R, Python, and Julia.

Yet SAS is still the big gorilla, with installations in most large companies, governments, universities, etc. Kind of reminds me of COBOL earlier in my career: you may not like it but you can't avoid it. Fortunately, SAS now has both R and Python APIs.

SAS can certainly handle the data challenges I've tackled in recent blogs. But my SAS code often ends up ugly, driven by a macro glue language that gives me headaches. I find meta-data work much easier in R/Python. One challenge readily handled by R and Python but a struggle in SAS: given a data set (or data frame), produce a second data set with only the columns of the first that meet a pre-specified threshold percentage of not null.
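
As a rough illustration of that column-filtering challenge, here is a minimal base R sketch. The helper name keep_dense_columns, the toy data frame, and the 50% threshold are all hypothetical, chosen only for the example.

```r
# Hypothetical helper: keep only the columns whose share of non-missing
# values is at least `threshold` (a proportion between 0 and 1).
keep_dense_columns <- function(df, threshold = 0.5) {
  # is.na() on a data frame returns a logical matrix, so colMeans() of its
  # negation gives the proportion of non-missing values per column.
  pct_not_null <- colMeans(!is.na(df))
  df[, pct_not_null >= threshold, drop = FALSE]
}

# Toy data: "mostly_empty" is only 20% populated and should be dropped.
toy <- data.frame(
  id           = 1:5,
  mostly_full  = c(1, 2, NA, 4, 5),
  mostly_empty = c(NA, NA, NA, 4, NA)
)

dense <- keep_dense_columns(toy, threshold = 0.5)
kept  <- names(dense)
```

Because the filter is just a logical vector computed from the data itself, the same idea carries over to data.table or pandas with little change.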

Enterprise Guide is pretty nice, as is Enterprise Miner. If your company licenses SAS already, they may be decent choices for stats and ML. On the other hand, if you're starting from scratch on a limited budget, SAS might be a non-starter.

I've worked with Alteryx extensively and like it a lot. Though not as powerful as a full-blown ETL platform, Alteryx can handle many data challenges of "citizen data scientists". In the end, Alteryx generates R code for statistics/analytics -- a good thing.

I'm a programmer and so take that side in the programming vs drag-and-drop debate. I think drag-and-drop is more suitable for modeling than it is for data programming. Alas, in my 40 years of data/analytics, the data side has consumed 80% of my time. I'm playing with Trifacta now and, as mentioned, I like Alteryx.

Don't underestimate the power of the R and Python open source ecosystems. tidyverse and data.table in R along with Pandas for Python, all bounty from outside the cores, now drive data programming in those languages.

In sum, I prefer the on-the-rise open source platforms R and Python for statistical computing, and also side with programming vs drag-and-drop, especially on the data side.

But I acknowledge other, equally valid vantage points as well.

Comment by Paul Bremner on April 4, 2018 at 2:35pm

Steve, I enjoy reading these posts on R and what it can do (particularly liked the last one using Medicare data.)  I'm currently learning SAS (both programming and drag-and-drop apps) and will have to make some decisions about R/Python (and non-SAS Data Science platforms) down the road.

You mentioned that both you and your professor colleague know other packages like SAS, SPSS and Stata. As you work through these data sets I'd be curious whether either of you can speak to the time required to do these in R vs SAS (in the latter using either programming, or something like Enterprise Guide, Enterprise Miner or Visual Analytics/Statistics), or in any "DS platform" for that matter (e.g. RapidMiner, Alteryx, etc.). And, aside from the cost issue, do you feel there are things that can be done better in R with this sort of structured data, or that are simply not doable in SAS and the others? Look forward to your future work.
