I recently downloaded a 5 year Public Use Microsample (PUMS) from the latest release of the American Community Survey (ACS) census data. The data contain a wealth of demographic information on both American households and individuals (population). The final household and population data stores are quite large for desktop computing: household consists of almost 7.5M records with 233 attributes, while population is just under 15.8M cases and 286 variables.
In addition to enabling a wealth of demographic analyses, these census data are quite suitable for performance testing functions to ingest, munge, and deliver analytics. That is, if your computer has enough firepower: both R and Python-Pandas constrain data structures by the size of memory. Fortunately, my Wintel notebook, with 64 GB RAM and 2 TB disk/solid state storage, is up to hardware task here.
My focus with this blog is on determining how R's dataframe/data.table read and write capabilities measure up to 15+ GB of raw input. Working with data this size can often deliver more clear-cut benchmarks than smaller tests repeated and aggregated. Indeed oftentimes, as is the case in this notebook, the analyst can experience order of magnitude performance differences between various approaches.
In the analyses below, I contrast the elapsed time of writing the 18 GB population dataframe to OS files using three different csv functions, R's saveRDS function with and without compression, the interoperable feather library, and the nonesuch fst library. I then, in turn, read these just-produced OS files back into dataframes/datatables and compare timing results. Each of the seven read/write approaches produces files that are portable across disparate R platforms. The feather package, in addition, interoperates between R and Python-Pandas -- a major benefit.
At the conclusion of the performance tests, I outline a generic approach to efficiently sourcing data from R to work in both R and Python-Pandas platforms using functionality from a combination of the fst and feather packages. I demonstrate the approach in R and, using the nifty reticulate package, in Python-Pandas as well.
The technology used is JupyterLab 0.32.1, Anaconda Python 3.6.5, Pandas 0.23.0, and R 3.6.0. The R data.table, fst, feather, and reticulate packages are featured.
Read the entire post here.