
"Kindof" Big Data in R

Ask for feedback from just about any critic of the R statistical package and you’ll hear two consistent responses: 1) R is a difficult language to learn, and 2) R’s data size limitation to physical RAM consigns it to toy academic applications. Having worked with R for over 15 years, I probably shared those views to some extent early on, but no longer.

Yes, R’s a language where array-oriented processing and functional programming reign, but that’s a good thing and pretty much the direction of modern data science languages — think Python with Pandas, NumPy, SciPy, and scikit-learn. As for the memory limitation on data size, that’s much less onerous now than even five years ago.

I point this out as I develop a Jupyter Notebook on a nifty ThinkPad with 64 GB RAM and a 1 TB SSD that I purchased two years ago for $2500. That's roughly the cost of a year's license maintenance for a single seat of one of R's commercial competitors.

With my 64 GB RAM, I generally haven’t overwhelmed R except when I set out to do so — to find its limits. Indeed, when all is said and done, most consulting R analyses I’ve completed over recent years have been on final R data sizes of 2 GB or less.

My recommendations for R installations are SSDs along with 64 GB RAM for notebook computers, and SSDs plus 256 GB or more of RAM for servers. Memory is a good investment. Also, for legitimately large R-ecosystem data, analysts should certainly configure R interoperability with relational/analytic databases such as PostgreSQL or MonetDB, which can manage data that exceeds physical memory. The Spark/R collaboration also accommodates big data, as does Microsoft's commercial R server.
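As one hedged illustration of that interoperability, the standard DBI front end with the RPostgres driver lets R push heavy aggregation to the database and pull back only a small result set. The connection details, table name (ontime), and columns below are placeholders, not anything from the original post:

```r
library(DBI)

# placeholder credentials; substitute your own server and schema
con <- dbConnect(RPostgres::Postgres(),
                 host = "localhost", dbname = "flights",
                 user = "analyst", password = Sys.getenv("PGPASSWORD"))

# the database does the aggregation; R receives only the summary frame
delays <- dbGetQuery(con, "
  SELECT carrier, AVG(arrdelay) AS mean_arr_delay
  FROM ontime
  GROUP BY carrier")

dbDisconnect(con)
```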

Most R aficionados have been exposed to the on-time flight data that's a favorite for stress testing new packages. For me it's a double plus: lots of data plus alignment with an analysis "pattern" I noted in a recent blog post. The pattern involves multiple input files, each of which shares the same structure and encodes dimension information in its name.
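To make the pattern concrete, here's a small sketch of pulling the "dimensions" out of the file names themselves. The names shown are hypothetical stand-ins for the BTS downloads, which follow a similar year/month convention:

```r
library(stringr)

# hypothetical file names illustrating the year/month naming pattern
files <- c("On_Time_On_Time_Performance_2008_3.zip",
           "On_Time_On_Time_Performance_2008_12.zip")

# recover the trailing year and month encoded in each name
dims <- str_match(files, "_(\\d{4})_(\\d{1,2})\\.zip$")
data.frame(file  = files,
           year  = as.integer(dims[, 2]),
           month = as.integer(dims[, 3]))
```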

I took on the analysis to elucidate a strategy for loading "pretty large" data in R while showcasing favorite packages data.table, tidyverse, and the splendid new fst for "Lightning Fast Serialization of Data Frames for R". Incidentally, none of these packages is part of core R; they are the bounty of an exuberant R ecosystem. Sadly, some naysayers are still fighting the can't-do battle from ten years ago, before the ecosystem largesse exploded and changed the R munging landscape.

What follows is an illustration of using a functional programming style plus powerful package capabilities to combine in-memory and file processing. The focus is on data assembly/wrangling rather than broader data analysis. To that end, I downloaded 216 monthly on-time flight files from 2000 through 2017. Each of these consists of 110 attributes and about 500,000 records. With my 64 GB RAM, I was able to load only about two-thirds of the total data into memory at once. So my strategy became to handle three years at a time using data.table with tidyverse, then to offload to a compressed file with fst. At the end of the work stream, I'd created six fst files for subsequent use. From there I could exploit fst's speedy access to select a subset of columns and build a new data.table by stacking the files. The power and efficiency of data.table, tidyverse, and fst delivered.
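A minimal sketch of that first stage, assuming the monthly CSVs sit unzipped in a flights/ directory with the year in each name, might look like the following. The directory, output file names, and compression level are illustrative, not the post's actual code:

```r
library(data.table)
library(fst)
library(stringr)

# assumed layout: one CSV per month in a flights/ directory
csvs  <- list.files("flights", pattern = "\\.csv$", full.names = TRUE)
years <- as.integer(str_match(basename(csvs), "_(\\d{4})_")[, 2])

# three years per pass keeps the raw data comfortably inside 64 GB
for (start in seq(2000, 2015, by = 3)) {
  chunk   <- csvs[years >= start & years <= start + 2]
  flights <- rbindlist(lapply(chunk, fread), use.names = TRUE, fill = TRUE)
  write_fst(flights, sprintf("flights_%d_%d.fst", start, start + 2), compress = 75)
  rm(flights); gc()   # free the three-year block before the next pass
}
```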

Below is the code used to read the text files and write the fst files — and then to read and stack the fst data to produce sample projected data.tables. The top-level technologies are JupyterLab with an R 3.4 kernel.
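Since the full listing lives in the linked post, here is only a hedged sketch of that second stage: fst can read just the needed columns from each file, after which data.table's rbindlist stacks them into one table. The chosen columns are an illustrative projection of the roughly 110 attributes:

```r
library(data.table)
library(fst)

# the six fst files written in the first stage
fst_files <- list.files(".", pattern = "^flights_.*\\.fst$", full.names = TRUE)

# an illustrative subset of the ~110 attributes
keep <- c("Year", "Month", "DayofMonth", "Carrier", "Origin", "Dest",
          "DepDelay", "ArrDelay", "Cancelled")

flights <- rbindlist(
  lapply(fst_files, read_fst, columns = keep, as.data.table = TRUE),
  use.names = TRUE
)

dim(flights)   # all 18 years, projected to the chosen columns
```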

See the entire post here.