After posting my most recent blog, which used census data to illustrate handling "large" dataframes in R with the fst and feather file formats, I realized I could have taken the analysis a step further.

Recall my earlier description of the data as "demographic information on both American households and individuals (population). The final household and population data stores are quite large for desktop computing: household consists of almost 7.5M records with 233 attributes, while population is just under 15.8M cases and 286 variables."

In addition to working with the two individual dataframes/data.tables, one could also consider the merge of household and population. Each household has one or more population records; each population record, in turn, belongs to one and only one household. Indeed, there's an attribute, serialno, that can be used to join household and population into a data.table of almost 15.8M records and over 500 attributes, consuming in excess of 32GB of memory. This is quite large for desktop R.
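To make the shape of that join concrete, here is a minimal sketch in plain Python rather than R. Only the key name `serialno` and the row and column counts come from the description above; the miniature tables, the `rooms` and `age` fields, and the helpers `join_on_key` and `estimated_gb` are hypothetical, and the size estimate assumes `serialno` is the only column the two tables share and that every value takes 4 bytes.

```python
# Hypothetical miniature tables; only the key name `serialno` comes from the text.
household = [
    {"serialno": 1, "rooms": 5},
    {"serialno": 2, "rooms": 3},
]
population = [
    {"serialno": 1, "age": 34},
    {"serialno": 1, "age": 8},
    {"serialno": 2, "age": 71},
]


def join_on_key(parents, children, key):
    """Inner join where every child row matches exactly one parent row."""
    parent_by_key = {row[key]: row for row in parents}
    joined = []
    for child in children:
        parent = parent_by_key[child[key]]  # a KeyError here would mean an orphan record
        joined.append({**parent, **child})  # the shared key appears only once
    return joined


def estimated_gb(n_rows, n_cols, bytes_per_value):
    """Back-of-envelope size in decimal gigabytes, ignoring any overhead."""
    return n_rows * n_cols * bytes_per_value / 1e9


joined = join_on_key(household, population, "serialno")

# One output row per population record, and the columns add up minus the shared key.
print(len(joined), sorted(joined[0]))

# Assumption: 233 + 286 - 1 = 518 columns at 4 bytes each.
print(round(estimated_gb(15_800_000, 518, 4), 1))
```

The joined result has as many rows as population because each person belongs to exactly one household. Under the stated assumptions the estimate comes to about 32.7 GB, in the same range as the figure quoted above, though the real column types are not given and wider types would push it higher.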

So, I just had to attempt to produce such a structure, despite R's in-memory limitation, and then write fst and feather files for subsequent use. Alas, the attempt to produce the supersize data.table on my 64GB Wintel notebook failed with a memory allocation error (R's not the most efficient memory manager). My response? Get the join to work on a 128GB notebook, then save the resulting data in fst and feather files. Once produced, transport those files to the 64GB machine to see if they can be read. It turns out that approach works quite well.

The remainder of this notebook splits attention between the 128GB RAM and 64GB RAM machines, producing the joined data.table on the former and demonstrating access to the data from fst and feather files on the latter. Once the grand fst file is built, it can be redeployed to a smaller-memory machine and accessed like an in-memory data.table with projection and filtering. The performance is excellent.
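Why can a file too large to build on the smaller machine still be queried there? The sketch below is one way to picture it, using a toy column-oriented file in plain Python. It is emphatically not the fst format or its API: the layout, the functions `write_columnar` and `read_columnar`, and the `age` and `income` fields are all hypothetical. The point is only that when each column sits in its own contiguous block, projection (a few columns) and a row range can be read by seeking, without loading everything else.

```python
import json
import os
import struct
import tempfile
from array import array


def write_columnar(path, columns):
    """Write a dict of equal-length int lists, one contiguous block per column."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    header = json.dumps({"names": names, "n_rows": n_rows}).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header)))  # 4-byte header length
        f.write(header)
        for name in names:
            f.write(array("i", columns[name]).tobytes())


def read_columnar(path, wanted, start, stop):
    """Read only the `wanted` columns for rows [start, stop) by seeking."""
    itemsize = array("i").itemsize
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<I", f.read(4))
        meta = json.loads(f.read(header_len).decode("utf-8"))
        data_start = 4 + header_len
        out = {}
        for name in wanted:
            col_index = meta["names"].index(name)
            # Jump straight to the first requested row of this column.
            f.seek(data_start + (col_index * meta["n_rows"] + start) * itemsize)
            values = array("i")
            values.frombytes(f.read((stop - start) * itemsize))
            out[name] = values.tolist()
        return out


path = os.path.join(tempfile.mkdtemp(), "toy.col")
write_columnar(path, {
    "serialno": [1, 2, 3, 4, 5],
    "age": [34, 36, 8, 71, 29],
    "income": [50, 60, 0, 20, 40],
})

# Projection: two of three columns. Row range: rows 1 to 3 (zero-based, stop exclusive).
subset = read_columnar(path, ["serialno", "age"], 1, 4)

# Filtering is then applied to the small in-memory subset.
adults = [s for s, a in zip(subset["serialno"], subset["age"]) if a >= 18]
print(subset, adults)
```

The `income` column is never read from disk here, which is the property that lets a smaller-memory machine work with a subset of a very wide table.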

The technology used is JupyterLab 0.35.4, Anaconda Python 3.7.3, Pandas 0.24.2, and R 3.6.0. The R tidyverse, data.table, fst, and feather packages are featured.

See the entire blog here.

© 2019 Data Science Central ®
