Ask for feedback from just about any critic of the R statistical package and you'll hear two consistent responses: 1) R is a difficult language to learn, and 2) R's data size limitation to physical RAM consigns it to toy academic applications. Having worked with R for over 15 years, I probably shared those views to some extent early on, but no longer.

Yes, R's a language where array-oriented processing and functional programming reign, but that's a good thing and pretty much the direction of modern data science languages -- think Python with Pandas, NumPy, SciPy, and scikit-learn. As for the memory limitation on data size, that's much less onerous now than even five years ago.

I point this out as I develop a Jupyter Notebook on a nifty ThinkPad with 64 GB RAM and a 1 TB SSD that I purchased two years ago for $2,500. That's roughly the annual license maintenance fee for a single seat of one of R's commercial competitors.

With my 64 GB RAM, I generally haven't overwhelmed R except when I set out to do so -- to find its limits. Indeed, when all is said and done, most consulting R analyses I've completed over recent years have been on final R data sizes of 2 GB or less.

My recommendations for R installations are SSDs plus 64 GB RAM for notebook computers, and SSDs plus 256 GB or more of RAM for servers. Memory is a good investment. Also, for legitimately large data in the R ecosystem, analysts should certainly configure R interoperability with relational/analytic databases such as PostgreSQL or MonetDB, which are not bound by R's in-memory model. The Spark/R collaboration also accommodates big data, as does Microsoft's commercial R server.
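As a minimal sketch of the database-interoperability idea, here is what delegating an aggregation to PostgreSQL from R might look like using the DBI and RPostgres packages. The connection parameters, table name, and column names are placeholders, not details from the post:

```r
# Sketch: push heavy lifting to PostgreSQL so only a small result returns to R.
# Connection details and the 'ontime' table/columns are illustrative assumptions.
library(DBI)

con <- dbConnect(RPostgres::Postgres(),
                 dbname = "flights", host = "localhost",
                 user = "analyst", password = Sys.getenv("PGPASSWORD"))

# The database does the group-by; R receives only the aggregated rows.
delays <- dbGetQuery(con, "
  SELECT carrier, AVG(arr_delay) AS mean_delay
  FROM ontime
  GROUP BY carrier")

dbDisconnect(con)
```

The design point is that the database, not R, holds the full data set; R sees only query results sized to fit comfortably in memory.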

Most R aficionados have been exposed to the on-time flight data that's a favorite for stress-testing new packages. For me it's a double win: lots of data plus alignment with an analysis "pattern" I noted in a recent blog. The pattern involves multiple input files, each of which has the same structure and also dimension information encoded in its name.
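The "dimension information encoded in the name" idea can be sketched as follows. The file-name convention here is a hypothetical one, assumed for illustration only:

```r
# Sketch: recover year and month dimensions from file names.
# The naming convention 'On_Time_<year>_<month>.csv' is an assumption.
files <- c("On_Time_2016_12.csv", "On_Time_2017_1.csv")

meta <- data.frame(
  file  = files,
  year  = as.integer(sub(".*_(\\d{4})_\\d{1,2}\\.csv$", "\\1", files)),
  month = as.integer(sub(".*_\\d{4}_(\\d{1,2})\\.csv$", "\\1", files))
)
```

Since every file shares the same structure, the only per-file metadata needed at load time is what the name itself supplies.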

I took on the analysis to elucidate a strategy for loading "pretty large" data in R while showcasing favorite packages data.table, tidyverse, and the splendid new fst for "Lightning Fast Serialization of Data Frames for R". Incidentally, none of these packages is part of core R; each is the bounty of an exuberant R ecosystem. Sadly, some naysayers are still fighting the can't-do battle from ten years ago -- before the ecosystem largesse exploded and changed the R munging landscape.

What follows is an illustration of using a functional programming style plus powerful package capabilities to combine in-memory and file processing. The focus is on data assembly/wrangling rather than broader data analysis. To that end, I downloaded 216 monthly on-time flight files from 2000 through 2017. Each consists of 110 attributes and about 500,000 records. With my 64 GB RAM, I was able to load only about two-thirds of the total data into memory at once. So my strategy became to handle three years at a time using data.table with tidyverse, then to offload each chunk to a compressed file with fst. At the end of the work stream, I'd created six fst files for subsequent use. From there I could exploit fst's speedy access to select a subset of columns and build a new data.table by stacking the files. The power and efficiency of data.table, tidyverse, and fst delivered.
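The three-years-at-a-time load-and-offload loop might be sketched as below. The file naming, the 85% compression setting, and the grouping logic are illustrative assumptions, not the post's actual code:

```r
# Sketch: read 18 years of monthly CSVs in six three-year chunks,
# offloading each chunk to a compressed fst file.
# File names like 'ontime_2000_1.csv' are an assumed convention.
library(data.table)
library(purrr)
library(fst)

years <- split(2000:2017, rep(1:6, each = 3))   # six groups of three years

walk(seq_along(years), function(i) {
  files <- as.vector(outer(years[[i]], 1:12,
                     function(y, m) sprintf("ontime_%d_%d.csv", y, m)))
  chunk <- rbindlist(map(files, fread), use.names = TRUE, fill = TRUE)
  write_fst(chunk, sprintf("ontime_chunk_%d.fst", i), compress = 85)
  rm(chunk); gc()   # release memory before the next three-year slice
})
```

The functional style -- `walk` over chunk indices, `map` over files -- keeps the loop body small and makes the memory lifecycle of each chunk explicit.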

Below is the code used to read the text files and write the fst files -- and then to read and stack the fst data to produce sample projected data.tables. The top-level technologies are JupyterLab with an R 3.4 kernel.
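A minimal sketch of that final read-and-stack step, exploiting fst's column-subset access, might look like the following. The file pattern and column names are assumptions for illustration:

```r
# Sketch: read only the needed columns from each fst file,
# then stack them into a single data.table.
library(data.table)
library(fst)

cols <- c("Year", "Month", "Carrier", "ArrDelay")   # illustrative column subset
fstfiles <- list.files(pattern = "^ontime_chunk_.*\\.fst$")

ontime <- rbindlist(
  lapply(fstfiles, function(f) read_fst(f, columns = cols, as.data.table = TRUE))
)
```

Because fst reads are column-selective, projecting four columns from six large files is far cheaper than reloading the full 110-attribute data.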

See the entire post, here.

© 2019 Data Science Central
