
PostgreSQL, MonetDB, and Too-Big-for-Memory Data in R -- Part I

In a blog from three months ago, I wrote about "kind of" big data in R. The "kind of" was meant as a caveat that data size in R is limited by RAM. I also mentioned the potential of working with relational data stores for even larger data, and made a vague proposal to demonstrate R/database interoperability in a future article. Well, the future is now, so here's an attempt to deliver on that promise.

I have two RDBMSs installed on my notebook. The first is the ubiquitous PostgreSQL, my preferred open source, row-based, relational database. The second is MonetDB, an under-the-radar columnar database.

I love both databases for different reasons. PostgreSQL is my go-to for evaluating interoperability with other analytics products such as R, SAS, Python, Tableau, and Power BI. And I've had good success working with MonetDB for some of my more intense analytics needs.

For my PostgreSQL servers, I primarily use the DBI and RPostgreSQL R libraries for access. Indeed, outside the graphical pgAdmin interface, most of my PostgreSQL work is via either RPostgreSQL or the psycopg2 package for Python.
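As a minimal sketch, here's what that access looks like; the host, database name, and credentials below are placeholders, not a real configuration:

    library(DBI)
    library(RPostgreSQL)

    # placeholder connection parameters -- substitute your own
    pgcon <- dbConnect(
      PostgreSQL(),
      host = "localhost",
      dbname = "analytics",
      user = "postgres",
      password = "postgres"
    )

    dbGetQuery(pgcon, "select version()")   # round-trip sanity check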

MonetDB capabilities for R come in two flavors: the first is through the package MonetDB.R that connects to an installed MonetDB server. For an established database, MonetDB.R allows the R developer to execute SQL commands, push and pull data from MonetDB to R, load CSV data into a database, etc. With these capabilities, a programmer can work around R's memory limitations, subsetting larger data into R structures as needed. I've exercised MonetDB.R quite a bit and am a fan.
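A rough sketch of that workflow, assuming a MonetDB server is up and running; the connection parameters, table, and file names are illustrative:

    library(DBI)
    library(MonetDB.R)

    # placeholder connection parameters for a running MonetDB server
    mcon <- dbConnect(
      MonetDB.R(),
      host = "localhost",
      dbname = "demo",
      user = "monetdb",
      password = "monetdb"
    )

    # run SQL server-side and pull only the result into R
    dbGetQuery(mcon, "select count(*) from hmda")

    # bulk-load a delimited file into a new table with MonetDB.R's helper
    monetdb.read.csv(mcon, "hmda_2013_f.csv", "hmda_2013")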

More lightweight and easier to use though is MonetDBLite for R, "a fully embedded version of MonetDB that installs like any other R package. The database runs within the R process itself (like its namesake SQLite), greatly improving efficiency of data transfers. The package is available for Linux, Mac OS X and Windows (64 bit)." Thus, with MonetDBLite, a MonetDB server needn't be present; only R packages DBI and MonetDBLite have to be installed. An additional benefit of RPostgreSQL, MonetDB.R, and MonetDBLite is the support for Hadley Wickham's powerful dplyr data management package commands in tandem with SQL.
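Getting started with MonetDBLite is correspondingly simple; a minimal sketch, with the database directory a placeholder:

    library(DBI)
    library(MonetDBLite)

    # the embedded database lives in an ordinary directory -- no server needed
    mlcon <- dbConnect(MonetDBLite::MonetDBLite(), "/data/hmdadb")

    dbListTables(mlcon)

    # when finished: dbDisconnect(mlcon, shutdown = TRUE)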

For testing the capabilities of these databases in analytics, it's important to see the db serve data that at least starts out too large for memory. There's no reason to test database analytics capabilities in the MB size -- I know full well that such MB data will work fine in an R data.table or Pandas dataframe, without the support of a DBMS. I want my notebook to be uncomfortable -- to have to deliver on data larger than RAM.

For this series of blogs showing database access in R, I'm using the RPostgreSQL package on the PostgreSQL side, and MonetDBLite for MonetDB. dplyr and dbplyr from tidyverse extend the PostgreSQL/MonetDBLite capabilities.

A 128M+ record, 70-attribute Home Mortgage Disclosure Act (hmda) data set, conveniently available from the MonetDB web site, drives the tests. After downloading and unzipping the six annual "f" files, I was ready to go to work.
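For those following along, a rough download-and-unzip sketch; the base URL is a placeholder (the actual links are on the MonetDB site), and the year range is an assumption about the six "f" files:

    # placeholder URL and file-name pattern -- substitute the real links
    base  <- "https://example.org/hmda/"
    files <- sprintf("hmda_%d_f.zip", 2008:2013)

    for (f in files) {
      download.file(paste0(base, f), f, mode = "wb")
      unzip(f)
    }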

This first blog of the two-part series focuses on loading the raw hmda data into PostgreSQL and MonetDBLite respectively. Each of the six raw data files is first read, munged, and written to an R fst file for fast subsequent access. Once this process is complete, the dbWriteTable functions in RPostgreSQL and MonetDBLite push the data to the respective databases. While using database-specific bulk loaders would yield the fastest load times, first converting the csv data to fst format, then using the same load function for both databases, simplifies the work -- and is fast enough.
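A condensed sketch of that pipeline for a single annual file, reusing the pgcon and mlcon connections from the earlier snippets; file, table, and column names are illustrative:

    library(data.table)
    library(fst)

    # read one raw annual file and apply a sample munging step
    hmda <- fread("hmda_2013_f.csv")
    setnames(hmda, tolower(names(hmda)))

    # persist to fst for fast subsequent reads
    write_fst(hmda, "hmda_2013.fst")
    hmda <- read_fst("hmda_2013.fst", as.data.table = TRUE)

    # the same dbWriteTable call pushes to either backend
    dbWriteTable(pgcon, "hmda", hmda, append = TRUE, row.names = FALSE)
    dbWriteTable(mlcon, "hmda", hmda, append = TRUE)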

The remainder of this blog revolves around connecting to the respective data stores, then loading the data from delimited files, first into fst format, then into the PostgreSQL and MonetDB data stores in turn. A follow-up article will take a deep dive into the data, demonstrating SQL and dplyr/dbplyr "tbl" queries to build subsetted data.frames/data.tables and tabulate attribute frequencies. Important to note: the data need be loaded into the databases just once; it persists there and can be accessed repeatedly for subsequent analyses.
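As a preview of those "tbl" queries, a minimal dplyr/dbplyr sketch against the MonetDBLite connection; the attribute name as_of_year is an assumption about the hmda schema:

    library(dplyr)
    library(dbplyr)

    # a lazy table reference -- the query executes in the database, not in R
    hmda_tbl <- tbl(mlcon, "hmda")

    hmda_tbl %>%
      count(as_of_year) %>%   # frequencies computed server-side
      collect()               # pull only the small result into R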

Read the entire blog here.


Comment by steve miller on June 7, 2018 at 11:50am

Pascal -- Thanks much for the update. I wasn't aware of MonetDB's embedding capability.

I've embedded R in PostgreSQL and am pretty sure it also works with Oracle, SQL Server, and others.

I'm apprehensive about committing to a specific database for R work, however, even though "optimal" performance may follow. My commitment is to R, and I prefer to be able to readily switch out database backends (e.g., MySQL). My code in both parts of the series will hopefully demonstrate that the R/database code can be made pretty seamless. I've actually added MariaDB to the tests, though I won't report on it.

Thanks again for your note.

Comment by Pascal Heus on June 7, 2018 at 11:27am

Great to see MonetDB mentioned here, it is indeed a hidden gem. A couple of things I would like to add to your list of features.

One is that you can also run R embedded in MonetDB, which alleviates the need to move data back and forth. This was introduced in late 2014 (https://www.monetdb.org/content/embedded-r-monetdb). MonetDBLite for R is a great option, but if you run your server standalone this can be useful.
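To make that concrete, a hedged sketch of defining an R function inside MonetDB from an R session; it assumes a connection mcon to a standalone server built with the embedded-R extension, and the function and column names are illustrative:

    # dbSendUpdate is MonetDB.R's helper for non-SELECT statements
    dbSendUpdate(mcon, "
        CREATE FUNCTION r_times_two(i INTEGER) RETURNS INTEGER LANGUAGE R {
            i * 2
        }")

    # the R function now executes inside the database engine
    dbGetQuery(mcon, "SELECT r_times_two(loan_amount) FROM hmda LIMIT 5")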

MonetDB is also one of the rare RDBMSs that does not limit the number of columns per table. It is not a widely known fact, but most SQL servers won't allow you to create tables with more than a thousand columns or so (we have documented some examples at http://goo.gl/639Rju). While such datasets are rare and typically result from heavy denormalization, they can be challenging to load into most SQL servers (you can split across multiple tables and have an overarching view, or renormalize them).
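A quick way to see the difference, sketched against the connections from the post above; wide_test is a throwaway table name, and PostgreSQL's documented ceiling is 1,600 columns:

    # generate DDL for a 2,000-column table
    cols <- paste(sprintf("c%04d integer", 1:2000), collapse = ", ")
    ddl  <- sprintf("create table wide_test (%s)", cols)

    dbExecute(mlcon, ddl)    # MonetDB accepts the wide table
    # dbExecute(pgcon, ddl)  # PostgreSQL errors: tables can have at most 1600 columns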

Thanks for your post!
