I enjoy data prep munging for analyses with computational platforms such as R, Python-Pandas, Julia, Apache Spark, and even relational databases. The wrangling cycle provides the opportunity to get a feel for and preliminarily explore data that are to be later analyzed/modeled.
A critical task I prefer handling in computation over database is data loading. This is because databases generally demand the table be created before records are inserted, while computational platforms can create/load data structures simultaneously, inferring attribute data types on the fly (these datatypes can also be overridden in load scripts). When data are sourced from files available on the web, the latter can afford a significant work savings. Examples of such web data relevant for me currently are Census and Medicare Provider Utilization and Payment. Each of these has both many records and attributes.
A challenge I recently gave myself was loading a 15.8M record, 286 attribute census data file into PostgreSQL on my notebook. It's easy to work with the PostgreSQL data loading capability after the table of 286 columns is created, but how to easily formulate the create table statement?
One possibility is to use the database connect features provided by the computational platforms. Both Python-Pandas and R offer PostgreSQL libraries, and each supports copying dataframes directly to PostgreSQL tables. Alas, this option is only resource-feasible for small structures. There are also adjunct libraries such as sqlalchemy and d6tstack available for Python that enhance dataframe to PostgreSQL copy performance, but for "large" data these ultimately disappoint as well.
As I struggled with divining a workable solution, the idea occurred to me to consider the computational copy feature to create the table with a very small subset of the data, then bulk load the data files into the database using efficient PostgreSQL copy commands. Turns out I was able to combine these computational and database load capabilities for a workable solution.
The strategy I adopted is as follows: 1) use Python-Pandas and R-data.table to load a small subset of data into dataframes to determine datatypes; 2) leverage that dataframe/data.table to create a relational database table in PostgreSQL; 3) generate bulk load sql copy commands along with shell scripts based on meta-data and csv files; 4) execute the shell scripts using a variant of a system command to load data with the efficient copy statements.
Up first is Python-Pandas, presented below. A proof of concept only, there's no exception/error handling in the code. Hopefully, the ideas presented resonate.
The technology used is JupyterLab 0.35.4, Anaconda Python 3.7.3, Pandas 0.24.2, sqlalchemy 1.3.1, psycopg2 2.8.3, and d6tstack 0.1.9.
Read the entire blog here