Summary: It's no secret that Python-Pandas is central to data management for analytics and data science today. Indeed, what we're seeing now is Pandas being extended to handle ever-larger data. Underappreciated is that Pandas is a tunable platform, supporting its own datatypes as well as those from numerical library Numpy. Together, these comprise a quite granular collection of dataframe column options -- much more than R. Alas, most Pandas programmers don't take advantage of these possibilities, instead accepting the hefty defaults that accompany the read_csv function. These defaults might not be malevolent for record counts in the 10,000's, but could be ruinous for multi-million row, multi-hundred column dataframes. In this blog, I consider a 15.5+ million record, 250+ attribute census data set, contrasting the results of default data loading with optimally "downcast" attribute sizes. The findings, detailed below, are quite striking. An intermediate level of Python/Pandas programming sophistication is assumed of readers.
A few weeks ago, I was searching stackoverflow to see if/how to implement 16 and 8-bit integers in Pandas, when I came across a snarky exchange between two commenters on the need/utility of the granular datatypes I was researching. The "con" side, which argued against obsession with detailed data types, was no doubt accustomed to working with small data sets, where the difference in dataframe RAM between 16 and 32-bit attributes is measured in bytes, not megabytes. I could only smile as I downloaded the latest 5 year PUMS files from the US census. csv_hus.zip and csv_pus.zip are each comprised of 4 text files -- the first of data from a sample of US households, the second of a sample from the US population. Both files are in multi-million record, multi-hundred column size range. The difference between a 2-byte and 4-byte column for almost 16M rows is 30 MB RAM!
I highly recommend that data nerds look at PUMS data. In addition to its utility for assessing management challenges, the data provides a fascinating look into US households and the population. In fact, this is the same data used by researchers at think tanks like Brookings for their demographic forecasts. Incidentally, the analysis of PUMA's "special non-overlapping areas that partition each state into contiguous geographic units containing no fewer than 100,000 people each." confirms the striking economic differences in geographies now occupying our political discourse.
My self-directed task for this blog was to load the 4 files comprising the population data into a Pandas dataframe, and then to explore options for optimizing storage. The point of departure is a naive load of a Pandas dataframe, followed by application of a "downcast" function on relevant columns to reduce storage. I then outline a proof of concept to determine downcasting options based on a random sampling of records. My findings: the difference between default and downcasted loads in this instance is stark: 33 GB vs 10 GB!
The takeaway is that it's important to be fastidious about column types/size when loading large data files into Pandas. In a follow-up piece, I'll discuss the handling of character/category variables in Pandas, and how they relate to dataframe size/performance.
The technology used is Wintel 10 with 128 GB RAM, along with JupyterLab 1.2.4 and Python 3.7.5, plus foundation libraries Pandas 0.25.3 and Numpy 1.16.4. Feather 0.4.0 handles efficient disk storage of the dataframes.
See the remainder of the blog here.