Dataframe Storage Efficiency in Python-Pandas

Summary: It's no secret that Python-Pandas is central to data management for analytics and data science today. Indeed, Pandas is now being extended to handle ever-larger data. What's underappreciated is that Pandas is a tunable platform, supporting its own datatypes as well as those from the numerical library NumPy. Together, these comprise quite a granular collection of dataframe column options -- far more than R offers. Alas, most Pandas programmers don't take advantage of these possibilities, instead accepting the hefty defaults that accompany the read_csv function. Those defaults may be harmless for record counts in the 10,000s, but can be ruinous for multi-million row, multi-hundred column dataframes. In this blog, I consider a 15.5+ million record, 250+ attribute census data set, contrasting the results of a default data load with optimally "downcast" attribute sizes. The findings, detailed below, are quite striking. An intermediate level of Python/Pandas programming sophistication is assumed of readers.

A few weeks ago, I was searching Stack Overflow to see if/how 16 and 8-bit integers could be implemented in Pandas, when I came across a snarky exchange between two commenters on the need/utility of the granular datatypes I was researching. The "con" side, which argued against obsession with detailed data types, was no doubt accustomed to working with small data sets, where the difference in dataframe RAM between 16 and 32-bit attributes is measured in bytes, not megabytes. I could only smile as I downloaded the latest 5-year PUMS files from the US Census. csv_hus.zip and csv_pus.zip each comprise 4 text files -- the former holding data on a sample of US households, the latter on a sample of the US population. Both are in the multi-million record, multi-hundred column range. The difference between a 2-byte and a 4-byte column across almost 16M rows is roughly 30 MB of RAM!
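To see where that 30 MB figure comes from, here's a small illustration (synthetic data, not the PUMS files themselves) comparing the same 15.5 million values stored at three integer widths:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a PUMS-sized column: 15.5M values
# held as int64 (the read_csv default), int32, and int16.
n = 15_500_000
df = pd.DataFrame({
    "as_int64": np.zeros(n, dtype="int64"),
    "as_int32": np.zeros(n, dtype="int32"),
    "as_int16": np.zeros(n, dtype="int16"),
})

# Per-column memory in MB.
mb = df.memory_usage(index=False) / 2**20
print(mb.round(1))
```

The int32 column needs about 59 MB and the int16 column about 30 MB -- a ~30 MB saving from a single column, repeated across every one of the data set's hundreds of attributes.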

I highly recommend that data nerds look at PUMS data. In addition to its utility for assessing data-management challenges, it provides a fascinating look into US households and the population. In fact, this is the same data used by researchers at think tanks like Brookings for their demographic forecasts. Incidentally, analysis of PUMAs, "special non-overlapping areas that partition each state into contiguous geographic units containing no fewer than 100,000 people each," confirms the striking economic differences among the geographies now occupying our political discourse.

My self-directed task for this blog was to load the 4 files comprising the population data into a Pandas dataframe, then to explore options for optimizing storage. The point of departure is a naive load into a Pandas dataframe, followed by application of a "downcast" function on relevant columns to reduce storage. I then outline a proof of concept that determines downcasting options from a random sample of records. The findings are stark: the default load consumes 33 GB of RAM, the downcast version just 10 GB!
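A minimal sketch of such a downcast function -- not the author's actual code, just the general technique using `pd.to_numeric` -- looks like this:

```python
import numpy as np
import pandas as pd

def downcast(df):
    """Return a copy of df with numeric columns cast to the smallest
    dtype that holds their values. Integer downcasts are lossless;
    float64 -> float32 can lose precision, so inspect your data first."""
    out = df.copy()
    for col in out.select_dtypes(include="integer"):
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include="float"):
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy example standing in for a PUMS load: an age-like integer
# column and a weight-like float column, 1M rows each.
df = pd.DataFrame({
    "age": np.random.randint(0, 100, 1_000_000),  # fits in int8
    "wt":  np.random.rand(1_000_000),             # float64 by default
})
small = downcast(df)
print(df.memory_usage(index=False).sum() / 2**20,
      small.memory_usage(index=False).sum() / 2**20)
```

The sampling proof of concept follows the same idea: read a modest sample of rows with read_csv, downcast it, then pass the resulting column dtypes to read_csv's `dtype=` parameter for the full load, so the big file never materializes at 64-bit widths.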

The takeaway is that it's important to be fastidious about column types/size when loading large data files into Pandas. In a follow-up piece, I'll discuss the handling of character/category variables in Pandas, and how they relate to dataframe size/performance.
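As a preview of that follow-up, here's a hypothetical sketch of what Pandas categories can do for a repetitive string column (made-up state codes, not the census data itself):

```python
import pandas as pd

# 1M rows of short, highly repetitive strings: as a plain object
# column vs the same data converted to a Pandas category.
s_obj = pd.Series(["CA", "TX", "NY", "FL"] * 250_000)
s_cat = s_obj.astype("category")

print(s_obj.memory_usage(deep=True) / 2**20)
print(s_cat.memory_usage(deep=True) / 2**20)
```

With only 4 distinct levels, the category version stores 1-byte codes plus a tiny lookup table, shrinking the column by well over an order of magnitude relative to the object dtype.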

The technology used is Windows 10 on an Intel box with 128 GB RAM, along with JupyterLab 1.2.4 and Python 3.7.5, plus foundation libraries Pandas 0.25.3 and NumPy 1.16.4. Feather 0.4.0 handles efficient disk storage of the dataframes.

See the remainder of the blog here.





Comment by Lance Norskog on February 24, 2020 at 12:48pm

I have not used VAEX, just saw it in the blizzard of numerical libraries :)

Comment by steve miller on February 22, 2020 at 6:01am

thanks for bringing VAEX to my attention. have you worked with it? is it stable? performant? I'll definitely take a look.

Comment by steve miller on February 22, 2020 at 5:59am

I'm addressing category vars in a future blog. There can be storage and performance advantages for both R factors and Pandas categories even when the number of levels is in the thousands.

Comment by Lance Norskog on February 21, 2020 at 7:38pm

At some point VAEX might be what you want: Pandas API for data stored directly in memory-mapped files.


Comment by Subhash mantha on February 21, 2020 at 8:02am

There are additional benefits to identifying a column as categorical when it has repeating values and the number of distinct values is small. Most of the time, the size of numeric data types defaults to the bit width of the OS, so downcasting helps.

