Several years ago, I came across a report that led me to a new data set revolving on Medicare payment and utilization for physicians and other health care providers. The data created quite a storm, highlighting what appear to be extreme payments to individual physicians and practices. In 2012, just 100 physicians received 160 M in Medicare reimbursements, with one Florida ophthalmologist netting over 21 M.
To get a feel for the data, I confirmed much of the analyses that were published early on by the mainstream media. Over time, my interest in the data grew, as additional annual files were added. At this point, there are four text files, 2012-2015, each in excess of 9 M records. An analysis pattern for these data can be summarized as follows:
1) readily downloadable files, generally delimited or xls. These files can either be copied by hand or moved programmatically. 2) multiple files, often dimensioned by time or other variable(s). 3) a common, consistent format to the files, so that "reads" will work similarly on each and the data can be "stacked". 4) a structured file naming convention, either given or assigned, that provides dimensional info for data loads. Date/Time is the most common cut.
My interest in the data has as much to do with its format as it's content. Those familiar with R have not doubt been exposed to the factor data type, used to store categorical or ordinal data. Factors consist of levels and labels, and are represented as one integer per record signifying the level and pointing to the relevant character label that is stored only once. Factors "compete" with character attributes, since it's generally true that a factor can be stored as a character and vice-versa. Historically, factors have been used mostly to represent dimensional attributes such as gender, race, or income category, but in theory at least, there may be an opportunity for factor variables to save storage for any character column where there are relatively few unique column values compared to the total number of records.
With over 37 M records and 30 attributes as of today, the size alone of this data creates challenges which can help answer questions that smaller fabricated data sets cannot. And many of the attributes such as name, street address, and city are inherently character, allowing testing as to how they might optimally be stored. So for me, the decision to experiment with the character storage options was a no-brainer.
To conduct the tests, I downloaded the four annual files to my notebook, developing scripts using Jupyter Notebook and Microsoft R 3.4.3. With the R data.table package, I created two versions of the medicarephysician structure, the first storing character columns as character, the second storing the same as factors. I compared the memory requirements of each data.table, and contrasted performance/size with writing/reading of serializable export files.
The results follow.....
Read the entire article here.