Several years ago, I came across a report that led me to a new data set on Medicare payment and utilization for physicians and other health care providers. The data created quite a storm, highlighting what appeared to be extreme payments to individual physicians and practices. In 2012, just 100 physicians received $160 M in Medicare reimbursements, with one Florida ophthalmologist netting over $21 M.

To get a feel for the data, I confirmed many of the analyses published early on by the mainstream media. Over time, my interest in the data grew as additional annual files were added. At this point, there are four text files, one for each year from 2012 through 2015, each in excess of 9 M records. An analysis pattern for these data can be summarized as follows:

1) Readily downloadable files, generally delimited or xls. These files can either be copied by hand or moved programmatically.
2) Multiple files, often dimensioned by time or other variable(s).
3) A common, consistent format to the files, so that "reads" will work similarly on each and the data can be "stacked".
4) A structured file naming convention, either given or assigned, that provides dimensional info for data loads. Date/time is the most common cut.
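The pattern above can be sketched in base R roughly as follows. This is a minimal illustration, not the scripts used for the article: the `medicare_YYYY.txt` file names, the three columns, and the `read_one` helper are all hypothetical, and the tiny files are written to a temporary directory only so the example runs on its own.

```r
# Hypothetical file names and columns; the real annual files are far larger.
tmp_dir <- file.path(tempdir(), "medicare_demo")
dir.create(tmp_dir, showWarnings = FALSE)

# Write two tiny tab-delimited stand-ins that share one layout.
for (yr in 2012:2013) {
  demo <- data.frame(npi = c("1003000126", "1003000134"),
                     city = c("MIAMI", "TAMPA"),
                     payment = c(100.5, 200.25) + (yr - 2012),
                     stringsAsFactors = FALSE)
  write.table(demo, file.path(tmp_dir, sprintf("medicare_%d.txt", yr)),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Locate the files through their naming convention.
files <- sort(list.files(tmp_dir, pattern = "^medicare_[0-9]{4}\\.txt$",
                         full.names = TRUE))

# Read one file and recover the year dimension from its name.
read_one <- function(path) {
  dat <- read.delim(path, colClasses = c(npi = "character"),
                    stringsAsFactors = FALSE)
  dat$year <- as.integer(sub("^medicare_([0-9]{4})\\.txt$", "\\1",
                             basename(path)))
  dat
}

# The same read works on every file, so the results can be stacked.
stacked <- do.call(rbind, lapply(files, read_one))
str(stacked)
```

With data.table, `fread` and `rbindlist` would play the roles of `read.delim` and `rbind` here; base R is used only to keep the sketch free of dependencies.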

My interest in the data has as much to do with its format as its content. Those familiar with R have no doubt been exposed to the factor data type, used to store categorical or ordinal data. Factors consist of levels and labels, and are represented as one integer per record signifying the level and pointing to the relevant character label, which is stored only once. Factors "compete" with character attributes, since it's generally true that a factor can be stored as a character and vice versa. Historically, factors have been used mostly to represent dimensional attributes such as gender, race, or income category, but in theory at least, there may be an opportunity for factor variables to save storage for any character column where there are relatively few unique column values compared to the total number of records.
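A small base R sketch of that representation, using made-up city names and an arbitrary record count chosen for illustration:

```r
set.seed(1)

# 100,000 records drawn from only five distinct (made-up) city names.
cities <- c("MIAMI", "TAMPA", "ORLANDO", "CHICAGO", "HOUSTON")
city_chr <- sample(cities, 1e5, replace = TRUE)
city_fct <- factor(city_chr)

levels(city_fct)            # the labels, each stored once
head(as.integer(city_fct))  # one integer code per record

# Converting back to character recovers the original values.
identical(as.character(city_fct), city_chr)

# Compare the in-memory footprint of the two representations.
object.size(city_chr)
object.size(city_fct)
```

On a typical 64-bit build, the factor should come out noticeably smaller for a low-cardinality column like this one. For a column whose values are nearly all unique, such as a street address, the advantage may shrink or reverse, since the factor then carries both the integer codes and a long vector of levels.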

With over 37 M records and 30 attributes as of today, the sheer size of this data creates challenges that can help answer questions smaller fabricated data sets cannot. And many of the attributes, such as name, street address, and city, are inherently character, allowing tests of how they might optimally be stored. So for me, the decision to experiment with the character storage options was a no-brainer.

To conduct the tests, I downloaded the four annual files to my notebook and developed scripts using Jupyter Notebook and Microsoft R 3.4.3. With the R data.table package, I created two versions of the medicarephysician structure, the first storing character columns as character, the second storing the same columns as factors. I compared the memory requirements of each data.table, and contrasted the performance and file size of writing and reading serializable export files.
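The shape of such a comparison might look like the sketch below. It rests on several assumptions: the columns and values are invented, a plain data.frame stands in for the data.table, `saveRDS`/`readRDS` stand in for whichever export format the tests actually used, and `bench` is a hypothetical helper.

```r
set.seed(42)
n <- 50000

# Version 1: made-up columns, with text stored as character.
phys_chr <- data.frame(
  provider_type = sample(c("Ophthalmology", "Cardiology", "Internal Medicine"),
                         n, replace = TRUE),
  state = sample(c("FL", "TX", "NY", "CA"), n, replace = TRUE),
  payment = round(runif(n, 10, 1000), 2),
  stringsAsFactors = FALSE
)

# Version 2: the same data with every character column converted to a factor.
phys_fct <- phys_chr
is_chr <- vapply(phys_fct, is.character, logical(1))
phys_fct[is_chr] <- lapply(phys_fct[is_chr], factor)

# Measure memory use, export file size, and write/read times for one version.
bench <- function(dat) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  write_secs <- system.time(saveRDS(dat, path))[["elapsed"]]
  file_bytes <- file.size(path)
  read_secs <- system.time(readRDS(path))[["elapsed"]]
  c(mem_bytes = as.numeric(object.size(dat)), file_bytes = file_bytes,
    write_secs = write_secs, read_secs = read_secs)
}

results <- rbind(character = bench(phys_chr), factor = bench(phys_fct))
results
```

Timings on a table this small are noisy and say little about 37 M records; the sketch only shows how the two versions can be measured side by side.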

The results follow.....

Read the entire article here.

© 2019 Data Science Central ®