After posting my last blog, I decided next to do a 2-part series on XGBoost, a versatile, highly performant, interoperable machine learning platform. I'm a big fan of XGBoost and other multi-language ML packages such as H2O, but generally test them in R environments. So, this time I've chosen to work in Python.
My plan was to have the first blog revolve around data munging and exploration, with the second focusing on modeling. I picked the newly released Census 2017 American Community Survey (ACS) Public Use Microdata Sample (PUMS) data to analyze. I particularly like that PUMS is meaty, consisting of over 3M records and 250+ attributes, while remaining clean and meaningful. I can't wait to put the 5-year 2013-2017 data to the test in January when it becomes available.
I got started with wrangling and exploring the data, using Python/Pandas for the data management lifting. The task I gave myself was to look at income, the target variable, as a function of potential explanatory attributes such as age, sex, race, education, and marital status. Working in a Python Jupyter notebook with R magic enabled, I'm able to interoperate between Python and R, taking advantage of R's splendid ggplot graphics package. For numeric targets, I make extensive use of R's dot plots to visualize relationships to the predictors.
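A minimal sketch of that interop pattern is below. It assumes the rpy2 package is installed; the data frame `df` and its `income` and `education` columns are illustrative stand-ins, not the actual PUMS field names, and the two magics must live in separate notebook cells.

```python
# Notebook cell 1: load the R magic shipped with rpy2
%load_ext rpy2.ipython

# Notebook cell 2: pass the pandas frame `df` into R with -i,
# then draw a ggplot dot plot of median income by education level.
# (%%R must be the first line of its own cell.)
%%R -i df -w 800 -h 500
library(ggplot2)
p <- ggplot(df, aes(x = income, y = reorder(education, income, median))) +
  stat_summary(fun = median, geom = "point") +
  labs(x = "median income", y = "") +
  theme_minimal()
print(p)
```

The `-i df` flag is what does the Python-to-R handoff; rpy2 converts the pandas DataFrame to an R data.frame behind the scenes.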
After building my data sets and setting aside a test set for later use, I was left with about 1.5M qualified records for training and exploration. Once I'd reviewed frequencies for each attribute, I looked at breakdowns of income by attribute levels of predictors such as age and education. One particularly interesting factor I examined was the Public Use Microdata Area, puma, a geography dimension consisting of 2,378 statistical areas covering the U.S. In contrast to state, puma would seem to offer a much more granular geography grouping.
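The breakdown step can be sketched in pandas along these lines. The toy frame and its `puma`/`income` columns are illustrative, not the actual PUMS fields or values:

```python
import pandas as pd

# Toy stand-in for the training frame: income plus a geography attribute
df = pd.DataFrame({
    "puma":   ["A", "A", "B", "B", "B", "C"],
    "income": [30000, 50000, 90000, 110000, 100000, 45000],
})

# Record count and median income by puma, sorted descending on the median
# to surface the between-area differences
breakdown = (df.groupby("puma")["income"]
               .agg(["count", "median"])
               .sort_values("median", ascending=False))
print(breakdown)
```

The same `groupby`/`agg` pattern works for any categorical predictor (age bands, education, marital status); only the grouping column changes.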
What an understatement! The between-puma differences in income are nothing short of stunning. I was so taken aback by what I was seeing that, after triple-checking the data and calculations, I decided to post a Part 0 write-up detailing some of the findings. The remainder of this blog outlines several of the analysis steps, starting from the finalized training data, whose construction will be detailed in Part 1 after the holidays. Part 2 will focus on modeling in XGBoost.
The technology is a Python kernel Jupyter notebook with R magic enabled.
Find the remainder of the blog here.
Comment
William --
Thanks for the note.
Couldn't agree with you more about the movement to cooperation between R and Python -- in both directions.
Also, ggplot is a mature, easy-to-use graphical specification language that meets many DS needs directly and can also serve to prototype more sophisticated visuals.
Best,
Steve
Steve,
Excellent blend of R and Python. You made the analysis phase almost seamless, taking advantage of both tools.
I think as time goes on, the two environments will become more "cooperative" and offer a set of tools that are better than commercial products. Good news for data scientists!
© 2021 TechTarget, Inc.