After posting my last blog, I decided to do a 2-part series on XGBoost, a versatile, highly performant, interoperable machine learning platform. I'm a big fan of XGBoost and other multi-language ML packages such as H2O, but I generally test them in R environments. So this time I've chosen to work in Python.

My plan was to have the first blog revolve around data munging and exploration, with the second focusing on modeling. I picked the newly released Census 2017 American Community Survey (ACS), Public Use Microdata Samples ... data to analyze. I particularly like that PUMS is meaty, consisting of over 3M records and 250+ attributes, while also being clean and meaningful. I can't wait to put the 5-year 2013-2017 data to the test in January when it becomes available.

I got started with wrangling and exploring the data, using Python/Pandas for the data management heavy lifting. The task I gave myself was to look at income, the target variable, as a function of potential explanatory attributes such as age, sex, race, education, and marital status. Working in a Python Jupyter notebook with R magic enabled, I'm able to interoperate between Python and R, taking advantage of R's splendid ggplot graphics package. For numeric targets, I make extensive use of R's dot plots to visualize relationships to the predictors.

After building my data sets and setting aside a test set for later use, I was left with about 1.5M qualified records for training and exploration. Once I'd reviewed frequencies for each attribute, I looked at breakdowns of income by the levels of predictors such as age and education. One particularly interesting factor I examined was the Public Use Microdata Area, puma, a geography dimension consisting of 2,378 statistical areas covering the U.S. Compared with state, puma would seem to offer a much more granular geographic grouping.
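To make the breakdown step concrete, here is a minimal pandas sketch of holding out a test set and summarizing income by the levels of a predictor. It is not the notebook code from this series: the data is synthetic, the column names (`puma`, `education`, `income`), the 50% hold-out fraction, and the helper `income_by` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the PUMS extract; values and column names are
# hypothetical and chosen only to illustrate the breakdown step.
rng = np.random.default_rng(0)
n = 1000
pums = pd.DataFrame({
    "puma": rng.integers(1, 6, size=n),                # 5 fake areas
    "education": rng.choice(["HS", "BA", "MA"], size=n),
    "income": rng.lognormal(mean=10.5, sigma=0.8, size=n).round(0),
})

# Set aside a test set for later use (the 50% fraction is an assumption).
test = pums.sample(frac=0.5, random_state=0)
train = pums.drop(test.index)

def income_by(df, attribute):
    """Count, median, and mean income for each level of `attribute`,
    sorted from highest to lowest median."""
    return (
        df.groupby(attribute)["income"]
          .agg(n="count", median="median", mean="mean")
          .sort_values("median", ascending=False)
    )

by_puma = income_by(train, "puma")
print(by_puma)
```

The same helper works for any categorical predictor, for example `income_by(train, "education")`. I sort on the median rather than the mean here because income distributions tend to be right-skewed.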

What an understatement! The between-puma differences in income are nothing short of stunning. I was so taken aback by what I was seeing that, after triple-checking the data and calculations, I decided to post a Part 0 write-up detailing some of the findings. The remainder of this blog outlines several of the analysis steps, starting with the finalized training data, which will be detailed in Part 1 after the holidays. Part 2 will focus on modeling in XGBoost.

The technology is a Python kernel Jupyter notebook with R magic enabled.

Find the remainder of the blog here.


Comment by steve miller on December 23, 2018 at 9:25am

William --

Thanks for the note.

Couldn't agree with you more about the movement to cooperation between R and Python -- in both directions.

Also, ggplot is a mature, easy-to-use graphical specification language that meets many DS needs directly and can also serve to prototype more sophisticated visuals.



Comment by William Holst on December 23, 2018 at 6:57am


Excellent blend of R and Python. You made the analysis phase almost seamless, taking advantage of both tools.

I think as time goes on, the two environments will become more "cooperative" and offer a set of tools that are better than commercial products. Good news for data scientists!
