
An Exploration of Perkins Loan Default Rate Data

Contributed by Gordon. Gordon took the NYC Data Science Academy 12-week full-time Data Science Bootcamp ... from Sept 23 to Dec 18, 2015. This post is based on his first class project (due in the 2nd week of the program).


In the news-cycle-driven machine of modern society, we often get caught up in whatever is being discussed by the talking heads on whichever screen we're looking at. When these avatars stop speaking about an issue, it often disappears from our consciousness as well, especially if it doesn't particularly affect us. Out of sight, out of mind, as it were.

The student debt crisis is one of the issues caught in this revolving door, and it also happens to be one of the biggest issues facing a large portion of young Americans. Unofficial counters place total student debt at well over one trillion dollars. This burden is split between private and federal loans.

Unsurprisingly, federal loan programs are generally more lenient than those provided by the private sector. Before this project, the only federal loan programs I knew about were TAP and FAFSA. It was through the data used in this project that I learned about the Perkins Loan.


Armed with data from 2011-2014, I began my analysis.

The End Goal

My goal in this first project was to explore the default rates associated with this loan through various visualizations.


The data source has nine years of data, but only the three most recent years were available in a format other than PDF. Even so, the xlsx files provided were cluttered with superfluous trappings like conditional formatting and colors. All of those had to be removed before I could load the data into R. Once that manual labor was done, the real work began.

My first round of data cleaning mostly involved merging the data into one data frame. That meant some sensible renaming of columns and adding a column to tag each data point with its year.

rename.columns = c('Serial',

names(perkins1112) = names(perkins1213) = names(perkins1314) = rename.columns

perkins1314$year = '13-14'

perkins = rbind(perkins1112, perkins1213, perkins1314)  # 'perkins' is an assumed name; the original identifier is missing here
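
The merge step can be sketched end to end in base R. This is only an illustration under stated assumptions: the three toy tables, their column names (other than `Serial` and `ST`, which appear in the post), the year labels for the first two tables, and the name `perkins.all` are all hypothetical stand-ins, not the original code.

```r
# Toy stand-ins for the three yearly tables (values and most column names are made up).
perkins1112 <- data.frame(Serial = 1:2, ST = c("NY", "CA"), Borrowers = c(10, 20))
perkins1213 <- data.frame(Serial = 1:2, ST = c("NY", "CA"), Borrowers = c(12, 18))
perkins1314 <- data.frame(Serial = 1:2, ST = c("NY", "CA"), Borrowers = c(15, 17))

# Pair each table with an assumed school-year label.
yearly <- list("11-12" = perkins1112, "12-13" = perkins1213, "13-14" = perkins1314)

# Add a 'year' column to every table so rows stay identifiable after stacking.
tagged <- Map(function(df, yr) { df$year <- yr; df }, yearly, names(yearly))

# rbind needs identical column names in every table, hence the renaming step.
perkins.all <- do.call(rbind, tagged)
rownames(perkins.all) <- NULL
```

Tagging before stacking matters because the year is not otherwise recoverable once the rows sit in a single table.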

My second round of cleaning involved introducing state-level granularity. It was important here to apply a per capita scaling, since the data would involve a lot of comparisons between states of varying populations. A quick visit to the US Census Bureau's website provided the necessary csv files. It is at this point that I knowingly introduced some error: the census data is for the calendar year, but the Perkins Loan data is based on the school year. I averaged the populations of the two consecutive years in question to match up the two data files as closely as possible, and then merged them. My data was ready to work with.

data(state.regions)

perkins = merge(perkins, state.regions, by.x='ST', by.y='abb')  # 'perkins' is an assumed name; the original identifier is missing here

perkins = tbl_df(perkins)
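
The population averaging and per capita scaling described above might look like the following in base R. Everything here is a hypothetical sketch: the state codes `AA` and `BB`, the population and loan figures, and the column names `pop2013`, `pop2014`, `owed`, and `owed.per.million` are invented for illustration and do not come from the census or Perkins files.

```r
# Made-up census populations for two consecutive calendar years.
census <- data.frame(
  abb     = c("AA", "BB"),
  pop2013 = c(1000000, 4000000),
  pop2014 = c(3000000, 4000000)
)

# A school year straddles two calendar years, so average the two estimates.
census$pop.avg <- (census$pop2013 + census$pop2014) / 2

# Made-up state-level totals of money owed for the 13-14 school year.
loans <- data.frame(ST = c("AA", "BB"), owed = c(2000000, 2000000))

# Join on the state abbreviation, then scale to dollars owed per one million people.
state.data <- merge(loans, census, by.x = "ST", by.y = "abb")
state.data$owed.per.million <- state.data$owed / (state.data$pop.avg / 1e6)
```

With this scaling, two states owing the same total can rank very differently once population is taken into account.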


In terms of structuring the flow of my visualizations, I decided to go from least to most granularity. I started off with looking at the yearly trend of money owed by those in severe default, which, unsurprisingly, increased year on year.


A similar temporal visualization based on the number of borrowers in severe default showed a similar trend.
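
The yearly totals behind both trend plots can be computed with a single grouped sum. The sketch below uses base R's `aggregate`; the toy rows and the column names `owed` and `borrowers` are assumptions, not the columns of the real data set.

```r
# Toy rows: one per college per year (values are made up).
perkins.all <- data.frame(
  year      = c("11-12", "11-12", "12-13", "12-13", "13-14", "13-14"),
  owed      = c(100, 200, 150, 250, 300, 100),
  borrowers = c(1, 2, 2, 3, 4, 1)
)

# Sum money owed and number of borrowers within each school year.
yearly.totals <- aggregate(cbind(owed, borrowers) ~ year,
                           data = perkins.all, FUN = sum)
```

The resulting table has one row per year and can be passed directly to a plotting function such as `barplot(yearly.totals$owed, names.arg = yearly.totals$year)`.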


Next I looked at state-level data. Using the choroplethr package, I made a series of choropleth maps from this state-level data for the three years of data. The gif below shows the default rate in each state over three years.


The second shows the average amount of money owed by those in default for more than 240 days.
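
For readers who want to reproduce a single frame of such a map, here is a hedged sketch. The choroplethr function `state_choropleth` expects a data frame with a `region` column (lowercase full state names) and a `value` column; the rates below are made up, and the helper name `draw_state_map` is hypothetical.

```r
# Made-up default rates keyed by state abbreviation.
rates <- data.frame(ST = c("NY", "CA", "ND"),
                    default.rate = c(10, 8, 12),
                    stringsAsFactors = FALSE)

# Convert abbreviations to lowercase full names using R's built-in state vectors.
map.df <- data.frame(
  region = tolower(state.name[match(rates$ST, state.abb)]),
  value  = rates$default.rate,
  stringsAsFactors = FALSE
)

# Hypothetical helper: draws the map only if choroplethr is installed.
draw_state_map <- function(df, title = "Default rate by state (toy data)") {
  stopifnot(requireNamespace("choroplethr", quietly = TRUE))
  choroplethr::state_choropleth(df, title = title)
}
# Usage (not run here): print(draw_state_map(map.df))
```

States absent from `map.df` are simply left without a value, so a real run would pass all states for a given year.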


To end my exploration I went to the lowest level of granularity and looked at all the colleges across all three years as a whole. The highlighted colleges are the ones I thought were interesting, but special emphasis goes to those with a low number of borrowers but a high principal owed.
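
One way to surface colleges with few borrowers but a high principal owed is to flag rows against simple cutoffs. The medians used below are an assumed rule chosen for illustration; the post does not say how the highlighted colleges were picked, and the toy colleges and column names are hypothetical.

```r
# Toy college-level rows (made-up values).
colleges <- data.frame(
  name      = c("A", "B", "C", "D"),
  borrowers = c(5, 500, 8, 400),
  principal = c(90000, 100000, 1000, 2000),
  stringsAsFactors = FALSE
)

# Assumed cutoffs: at or below the median borrower count,
# and at or above the median principal owed.
few.borrowers  <- colleges$borrowers <= median(colleges$borrowers)
high.principal <- colleges$principal >= median(colleges$principal)

# Keep only the colleges that satisfy both conditions.
flagged <- colleges[few.borrowers & high.principal, ]
```

Stricter quantiles, for example the top and bottom deciles, would narrow the list further on a real data set.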


The CUNY system in New York and DeVry in Chicago stood out, as did Johnson and Wales University in Pennsylvania. In fact, the Philadelphia-based institution had the distinction of a high volume of loans for a comparatively low number of borrowers.

Looking at individual states, North Dakota had the most money owed scaled by population, so I had a look at those colleges.


California was at the other extreme with the least money owed per one million people.


New York ranks forty-eighth by the same metric.
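
State rankings like these follow directly from the per capita figure. A minimal sketch, assuming a table with a hypothetical `owed.per.million` column and made-up values for three placeholder states:

```r
# Made-up per capita figures for three placeholder states.
state.data <- data.frame(
  ST = c("AA", "BB", "CC"),
  owed.per.million = c(500, 1500, 1000),
  stringsAsFactors = FALSE
)

# Rank 1 = most money owed per one million people.
state.data$rank <- rank(-state.data$owed.per.million, ties.method = "min")

# Sort so the most indebted state comes first.
ranked <- state.data[order(state.data$rank), ]
```

Negating the metric inside `rank` makes the largest value rank first, which matches the "most money owed" ordering used in the text.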


Comparing New York and California brings up an interesting observation. The data references New York's city college system as a whole, while California's equivalent system has its colleges listed individually. This calls into question the way the data was reported by the colleges, and makes one wonder whether the government should establish a standard across the board.


My analysis showed that the northwest of the United States is the most severely indebted to the Perkins Loan program, with a few states like Maine and Delaware in a similar position. For the most part, states seem to have their loans under control from a wide perspective.

This only provides a snapshot of the loan crisis. For the analysis to be hard-hitting, I need more data. Economic data for each state would be useful, as would tuition costs and cost-of-living estimates. I hope to expand this analysis in the future.


Here are the slides from my presentation:

And the link to my code:
