NFL Play by Play analysis using Cloudera Impala

We had the chance to work with the NFL play-by-play dataset covering 2002 through 2013, and the best part is that the analysis was carried out within Hadoop using Cloudera Impala.

We wanted to run the analysis at the individual game level, but the data had a mixed grain that included the play-by-play rows. So we applied some SQL filters to restrict it to the first row of each game's play-by-play data.
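As a rough illustration of that filtering step, here is a minimal Python sketch of the same idea: keep only the first play row for each game. The actual filtering was done in SQL on Impala. The column layout (`game_id`, `play_number`, `winner`, `loser`) and the sample rows are assumptions made up for this example, not the real dataset schema.

```python
# Hypothetical play-by-play rows: (game_id, play_number, winner, loser).
# The layout and values are invented for illustration only.
plays = [
    ("g1", 1, "Team A", "Team B"),
    ("g1", 2, "Team A", "Team B"),
    ("g2", 2, "Team C", "Team A"),
    ("g2", 1, "Team C", "Team A"),
    ("g2", 3, "Team C", "Team A"),
]


def first_row_per_game(rows):
    """Keep only the lowest-numbered play row for each game_id."""
    first = {}
    for row in rows:
        game_id, play_number = row[0], row[1]
        if game_id not in first or play_number < first[game_id][1]:
            first[game_id] = row
    # One row per game, ordered by game_id for stable output.
    return [first[game_id] for game_id in sorted(first)]


games = first_row_per_game(plays)
for game in games:
    print(game)
```

Each remaining row then stands for exactly one game, so game-level counts are not inflated by the number of plays.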

Here are some interesting insights:

1. Who won the most NFL games?

This was basically a matter of grouping by the winning team, counting the wins, and sorting the counts in descending order.
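The group, count, and sort steps might look like the following in Python. This is only a sketch: the analysis itself ran as SQL on Impala, and the `(winner, loser)` tuples below are made-up sample games, not results from the dataset.

```python
from collections import Counter

# Hypothetical game-level rows: (winner, loser). Invented sample data.
sample_games = [
    ("Team A", "Team B"),
    ("Team C", "Team A"),
    ("Team A", "Team C"),
    ("Team B", "Team C"),
    ("Team A", "Team B"),
]


def wins_by_team(games):
    """Count wins per team and sort by wins descending, then team name."""
    counts = Counter(winner for winner, _loser in games)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = wins_by_team(sample_games)
for team, wins in ranking:
    print(team, wins)
```

Sorting on the team name as a secondary key is a design choice made here so that ties come out in a predictable order.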

2. Who won the most games against which losing team?

The winning team runs down the y-axis and the losing team runs across the x-axis. Each intersection shows the number of times the winning team beat that particular losing team from 2002 through 2013.
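The grid described above is a cross-tabulation of winner against loser. A minimal Python sketch of how such a matrix could be built is shown here, with rows as winners and columns as losers. The function names and the sample games are hypothetical and are not taken from the original analysis.

```python
from collections import Counter

# Hypothetical game-level rows: (winner, loser). Invented sample data.
matchups = [
    ("Team A", "Team B"),
    ("Team A", "Team B"),
    ("Team B", "Team A"),
    ("Team C", "Team A"),
]


def head_to_head(games):
    """Count how many times each winner beat each loser."""
    return Counter((winner, loser) for winner, loser in games)


def as_matrix(counts):
    """Return (teams, grid) where grid[i][j] is wins of teams[i] over teams[j]."""
    teams = sorted({team for pair in counts for team in pair})
    grid = [[counts.get((w, l), 0) for l in teams] for w in teams]
    return teams, grid


pair_counts = head_to_head(matchups)
teams, grid = as_matrix(pair_counts)
for team, row in zip(teams, grid):
    print(team, row)
```

The diagonal is always zero because a team never plays itself.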

3. Is there a correlation between playing surface and the winning team?

Check out the detailed Analysis of NFL play by play dataset


Comment by Nilesh Jethwa on June 8, 2015 at 10:29am

Irv, that is a good idea. It should probably include a new metric that is relative to the number of games played between two teams.

Comment by Irv Lustig on June 8, 2015 at 10:22am

There is a problem with the visualization of the "Who won the most games against which losing team?", because it is looking at absolute wins, and teams play other teams in their division twice per year, but teams outside their division much less often.  The visualization should reflect a value relative to the number of games played between two teams.

