We had the chance to work with the NFL play-by-play dataset covering 2002 through 2013, and the best part is that the analysis was carried out within Hadoop using Cloudera Impala.
For the analysis we wanted to work at the individual game level, but the data was of mixed grain, with game-level information mixed in with the play-by-play rows. So we applied some SQL filters to restrict it to the first row of each game's play-by-play data.
Here are some interesting insights:
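To make the filtering step concrete, here is a minimal sketch of reducing play-level rows to one row per game. It uses Python's built-in sqlite3 as a stand-in for Impala, and the table name `plays`, the columns `game_id`, `play_seq`, `winner` and `loser`, and the sample rows are all made up for illustration; the real dataset's schema may look quite different.

```python
import sqlite3

# Hypothetical mixed-grain table: the game-level fields (winner, loser)
# repeat on every play row. Names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE plays (game_id TEXT, play_seq INTEGER, "
    "winner TEXT, loser TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?, ?, ?)",
    [
        ("g1", 1, "A", "B", "kickoff"),
        ("g1", 2, "A", "B", "run"),
        ("g2", 1, "C", "A", "kickoff"),
        ("g2", 2, "C", "A", "pass"),
        ("g2", 3, "C", "A", "punt"),
    ],
)


def one_row_per_game(conn):
    """Keep only the first play row of each game (game-level grain)."""
    query = """
        SELECT p.game_id, p.winner, p.loser
        FROM plays p
        JOIN (
            SELECT game_id, MIN(play_seq) AS first_seq
            FROM plays
            GROUP BY game_id
        ) f ON p.game_id = f.game_id AND p.play_seq = f.first_seq
        ORDER BY p.game_id
    """
    return conn.execute(query).fetchall()


games = one_row_per_game(conn)
print(games)
```

The join against the per-game minimum is plain SQL, so the same idea should carry over to other engines, though the exact filter we used in Impala depended on the dataset's own columns.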
1. Who won the most NFL games?
This came down to grouping by the winning team, counting the wins, and sorting the counts in descending order.
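A rough sketch of that group, count and sort step, again using sqlite3 in place of Impala. The `games` table, its `winner` and `loser` columns, and the four sample games are assumptions for illustration, not the real data.

```python
import sqlite3

# Hypothetical game-level table, one row per game (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game_id TEXT, winner TEXT, loser TEXT)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?)",
    [
        ("g1", "A", "B"),
        ("g2", "A", "C"),
        ("g3", "B", "C"),
        ("g4", "A", "B"),
    ],
)


def wins_by_team(conn):
    """Count wins per team and sort from most to fewest wins."""
    query = """
        SELECT winner, COUNT(*) AS wins
        FROM games
        GROUP BY winner
        ORDER BY wins DESC, winner
    """
    return conn.execute(query).fetchall()


wins = wins_by_team(conn)
print(wins)
```

The secondary sort on `winner` is only there to make ties come out in a stable order.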
2. Who won the most games against which losing team?
In the chart, the winning teams run down the y axis and the losing teams run across the x axis. Each intersection shows the number of times the winning team beat that particular losing team from 2002 through 2013.
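The numbers behind a grid like this can be sketched as a count per (winner, loser) pair, pivoted into a nested mapping. As before, sqlite3 stands in for Impala, and the `games` table, its columns and the sample rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical game-level table, one row per game (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game_id TEXT, winner TEXT, loser TEXT)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?)",
    [
        ("g1", "A", "B"),
        ("g2", "A", "C"),
        ("g3", "B", "C"),
        ("g4", "A", "B"),
    ],
)


def win_matrix(conn):
    """Return {winner: {loser: wins}} for every pairing that occurred."""
    query = """
        SELECT winner, loser, COUNT(*) AS wins
        FROM games
        GROUP BY winner, loser
    """
    matrix = defaultdict(dict)
    for winner, loser, wins in conn.execute(query):
        matrix[winner][loser] = wins
    return dict(matrix)


matrix = win_matrix(conn)
print(matrix)
```

Each outer key corresponds to a row of the grid (the winner) and each inner key to a column (the loser); pairings that never occurred are simply absent rather than stored as zero.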
3. Is there a correlation between playing surface and the winning team?
Check out the detailed analysis of the NFL play-by-play dataset.