Contributed by Daniel Donohue. Daniel took the NYC Data Science Academy 12-week bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his first class project (due in the 2nd week of the program).
# Load the required packages, the datasets, and create character vectors of
# dome and outdoor teams.
library(dplyr)
library(reshape2)
library(ggplot2)
game <- read.csv("nfl_00-14/csv/game.csv", stringsAsFactors = FALSE)
team <- read.csv("nfl_00-14/csv/team.csv", stringsAsFactors = FALSE)
dome.teams <- c("ATL", "MIN", "NO", "STL", "DET", "IND", "ARI", "HOU")
outdoor.teams <- unique(filter(game, !(h %in% dome.teams))$h)
# Add a column to the game dataframe for final margin from the perspective of
# the visiting team. A negative final margin indicates that the visiting
# team lost.
game <- mutate(game, v.margin=ptsv - ptsh)
# Create an object containing instances of dome teams playing in open-air
# stadiums, and outdoor teams playing away. Note that the Dallas Cowboys
# moved from an open-air stadium to a dome in the 2009 season.
dome.at.outdoor <- filter(game,
                          (v %in% dome.teams | (v == "DAL" & seas > 2008)) &
                            (h %in% outdoor.teams))
outdoor.away <- filter(game,
                       (v %in% outdoor.teams | (v == "DAL" & seas <= 2008)))
dome.seas.margin <- group_by(dome.at.outdoor, seas) %>%
  summarise(avg.away.margin=mean(v.margin))
outdoor.seas.margin <- group_by(outdoor.away, seas) %>%
  summarise(avg.away.margin=mean(v.margin))
# Melt these into a single dataframe for ggplot.
away.margin <- melt(list(dome.seas.margin, outdoor.seas.margin),
                    id.vars="seas")
avg.visiting.margin <- ggplot(data=away.margin, aes(x=seas, y=value,
                              colour=factor(L1), group=factor(L1))) +
  geom_line() +
  scale_color_manual(name="Away Margin in the NFL",
                     breaks=c(1, 2),
                     labels=c("Dome Teams \n Playing Outdoors",
                              "Outdoor Stadium Teams' \n Away Margin"),
                     values=c('blue', 'orange')) +
  theme_bw() +
  xlab("Season") +
  ylab("Average Margin") +
  ggtitle("Average Margin of Victory in the NFL, 2000-2014")
avg.visiting.margin
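The `L1` column used for colour and grouping in the plot is not present in either summary table; `melt()` adds it when it is handed a list. The following is a minimal sketch on made-up numbers, assuming reshape2 is installed. The two toy data frames and their values are purely illustrative and are not taken from the NFL data.

```r
library(reshape2)

# Hypothetical per-season averages for two groups (made-up values).
toy.dome    <- data.frame(seas = c(2000, 2001), avg.away.margin = c(-5, -3))
toy.outdoor <- data.frame(seas = c(2000, 2001), avg.away.margin = c(-2, -1))

# Melting an unnamed list stacks the data frames and records each row's
# position in the list in a new column, L1 (1 = first element, 2 = second).
toy.melted <- melt(list(toy.dome, toy.outdoor), id.vars = "seas")

print(toy.melted)
```

Because `L1` is just the list position, the order of the data frames in the list determines which group is labelled 1 and which is labelled 2 in `breaks` and `labels`.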
Next, we want to see what the distributions of away margins look like for dome teams and for the rest of the NFL. Again, we first need to prepare a dataframe for visualization.
# Melt the dome and outdoor away-game dataframes into a single dataframe.
away.all <- melt(list(dome.at.outdoor, outdoor.away), id.vars="gid",
                 measure.vars="v.margin")
# Calculate the average margins because I'm going to overlay these on the
# density plots.
mean.away.all <- group_by(away.all, L1) %>%
  summarise(mean.val=mean(value)) %>%
  select(L1, mean.val)
And we obtain this:
visiting.density <- ggplot(data=away.all, aes(x=value,
                           fill=factor(L1))) +
  geom_density(alpha=.2) +
  scale_fill_manual(name="",
                    breaks=c(1, 2),
                    labels=c("Dome Teams \nPlaying Outdoors",
                             "Outdoor Stadium \nTeams Away"),
                    values=c("blue", "orange")) +
  geom_vline(data=mean.away.all, aes(xintercept=mean.val,
             colour=factor(L1)), linetype="dashed", size=.75, alpha=.5) +
  scale_colour_manual(breaks=c(1, 2), values=c("blue", "orange")) +
  theme_bw() +
  xlab("Final Margin") +
  ylab("Density") +
  geom_text(data=mean.away.all, aes(x=4.1, y=.0375, label="-1.94 pts/game"),
            color='orange', size=5) +
  geom_text(data=mean.away.all, aes(x=-10, y=.0375, label="-4.05 pts/game"),
            color='blue', size=5) +
  ggtitle("Density Plots of Away Margin in the NFL, 2000-2014")
visiting.density
The distribution for outdoor teams appears roughly normal. The distribution for dome teams is somewhat negatively skewed, which indicates that dome teams tend to be on the receiving end of more blowouts. This is in line with the initial hypothesis that dome teams perform worse on the road than the rest of the league, but is the difference in means statistically significant? To get an idea, we can perform a two-sample t-test, with the alternative hypothesis that the true average away margin of dome teams is less than that of the rest of the NFL.
t.test(dome.at.outdoor$v.margin, outdoor.away$v.margin, alternative = "less")
##
## Welch Two Sample t-test
##
## data: dome.at.outdoor$v.margin and outdoor.away$v.margin
## t = -3.68, df = 1247.1, p-value = 0.0001216
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -1.16323
## sample estimates:
## mean of x mean of y
## -4.047736 -1.943072
The p-value (about 0.00012) is far below any conventional significance level. We therefore reject the null hypothesis that the two groups have the same mean, in favor of the alternative hypothesis that the mean visiting margin of dome teams playing outdoors is less than the mean visiting margin for the rest of the NFL.
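To make the reported `t`, `df`, and p-value less of a black box, here is a small sketch that rebuilds a one-sided Welch test by hand and compares it with `t.test()`. The two samples are made-up margins, not NFL data, and the variable names are illustrative only.

```r
# Toy samples (made-up margins, not NFL data).
x <- c(-10, -7, -3, 0, 3, -14)
y <- c(-3, 4, -7, 10, 0, 7, -1)

# Per-group variance of the sample mean (no pooling, as in Welch's test).
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch's t statistic: difference in means over its standard error.
t.manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom.
df.manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-sided p-value for the alternative "mean of x is less than mean of y".
p.manual <- pt(t.manual, df.manual)

# t.test() uses the Welch form by default (var.equal = FALSE).
res <- t.test(x, y, alternative = "less")
print(c(manual = t.manual, builtin = unname(res$statistic)))
print(c(manual = p.manual, builtin = res$p.value))
```

The non-integer degrees of freedom in the output above (`df = 1247.1`) come from this approximation, which is why they do not simply equal the total number of games minus two.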
# Collapse the raw condition strings into six broader groups. Any
# condition not listed below falls through to "Snow".
weather <- function(x) {
  if(x %in% c("Chance Rain", "Light Rain", "Rain", "Thunderstorms")) {
    return("Rain")
  }
  else if(x %in% c("Clear", "Fair", "Partly Sunny", "Mostly Sunny", "Sunny")) {
    return("Clear")
  }
  else if(x %in% c("Closed Roof", "Dome")) {
    return("Dome")
  }
  else if(x %in% c("Cloudy", "Partly Cloudy", "Mostly Cloudy")) {
    return("Cloudy")
  }
  else if(x %in% c("Foggy", "Hazy")) {
    return("Fog")
  }
  else {
    return("Snow")
  }
}
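The `weather()` function handles one string at a time, which is why it is applied with `sapply()` later on. As an alternative sketch (not the author's method), the same grouping can be written as a named lookup vector that works on a whole column at once. The names `weather.groups` and `group.weather` are hypothetical, and the fallback to "Snow" is kept only to mirror the final `else` branch.

```r
# Lookup table from raw condition strings to broader groups; the strings
# are the ones listed in weather().
weather.groups <- c(
  "Chance Rain" = "Rain", "Light Rain" = "Rain", "Rain" = "Rain",
  "Thunderstorms" = "Rain",
  "Clear" = "Clear", "Fair" = "Clear", "Partly Sunny" = "Clear",
  "Mostly Sunny" = "Clear", "Sunny" = "Clear",
  "Closed Roof" = "Dome", "Dome" = "Dome",
  "Cloudy" = "Cloudy", "Partly Cloudy" = "Cloudy", "Mostly Cloudy" = "Cloudy",
  "Foggy" = "Fog", "Hazy" = "Fog"
)

# Hypothetical vectorised helper: look every element up by name, then
# send anything that was not found to "Snow".
group.weather <- function(x) {
  out <- unname(weather.groups[x])
  out[is.na(out)] <- "Snow"
  out
}

# "Flurries" is not in the table, so it lands in the fallback group.
print(group.weather(c("Light Rain", "Dome", "Flurries")))
```

A catch-all fallback is convenient, but it also means any unexpected or misspelled condition is silently counted as snow, so it is worth checking the distinct values of `cond` before relying on it.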
dome.weather <- inner_join(dome.at.outdoor, team, "gid") %>%
  # Filter out the outdoor teams that are hosting the dome teams.
  filter(tname %in% dome.teams | (tname == "DAL" & seas > 2008)) %>%
  # Select weather condition, visiting margin, points scored, rushing first downs,
  # passing first downs, first downs obtained through penalties, rushing yardage,
  # passing yardage, penalties committed, red zone attempts, red zone conversions,
  # short third-down attempts, short third-down conversions, long third-down attempts, and
  # long third-down conversions.
  select(cond, v.margin, pts, rfd, pfd, ifd, ry, py, pen,
         rza, rzc, s3a, s3c, l3a, l3c) %>%
  # Add columns for total first downs, red zone efficiency, and adjusted third-down
  # efficiency (which lends greater weight to long third-down conversions).
  mutate(fd=rfd + pfd + ifd, rze=rzc / rza,
         a3e= (s3c + 1.5 * l3c) / (s3a + l3a)) %>%
  # Drop irrelevant columns.
  select(-c(rfd, pfd, ifd, rza, rzc, s3a, s3c, l3a, l3c)) %>%
  filter(cond != "") %>% # A few rows didn't have a condition recorded.
  # Replace NaNs with 0 for teams that had no red zone attempts, and group
  # similar weather conditions using the above-defined function.
  mutate(rze=ifelse(is.nan(rze), 0, rze), cond=sapply(cond, weather)) %>%
  group_by(cond) %>%
  # summarise_each() and funs() are deprecated in current dplyr;
  # summarise(across(everything(), mean)) is the modern equivalent.
  summarise_each(funs(mean))
dome.weather <- melt(dome.weather, id.vars="cond")
levels(dome.weather$variable) <- c("Margin", "Points", "Rushing Yards",
                                   "Passing Yards", "Penalty Yards", "First Downs",
                                   "Red Zone Efficiency",
                                   "Adjusted Third Down Rate")
weather.stats <- ggplot(data=dome.weather, aes(x=factor(cond), y=value,
                        fill=factor(cond))) +
  geom_bar(stat="identity", position="dodge") +
  facet_wrap(~variable, nrow=3, scales="free") +
  scale_fill_brewer(name="Condition", palette="RdYlBu") +
  xlab("") +
  ylab("") +
  theme_bw() +
  ggtitle("Dome Team Statistics in Different Conditions")
weather.stats
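The two derived statistics in the pipeline are easy to misread, so here is a worked example on a single made-up stat line. The numbers are hypothetical; the column names follow the ones used in the `mutate()` call.

```r
# Hypothetical single-game line (made-up numbers).
s3a <- 4; s3c <- 3   # short third downs: attempts, conversions
l3a <- 8; l3c <- 2   # long third downs: attempts, conversions
rza <- 0; rzc <- 0   # no red zone trips in this game

# Plain third-down rate treats every conversion alike.
plain <- (s3c + l3c) / (s3a + l3a)

# Adjusted rate counts each long conversion as 1.5 conversions.
a3e <- (s3c + 1.5 * l3c) / (s3a + l3a)

# 0 / 0 is NaN in R, which is why the pipeline replaces NaN with 0.
rze <- rzc / rza
rze.clean <- ifelse(is.nan(rze), 0, rze)

print(c(plain = plain, adjusted = a3e, rze = rze, rze.clean = rze.clean))
```

Because long conversions are weighted by 1.5, the adjusted figure is an index rather than a true proportion and can exceed 1 (for example, two long conversions on two long attempts gives 1.5). Coding a game with no red zone trips as 0 efficiency also pulls the group mean down, which is a modelling choice rather than a necessity.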
For completeness, we can create the same visualization for outdoor teams playing in various weather conditions.
Note the scale on the y-axes. In contrast to dome teams, outdoor teams do not show as substantial a drop-off in statistics in inclement weather; in fact, outdoor teams rush the ball better in snow and rain, for instance.
dome.temp <- inner_join(dome.at.outdoor, team, "gid") %>%
  mutate(tot.yds = ry + py) %>%
  select(temp, tot.yds) %>%
  filter(temp < 32)
outdoor.temp <- inner_join(outdoor.away, team, "gid") %>%
  mutate(tot.yds = ry + py) %>%
  select(temp, tot.yds) %>%
  filter(temp < 32)
# Melt for ggplot2.
yds.temp <- melt(list(dome.temp, outdoor.temp),
                 id.vars = "temp")
# Plot.
temp.smooth <- ggplot(data=yds.temp, aes(x=temp, y=value,
                      color=factor(L1), group=factor(L1))) +
  geom_smooth(na.rm=TRUE, alpha=.1) +
  scale_color_manual(name="", breaks=c(1, 2),
                     labels=c("Dome Teams",
                              "Outdoor Stadium Teams"),
                     values=c('blue', 'orange')) +
  theme_bw() +
  xlab("Temperature") +
  ylab("Total Yards") +
  ggtitle("Yardage in Low Temperatures") +
  xlim(10, 32)
temp.smooth