In my previous article, we analyzed the COVID-19 data of Turkey and selected the cubic model for predicting the spread of disease. In this article, we will show in detail why we selected the cubic model for prediction and see whether our decision was right or not.

When we analyze the regression trend models we should consider overfitting and underfitting situations; **underfitting** indicates high bias and low variance while **overfitting** indicates low bias and high variance.

**Bias** is the difference between the expected value of fitted values and observed values:

The **variance** of fitted values is the expected value of squared deviation from the mean of fitted values:

The **adjusted coefficient of determination** is used in the different degrees of polynomial trend regression models comparing. In the below formula **p** denotes the number of explanatory terms and **n** denotes the number of observations.

**SSE** is the residual sum of squares:

**SST** is the total sum of squares:

When we examine the above formulas, we can notice the similarity between SSE and bias. We can easily say that if bias decreases, SSE will decrease and the adjusted coefficient of determination will increase. So we will use the adjusted R-squared instead of bias to balance with variance and find the optimal degree of the polynomial regression.

The dataset and data frame(tur) we are going to use can be found from the previous article. First of all, we will create all the polynomial regression models which we're going to compare.

a <- 1:15 models <- list()

for (i in seq_along(a)){

models[[i]] <- assign(paste0('model_', i),lm(new_cases ~ poly(index, i), data=tur))

}

The variances of fitted values of all the degrees of polynomial regression models:

variance <- c()

for (i in seq_along(a)) {

variance[i] <- mean((models[[i]][["fitted.values"]]-mean(models[[i]][["fitted.values"]]))^2)

}

To create an adjusted R-squared object we first create a summary object of the trend regression models; because the adj.r.squared feature is calculated in summary function.

models_summ <- list()

adj_R_squared <- c()

for (i in seq_along(a)) {

models_summ[[i]] <- summary(models[[i]])

adj_R_squared[i] <- (models_summ[[i]][["adj.r.squared"]])

}

Before analyzing numeric results of variance and adjusted R-squared we will show all the trend regression lines in separate plots for comparing models.

# Facetting models by degree for finding the best fit

library(ggplot2)

dat <- do.call(rbind, lapply(1:15, function(d) {

x <- tur$index

preds <- predict(lm(new_cases ~ poly(x, d), data=tur), newdata=data.frame(x=x))

data.frame(cbind(y=preds, x=x, degree=d)) }))

ggplot(dat, aes(x,y)) +

geom_point(data=tur, aes(index, new_cases)) +

geom_line(color="steelblue", lwd=1.1)+

facet_wrap(~ degree,nrow = 3)

When we examine the above plots, we should pay attention to the curved of the tails; because it indicates overfitting which shows extreme sensitivity to the observed data points. In light of this approach, the second and third-degree of models appear to be more convenient to the data.

Let's examine the adjusted R-squared-variance tradeoff on the plot we created below.

library(gridExtra)

plot_variance <- ggplot(df_tradeoff,aes(degree,variance))+

geom_line(size=2,colour="orange")+

scale_x_continuous(breaks = 1:15)+

theme_bw(base_line_size = 2)

plot_adj.R.squared <- ggplot(df_tradeoff,aes(degree,adj_R_squared))+

geom_line(size=2,colour="steelblue")+

labs(y="Adjusted R-Squared")+

scale_x_continuous(breaks = 1:15)+

theme_bw(base_line_size = 2)

grid.arrange(plot_variance,plot_adj.R.squared,ncol=1)

As we can see above, adjusted R-squared and variance have very similar trend lines. The more adjusted R-squared means the more complexity and low bias, but we have to take into account the variance; otherwise, we fall into the overfitting trap. So we need to look for low bias (or high adjusted R-squared) and low variance as much as possible for optimal selection.

When we examine the variance plot, we can see that there is not much difference between second and third-degree; but after the third degree, there seems to be a more steep increase; it's some kind of breaking. This might lead to overfitting. Thus the most reasonable option seems to be the 3rd degree. This approach is not a scientific fact but could be used for an optimal solution.

Original post can be viewed here

© 2020 TechTarget, Inc. Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central