Data Science Simplified Part 4: Simple Linear Regression Models

In the previous posts of this series, we discussed the concepts of statistical learning and hypothesis testing. In this article, we dive into linear regression models.
Before we dive in, let us recall some important aspects of statistical learning.

Independent and Dependent variables:

In the context of Statistical learning, there are two types of data:

Independent variables: Data that can be controlled directly.
Dependent variables: Data that cannot be controlled directly.

The data that cannot be controlled, i.e. the dependent variables, need to be predicted or estimated.

Model:

A model is a transformation engine that helps us to express dependent variables as a function of independent variables.

Parameters:

Parameters are ingredients added to the model for estimating the output.

Concept

Linear regression models provide a simple approach towards supervised learning. They are simple yet effective.

Wait, what do we mean by linear?

Linear implies the following: arranged in or extending along a straight or nearly straight line. Here, linear means that the relationship between the dependent and independent variables can be expressed as a straight line.

Recall the geometry lesson from high school. What is the equation of a line?

y = mx + c

Linear regression is nothing but a manifestation of this simple equation.

y is the dependent variable i.e. the variable that needs to be estimated and predicted.
x is the independent variable i.e. the variable that is controllable. It is the input.
m is the slope. It determines how steep the line is. It is the parameter denoted as β.
c is the intercept. A constant that determines the value of y when x is 0.

George Box, a famous British statistician, once said:

“All models are wrong; some are useful.”

Linear regression models are not perfect. They try to approximate the relationship between dependent and independent variables with a straight line. Approximation leads to errors. Some errors can be reduced. Others are inherent in the nature of the problem and cannot be eliminated. These are called irreducible errors: the noise term in the true relationship that no model can reduce.

The same equation of a line can be re-written as:

y = β0 + β1x + ε

β0 and β1 are two unknown constants that represent the intercept and slope. They are the parameters.

ε is the error term.
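
To make the role of ε concrete, here is a minimal sketch that generates data from a straight line plus random noise. The parameter values and the noise level are made up for illustration; they are not taken from any data in this article.

```python
import random

def simulate_line(xs, beta0, beta1, noise_sd, seed=0):
    """Return y = beta0 + beta1 * x + eps for each x, with eps ~ Normal(0, noise_sd)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [beta0 + beta1 * x + rng.gauss(0.0, noise_sd) for x in xs]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical parameters, chosen only for illustration.
noisy = simulate_line(xs, beta0=2.0, beta1=3.0, noise_sd=1.0)
exact = simulate_line(xs, beta0=2.0, beta1=3.0, noise_sd=0.0)

for x, y_line, y_obs in zip(xs, exact, noisy):
    # y_obs - y_line is the error term for this observation.
    print(x, y_line, round(y_obs, 2), round(y_obs - y_line, 2))
```

Even if β0 and β1 were known exactly, the observed values would still scatter around the line. That scatter is the irreducible error.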

Formulation

Let us go through an example to explain the terms and workings of a Linear regression model.

Fernando is a Data Scientist. He wants to buy a car. He wants to estimate or predict the car price that he will have to pay. He has a friend at a car dealership company. He asks for prices for various other cars along with a few characteristics of the car. His friend provides him with some information.

The following are the data provided to him:

make: make of the car.
fuelType: type of fuel used by the car.
nDoor: number of doors.
engineSize: size of the engine of the car.
price: the price of the car.

First, Fernando wants to evaluate if indeed he can predict car price based on engine size. The first set of analysis seeks the answers to the following questions:

Is the price of a car related to engine size?
How strong is the relationship?
Is the relationship linear?
Can we predict/estimate car price based on engine size?

Fernando does a correlation analysis. Correlation is a measure of how strongly two variables are related. It is measured by a metric called the correlation coefficient. Its value is between -1 and 1.

If the correlation coefficient is a large positive number (> 0.7), it implies that as one variable increases, the other variable increases as well. A large negative number indicates that as one variable increases, the other variable decreases.
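
As a rough sketch, the Pearson correlation coefficient can be computed by hand as below. The article does not say which tool Fernando used, so this is only one way to do it, and the engine sizes and prices are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical engine sizes and prices, invented for illustration.
engine_size = [90, 110, 130, 150, 180]
price = [7000, 9500, 13000, 16500, 22000]
print(round(pearson_r(engine_size, price), 3))
```

The sign of the result gives the direction of the relationship and its magnitude gives the strength.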

He then plots the relationship between price and engine size.

He splits the data into training and test sets. 75% of the data is used for training. The remainder is used for testing.

He builds a linear regression model. He uses a statistical package to create the model. The model creates a linear equation that expresses price of the car as a function of engine size.

Following are the answers to the questions:

Is the price of a car related to engine size?

Yes, there is a relationship.

How strong is the relationship?

The correlation coefficient is 0.872 => There is a strong relationship.

Is the relationship linear?

A straight line can fit => A decent prediction of price can be made using engine size.

Can we predict/estimate the car price based on engine size?

Yes, car price can be estimated based on engine size.

Fernando now wants to build a linear regression model that will estimate the price of the car based on engine size. Applying the equation to the car price problem, Fernando formulates the following equation for price prediction.

price = β0 + β1 x engine size

Model Building and Interpretation

Model

Recall the earlier discussion on how the data needs to be split into training and testing sets. The training data is used to learn the relationship and create the model. The testing data is used to evaluate the model's performance.

Fernando splits the data into training and test sets. 75% of the data is used for training. The remainder is used for testing. He builds a linear regression model using a statistical package. The model produces a linear equation that expresses the price of the car as a function of engine size.
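
A 75/25 split can be sketched as follows. The function name, the seed, and the stand-in rows are assumptions made for this example; statistical packages usually ship their own helper for this.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle the rows, then cut them into a training part and a test part."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(rows)       # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 car records
train, test = train_test_split(rows)
print(len(train), len(test))  # prints: 75 25
```

Shuffling before cutting matters: if the records were sorted by price, an unshuffled split would train on cheap cars and test on expensive ones.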

The model estimates the parameters:

β0 is estimated as -6870.1
β1 is estimated as 156.9

The linear equation is estimated as:

price = -6870.1 + 156.9 x engine size
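
The article does not say how the package arrives at these estimates. A common method is ordinary least squares, which for a single input has a closed-form solution. The sketch below assumes that method and uses a tiny made-up dataset, not Fernando's data.

```python
def fit_simple_ols(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: forces the fitted line through the point (mean_x, mean_y).
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

# Made-up points lying exactly on y = 2 + 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
beta0, beta1 = fit_simple_ols(xs, ys)
print(beta0, beta1)
```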

Interpretation

The model provides the equation for predicting the average car price given a specific engine size. This equation means the following:

One unit increase in engine size will increase the average price of the car by 156.9 units.
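
Plugging numbers into the estimated equation shows this interpretation directly. The coefficients are the ones reported above; the engine size of 130 is a hypothetical value picked for the example.

```python
BETA0 = -6870.1  # estimated intercept from the model above
BETA1 = 156.9    # estimated slope from the model above

def predict_price(engine_size):
    """Average price predicted by the fitted line for a given engine size."""
    return BETA0 + BETA1 * engine_size

p130 = predict_price(130)  # hypothetical engine size
p131 = predict_price(131)  # one unit larger
print(round(p130, 1), round(p131 - p130, 1))
```

The difference between the two predictions equals the slope, whatever engine size is chosen as the starting point.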

Evaluation

The model is built. The robustness of the model needs to be evaluated.

How can we be sure that the model will be able to predict the price satisfactorily?

This evaluation is done in two parts. First, test to establish the robustness of the model. Second, test to evaluate the accuracy of the model.

Fernando first evaluates the model on the training data. He gets the following statistics.

There are a lot of statistics in there. Let us focus on key ones (marked in red). Recall the discussion on hypothesis testing. The robustness of the model is evaluated using hypothesis testing.

H0 and Ha need to be defined. They are defined as follows:

H0 (NULL hypothesis): There is no relationship between x and y i.e. there is no relationship between price and engine size.
Ha (Alternate hypothesis): There is some relationship between x and y i.e. there is a relationship between price and engine size.

β1: The value of β1 determines the relationship between price and engine size. If β1 = 0 then there is no relationship. In this case, the estimate of β1 is positive. It suggests that there is some relationship between price and engine size.

t-stat: The t-stat value is how many standard errors the coefficient estimate (β1) is away from zero. The further it is from zero, the stronger the evidence of a relationship between price and engine size. In this case, the t-stat is 21.09. It is far enough from zero, so the coefficient is significant.

p-value: The p-value is a probability value. It indicates the chance of seeing a t-statistic at least this extreme under the assumption that the NULL hypothesis is true. If the p-value is small, e.g. < 0.0001, it is very unlikely that the observed relationship arose by chance alone. In this case, the p-value is small. It means that the relationship between price and engine size is unlikely to be due to chance.

With these metrics, we can safely reject the NULL hypothesis in favor of the alternate hypothesis. There is a robust relationship between price and engine size.
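
The t-stat for the slope can be sketched as the estimate divided by its standard error. The data below are hypothetical, and the p-value uses a normal approximation purely to stay within the Python standard library; a statistical package would use the t-distribution with n - 2 degrees of freedom, which gives larger p-values for small samples like this one.

```python
import math
from statistics import NormalDist

def slope_t_stat(xs, ys):
    """Return (beta1, standard error of beta1, t-stat) for a simple regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    # Residual variance uses n - 2 because two parameters were estimated.
    se_beta1 = math.sqrt(rss / (n - 2) / sxx)
    return beta1, se_beta1, beta1 / se_beta1

def approx_two_sided_p(t_stat):
    """Two-sided p-value under a normal approximation (an assumption of this sketch)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

# Hypothetical engine sizes and prices, invented for illustration.
xs = [90, 100, 110, 120, 130, 140, 150, 160]
ys = [7200, 8900, 10100, 12300, 13400, 15300, 16400, 18500]
beta1, se, t = slope_t_stat(xs, ys)
print(round(beta1, 2), round(se, 2), round(t, 2), approx_two_sided_p(t))
```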

The relationship is established. How about accuracy? How accurate is the model? To get a feel for the accuracy of the model, a metric named R-squared or coefficient of determination is important.

R-squared or Coefficient of determination: To understand this metric, let us break it down into its components.

Error (e) is the difference between the actual y and the predicted y. The predicted y is denoted as ŷ. This error is evaluated for each observation. These errors are also called residuals.
Then all the residual values are squared and added. This term is called the Residual Sum of Squares (RSS). The lower the RSS, the better.
There is another part of the equation of R-squared. To get it, first the mean value of the actual target is computed, i.e. the average price of the cars. Then the differences between the mean value and the actual values are calculated. These differences are then squared and added. This is the Total Sum of Squares (TSS).
R-squared, a.k.a. the coefficient of determination, is computed as 1 - RSS/TSS. It is the fraction of the variation in the actual values that the model explains, compared with simply predicting the mean of the actual values. For a model like this one evaluated on its training data, the value is between 0 and 1. The higher it is, the better the model explains the variance.
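
The steps above translate almost line for line into code. This is a minimal sketch with toy numbers chosen so that the two extreme cases are easy to check by hand.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_actual = sum(actual) / len(actual)
    # RSS: squared residuals between actual and predicted values.
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    # TSS: squared differences between actual values and their mean.
    tss = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - rss / tss

actual = [10.0, 20.0, 30.0]     # toy values, not car prices
perfect = [10.0, 20.0, 30.0]    # predictions that match exactly
mean_only = [20.0, 20.0, 20.0]  # always predicting the mean
print(r_squared(actual, perfect), r_squared(actual, mean_only))  # prints: 1.0 0.0
```

A model that predicts perfectly scores 1, and a model that does no better than the mean scores 0.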

Let us look at an example.

In the example above, RSS is computed based on the predicted price for three cars. The RSS value is 41,450,201.63. The mean value of the actual price is 11,021. TSS is calculated as 44,444,546. R-squared is computed as 6.737%. For these three specific data points, the model is only able to explain about 6.74% of the variation. Not good enough!!
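
The arithmetic for this three-car example can be checked directly from the RSS and TSS values quoted above:

```python
rss = 41450201.63  # residual sum of squares quoted in the text
tss = 44444546.0   # total sum of squares quoted in the text
r2 = 1.0 - rss / tss
print(f"{r2:.2%}")
```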

However, for Fernando’s model, it is a different story. The R-squared for the training set is 0.7503, i.e. 75.03%. It means that the model can explain about 75% of the variation.

Conclusion

Voila!! Fernando has a good model now. It performs satisfactorily on the training data. However, about 25% of the variation remains unexplained. There is room for improvement. How about adding more independent variables for predicting the price? When more than one independent variable is used to predict a dependent variable, a multivariate regression model is created.

The next installment of this series will delve more into the multivariate regression model. Stay tuned.

This post was first published here.