In the previous posts of this series, we discussed the concepts of statistical learning and hypothesis testing. In this article, we dive into linear regression models.
Before we dive in, let us recall some important aspects of statistical learning.
Independent and Dependent variables:
In the context of statistical learning, there are two types of data:
Independent variables: data that can be controlled directly.
Dependent variables: data that cannot be controlled directly and therefore need to be predicted or estimated.
Model:
A model is a transformation engine that helps us to express dependent variables as a function of independent variables.
Parameters:
Parameters are ingredients added to the model for estimating the output.
Linear regression models provide a simple approach towards supervised learning. They are simple yet effective.
Wait, what do we mean by linear?
Linear means arranged in or extending along a straight or nearly straight line. Here, it suggests that the relationship between the dependent and independent variables can be expressed as a straight line.
Recall the geometry lesson from high school. What is the equation of a line?
y = mx + c
Linear regression is nothing but a manifestation of this simple equation.
George Box, a famous British statistician, once said:
“All models are wrong; some are useful.”
Linear regression models are not perfect. They try to approximate the relationship between dependent and independent variables with a straight line. Approximation leads to errors. Some errors can be reduced. Others are inherent in the nature of the problem and cannot be eliminated. This is called the irreducible error: the noise term in the true relationship that cannot be reduced by any model.
The same equation of a line can be rewritten as:
y = β0 + β1x + ε
β0 and β1 are two unknown constants that represent the intercept and slope. They are the parameters.
ε is the error term.
Let us go through an example to explain the terms and workings of a Linear regression model.
Fernando is a Data Scientist. He wants to buy a car. He wants to estimate or predict the car price that he will have to pay. He has a friend at a car dealership company. He asks for prices for various other cars along with a few characteristics of the car. His friend provides him with some information.
The following are the data provided to him:
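First, Fernando wants to evaluate whether he can indeed predict car price based on engine size. The first set of analyses seeks answers to the following questions:
Is there a relationship between price and engine size?
How strong is the relationship?
Can a straight line fit the relationship?
Can car price be estimated based on engine size?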
Fernando does a correlation analysis. Correlation is a measure of how strongly two variables are linearly related. It is measured by a metric called the correlation coefficient. Its value lies between -1 and 1.
If the correlation coefficient is a large (> 0.7) positive number, it implies that as one variable increases, the other variable increases as well. A large negative number (< -0.7) indicates that as one variable increases, the other variable decreases.
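As a rough illustration of the correlation coefficient described above, here is a minimal sketch of the Pearson formula in plain Python. The engine sizes and prices are made-up numbers for illustration only; they are not Fernando's data, and the function name `pearson_r` is just a label chosen for this sketch.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, invented for this example (not the article's data set)
engine_size = [90, 110, 130, 150, 180]
price = [7000, 9500, 13000, 16500, 21000]

r = pearson_r(engine_size, price)
print(f"correlation coefficient: {r:.3f}")
```

The result always falls between -1 and 1. A value close to 1 means the two variables rise together, and a value close to -1 means one falls as the other rises.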
He plots the relationship between price and engine size.
Following are the answers to the questions:
Yes, there is a relationship.
The correlation coefficient is 0.872 => There is a strong relationship.
A straight line can fit => A decent prediction of price can be made using engine size.
Yes, car price can be estimated based on engine size.
Fernando now wants to build a linear regression model that will estimate the price of the car based on engine size. Applying the equation to the car price problem, Fernando formulates the following equation for price prediction.
price = β0 + β1 x engine size
Recall the earlier discussion on how the data needs to be split into training and testing sets. The training data is used to learn about the data and to create the model. The testing data is used to evaluate the model performance.
Fernando splits the data into training and test sets. 75% of the data is used for training. The remaining 25% is used for testing. He builds a linear regression model using a statistical package. The model produces a linear equation that expresses the price of the car as a function of engine size.
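The article does not say which statistical package Fernando uses, so the following is only a sketch of the two steps in plain Python: a 75/25 split followed by an ordinary least squares fit of a single predictor. The data is synthetic and generated just for this example, and the helper names `train_test_split` and `fit_simple_ols` are assumptions of this sketch, not any package's API.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle the rows and split them into training and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def fit_simple_ols(rows):
    """Least squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    n = len(rows)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx                # slope
    beta0 = mean_y - beta1 * mean_x  # intercept
    return beta0, beta1


# Synthetic (engine size, price) pairs, invented for illustration only
rng = random.Random(0)
data = []
for _ in range(200):
    size = rng.uniform(60, 300)
    data.append((size, -7000 + 150 * size + rng.gauss(0, 3000)))

train, test = train_test_split(data)
beta0, beta1 = fit_simple_ols(train)
print(f"price = {beta0:.1f} + {beta1:.1f} x engine size")
```

Only the training rows are passed to the fitting function. The test rows are held back so the model can later be judged on data it has not seen.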
The model estimates the parameters as β0 = -6870.1 and β1 = 156.9.
The linear equation is estimated as:
price = -6870.1 + 156.9 x engine size
The model provides the equation for predicting the average car price given a specific engine size. This equation means the following:
A one-unit increase in engine size increases the average price of the car by 156.9 units.
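To make the interpretation of the slope concrete, the estimated equation can be written as a small function. The coefficients are the ones reported above. The engine sizes 130 and 131 are arbitrary values picked for this illustration.

```python
# Coefficients taken from the estimated equation in the article
BETA0 = -6870.1  # intercept
BETA1 = 156.9    # slope: change in average price per unit of engine size


def predict_price(engine_size):
    """Average price predicted by the fitted line for a given engine size."""
    return BETA0 + BETA1 * engine_size


# Arbitrary example inputs, one unit apart
p130 = predict_price(130)
p131 = predict_price(131)
print(f"predicted price at 130: {p130:.1f}")
print(f"increase for one extra unit: {p131 - p130:.1f}")
```

The difference between the two predictions equals the slope, which is exactly what "one unit of engine size adds 156.9 units of price" means.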
The model is built. The robustness of the model needs to be evaluated.
How can we be sure that the model will be able to predict the price satisfactorily?
This evaluation is done in two parts. First, test to establish the robustness of the model. Second, test to evaluate the accuracy of the model.
Fernando first evaluates the model on the training data. He gets the following statistics.
There are a lot of statistics in there. Let us focus on key ones (marked in red). Recall the discussion on hypothesis testing. The robustness of the model is evaluated using hypothesis testing.
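H0 and Ha need to be defined. They are defined as follows:
H0 (NULL hypothesis): β1 = 0, i.e. there is no relationship between price and engine size.
Ha (alternate hypothesis): β1 ≠ 0, i.e. there is some relationship between price and engine size.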
β1: The value of β1 determines the relationship between price and engine size. If β1 = 0 then there is no relationship. In this case, β1 is positive. It implies that there is some relationship between price and engine size.
t-stat: The t-stat value is how many standard errors the coefficient estimate (β1) is away from zero. The further it is from zero, the stronger the evidence of a relationship between price and engine size. In this case, the t-stat is 21.09. It is far enough from zero, so the coefficient is significant.
p-value: The p-value is a probability value. It indicates the chance of seeing a t-statistic at least as extreme as the given one, under the assumption that the NULL hypothesis is true. If the p-value is small, e.g. < 0.0001, it is very unlikely that the observed relationship arose by chance when there is in fact no relationship. In this case, the p-value is small. It means that the relationship between price and engine size is not by chance.
With these metrics, we can safely reject the NULL hypothesis and accept the alternate hypothesis. There is a robust relationship between price and engine size.
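The t-stat and p-value normally come straight out of the statistical package, but the calculation behind them can be sketched by hand. The example below assumes the usual simple regression formulas: the standard error of the slope is sqrt(RSS / (n - 2) / Sxx) and the t-stat is the slope divided by that standard error. The p-value here uses a normal approximation, which is an assumption of this sketch and is only reasonable for large samples; a package would use the t-distribution with n - 2 degrees of freedom. The data points are invented.

```python
from math import erfc, sqrt


def slope_t_test(xs, ys):
    """Return (beta1, std_err, t_stat, p_approx) for the slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    # Residual sum of squares around the fitted line
    rss = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    # Standard error of the slope estimate
    std_err = sqrt(rss / (n - 2) / sxx)
    t_stat = beta1 / std_err
    # Two-sided p-value under a normal approximation (large-sample assumption)
    p_approx = erfc(abs(t_stat) / sqrt(2))
    return beta1, std_err, t_stat, p_approx


# Hypothetical engine sizes and prices, made up for this example
engine_size = [70, 90, 110, 130, 150, 180, 200, 250]
price = [5500, 7200, 10500, 13000, 16800, 20500, 25000, 31000]

beta1, std_err, t_stat, p_approx = slope_t_test(engine_size, price)
print(f"slope={beta1:.2f}, std err={std_err:.2f}, t={t_stat:.2f}, p~{p_approx:.2g}")
```

A large absolute t-stat pushes the approximate p-value towards zero, which is the situation in which the NULL hypothesis β1 = 0 is rejected.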
The relationship is established. How about accuracy? How accurate is the model? To get a feel for the accuracy of the model, a metric named R-squared or coefficient of determination is important.
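R-squared or coefficient of determination: To understand this metric, let us break it down into its components.
Total sum of squares (TSS): the sum of squared differences between each actual value and the mean of the actual values.
Residual sum of squares (RSS): the sum of squared differences between each actual value and its predicted value.
R-squared = 1 - RSS / TSS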
Let us look at an example.
In the example above, RSS is computed based on the predicted price for three cars. The RSS is 41,450,201.63. The mean of the actual prices is 11,021. TSS is calculated as 44,444,546. R-squared is computed as 1 - RSS/TSS = 6.737%. For these three specific data points, the model is only able to explain 6.737% of the variation. Not good enough!!
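The arithmetic in this example can be checked with a few lines of Python, using the standard definition R-squared = 1 - RSS / TSS. The RSS and TSS values are the ones quoted above. The helper `r_squared` is a hypothetical name for this sketch and shows how the same quantity would be computed from actual and predicted values.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_actual = sum(actual) / len(actual)
    # Total sum of squares: variation of the actual values around their mean
    tss = sum((y - mean_actual) ** 2 for y in actual)
    # Residual sum of squares: variation left unexplained by the predictions
    rss = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - rss / tss


# Values quoted in the three-car example
rss = 41450201.63
tss = 44444546
r2_example = 1 - rss / tss
print(f"R-squared for the three cars: {r2_example:.5f}")
```

An R-squared of 1 means the predictions match the actual values exactly, while a value near 0 means the model does little better than always predicting the mean.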
However, for Fernando’s model, it is a different story. The R-squared for the training set is 0.7503 i.e. 75.03%. It means that the model can explain about 75% of the variation in price.
Voila!! Fernando has a good model now. It performs satisfactorily on the training data. However, about 25% of the variation remains unexplained. There is room for improvement. How about adding more independent variables for predicting the price? When more than one independent variable is used to predict a dependent variable, a multivariate regression model is created.
The next installment of this series will delve more into the multivariate regression model. Stay tuned.
This post was first published here.
Comment
Simple and easy to understand.
Thanks!
In the first graph, you set price on the horizontal axis and set engine size on the vertical axis.
I think it is better to put the dependent variable on the vertical axis and the independent variable on the horizontal one in a scatter diagram.
Good presentation
© 2017 Data Science Central