In the last article of this series, we discussed the story of Fernando, a data scientist who wants to buy a car. He built a simple linear regression model to estimate the price of the car.
The regression model created by Fernando predicts price based on engine size: one dependent variable predicted using one independent variable.
The simple linear regression model was formulated as:
price = β0 + β1 x engine size
The statistical package computed the parameters, and the linear equation was estimated as:
price = -6870.1 + 156.9 x engine size
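As a quick sanity check, the fitted line can be evaluated directly. A minimal sketch; the engine size of 130 below is a made-up value for illustration:

```python
def predict_price(engine_size):
    """Price estimate from the fitted simple linear regression line."""
    return -6870.1 + 156.9 * engine_size

# Hypothetical engine size of 130 (same units as the dataset)
print(predict_price(130))  # ≈ 13526.9
```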
The model was evaluated on two fronts:
Recall that the metric R-squared explains the fraction of the variance in the actual values that is captured by the model's predictions, relative to simply predicting the mean. This value lies between 0 and 1; the higher it is, the more of the variance the model can explain. The R-squared for the model created by Fernando is 0.7503, i.e. 75.03% on the training set. It means that the model can explain more than 75% of the variation in price.
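R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with toy numbers (not Fernando's data):

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

# A perfect fit explains all the variance
print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0
# Predicting the mean explains none of it
print(r_squared([1, 2, 3], [2, 2, 2]))  # 0.0
```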
However, Fernando wants to make it better.
Fernando decides to enhance the model by feeding it more input data i.e. more independent variables. He has now entered the world of the multivariate regression model.
Linear regression models provide a simple approach towards supervised learning. They are simple yet effective.
Recall that linear implies the following: arranged in or extending along a straight or nearly straight line. Linear suggests that the relationship between the dependent and independent variables can be expressed as a straight line.
The equation of the line is y = mx + c. One dimension is y-axis, another dimension is x-axis. It can be plotted in a two-dimensional plane. It looks something like this:
The generalization of this relationship can be expressed as:
y = f(x).
It doesn’t mean anything fancy. All it means is:
Let us define y as a function of x. i.e. define the dependent variable as a function of the independent variable.
What if the dependent variable needs to be expressed in terms of more than one independent variable? The generalized function becomes:
y = f(x, z) i.e. express y as some function/combination of x and z.
There are three dimensions now y-axis, x-axis and z-axis. It can be plotted as:
Now we have more than one independent dimension (x and z), and we want to express y as a combination of x and z.
For a simple regression linear model a straight line expresses y as a function of x. Now we have an additional dimension (z). What will happen if an additional dimension is added to a line? It becomes a plane.
The plane is the function that expresses y as a function of x and z. Extrapolating the linear regression equation, it can now be expressed as:
y = m1.x + m2.z + c
This is the genesis of the multivariate linear regression model. There are more than one input variables used to estimate the target. A model with two input variables can be expressed as:
y = β0 + β1.x1 + β2.x2
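To see that such a model can recover β0, β1, and β2, a least-squares fit on synthetic data can be sketched. The coefficients 3, 2, and -1 below are illustrative, not Fernando's:

```python
import numpy as np

# Synthetic, noiseless data generated from y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 3 + 2 * x1 - 1 * x2

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))  # recovers [3., 2., -1.]
```

Geometrically, the fitted coefficients describe the plane discussed above.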
Let us take it a step further. What if we had three input variables? Human visualization capabilities are limited here: we can only visualize three dimensions. In the machine learning world, there can be many more. A model with three input variables can be expressed as:
y = β0 + β1.x1 + β2.x2 + β3.x3
A generalized equation for the multivariate regression model can be:
y = β0 + β1.x1 + β2.x2 +….. + βn.xn
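The general equation is just an intercept plus a weighted sum of the inputs, for any number of variables. A minimal sketch with illustrative coefficients (not from Fernando's model):

```python
def predict(beta0, betas, xs):
    """y = β0 + β1.x1 + ... + βn.xn for any number of inputs."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))

# Two inputs: 1 + 2*10 + 3*20
print(predict(1.0, [2.0, 3.0], [10, 20]))  # 81.0
```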
Now that we are familiar with the concept of a multivariate linear regression model, let us get back to Fernando.
Fernando reaches out to his friend for more data. He asks him to provide more data on other characteristics of the cars.
The following were the data points he already had:
He gets additional data points. They are:
Fernando now wants to build a model that predicts the price based on the additional data points.
The multivariate regression model that he formulates is:
Estimate price as a function of engine size, horsepower, peak RPM, length, width, and height.
=> price = f(engine size, horsepower, peak RPM, length, width, height)
=> price = β0 + β1.engine size + β2.horsepower + β3.peak RPM + β4.length + β5.width + β6.height
Fernando inputs these data into his statistical package. The package computes the parameters. The output is the following:
The multivariate linear regression model provides the following equation for the price estimation.
price = -85090 + 102.85 * engineSize + 43.79 * horsepower + 1.52 * peakRPM - 37.91 * length + 908.12 * width + 364.33 * height
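Plugging a car into the fitted equation gives its estimated price. A minimal sketch; all the feature values below are made up for illustration:

```python
def estimate_price(engine_size, horsepower, peak_rpm, length, width, height):
    """Fernando's fitted multivariate regression equation."""
    return (-85090
            + 102.85 * engine_size
            + 43.79 * horsepower
            + 1.52 * peak_rpm
            - 37.91 * length
            + 908.12 * width
            + 364.33 * height)

# Hypothetical car: engine size 130, 111 hp, 5000 RPM, 168.8 x 64.1 x 48.8
print(round(estimate_price(130, 111, 5000, 168.8, 64.1, 48.8), 2))
```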
The interpretation of the multivariate model shows the impact of each independent variable on the dependent variable (the target).
Remember, the equation provides an estimation of the average value of price. Each coefficient is interpreted with all other predictors held constant.
Let us now interpret the coefficients.
The model is built. It is interpreted. Are all the coefficients important? Which ones are more significant? How much variation does the model explain?
The statistical package provides the metrics to evaluate the model. Let us evaluate the model now.
Recall the discussion on the definition of t-stat, p-value, and coefficient of determination. Those concepts apply in multivariate regression models too. The evaluation of the model is as follows:
Recall the discussion of how R-squared helps to explain the variation in the model. When more variables are added to the model, R-squared never decreases; it can only increase (or stay the same). However, there has to be a balance. Adjusted R-squared strives to keep that balance. It is a modified version of R-squared that accounts for the number of predictors in the model. Adjusted R-squared compensates for the addition of variables and only increases if the new term improves the model more than would be expected by chance.
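Adjusted R-squared can be computed from R-squared, the sample size n, and the number of predictors p. A minimal sketch; the R-squared of 0.82 and sample size of 205 below are assumptions for illustration, not values from Fernando's output:

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same R² is penalized more heavily as predictors are added
print(adjusted_r_squared(0.82, 205, 6))   # ≈ 0.8145
print(adjusted_r_squared(0.82, 205, 20))  # lower, despite identical R²
```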
Based on these evaluations, Fernando concludes the following:
Fernando has a better model now. However, he is perplexed. He knows that the length of the car doesn't impact the price.
How can one select the best set of variables for model building? Is there any method to choose the best subsets of variables?
In the next part of this series, we will discuss variable selection methods.
This article was first published here.