In the last few blog posts of this series, we discussed simple linear regression model. We discussed multivariate regression model and methods for selecting the right model.
Fernando has now created a better model.
price = -55089.98 + 87.34 engineSize + 60.93 horse power + 770.42 width
Fernando contemplates the following:
In this article will address that question. This article will elaborate about Log-Log regression models.
To explain the concept of the log-log regression model, we need to take two steps back. First let us understand the concept of derivatives, logarithms, exponential. Then we need understand the concept of elasticity.
Let us go back to high school math. Meet derivatives. One of most fascinating concepts taught in high school math and physics.
Derivate is a way to represent change – the amount by which a function is changing at one given point.
A variable y is a function of x. Define y as:
y = f(x)
dy/dx = df(x)/dx = f’(x)
This means the following:
Isn’t it that Fernando wants? He wants to know the change in price (y) with respect to changes in other variables (cityMpg and highwayMpg).
Recall that the general form of a multivariate regression model is the following:
y = β0 + β1.x1 + β2.x2 + .... + βn.xn + ε
Let us say that Fernando builds the following model:
price = β0 + β1 . engine size i.e. expressing price as a function of engine size.
Alas, it is not that simple. The linear regression model assumes a linear relationship. The Linear relationship is defined as:
y = mx + c
If the derivative of y over x is computed, it gives the following:
dy/dx = m . dx/dx + dc/dx
The equation now becomes:
dy/dx = m
Now let us look at exponential. This character is again a common character in high school math. An exponential is a function that has two operators. A base (b) and an exponent (n). It is defined as bn. it takes the form:
f(x) = bx
The base can be any positive number. Again Euler’s number (e) is a common base used in statistics.
Geometrically, an exponential relationship has following structure:
The logarithm is an interesting character. Let us only understand its personality applicable for regression models. The fundamental property of a logarithm is its base. The typical base of the logarithm is 2, 10 or e.
Let us take an example:
The logarithm of 8 with base 2 is 3
There is another common base for logarithms. It is called as “Euler’s number (e).” Its approximate value is 2.71828. It is widely used in statistics. The logarithm with base e is called as Natural Logarithm.
It also has interesting transformative capabilities. It transforms an exponential relation into a linear relation. Let us look at an example:
The diagram below, shows an exponential relationship between y and x:
If logarithms are applied to both x and y, the relationship between log(x) and log(y) is linear. It looks something like this:
Elasticity is the measurement of how responsive an economic variable is to a change in another.
Say that we have a function: Q = f(P) then the elasticity of Q is defined as:
E = P/Q x dQ/dP
Now let us bring these three mathematical characters together. Derivatives, Logarithms and Exponential. Their rules of engagement are as follows:
Let us take an example. Imagine a function y expressed as follows:
So does it mean for linear regression models? Can we do mathematical juggling to make use of derivatives, logarithms, and exponents? Can we rewrite the linear model equation to find the rate of change of y wrt change in x?
First, let us define relationship between y and x as an exponential relationship
Now that we understand the concept, let us see how Fernando build a model. He builds the following model:
log(price) = β0 + β1. log(engine size) + β2. log(horse power) + β3. log(width)
He wants to estimate the change in car price as a function of the change in engine size, horse power, and width.
Fernando trains the model in his statistical package and gets the following coefficients.
The equation of the model is:
log(price) = -21.6672 + 0.4702.log(engineSize) + 0.4621.log(horsePower) + 6.3564 .log(width)
Following is the interpretation of the model:
Fernando has now built the log-log regression model. He evaluates the performance of the model on both training and test data.
Recall, that he had split the data into the training and the testing set. The training data is used to create the model. The testing data is the unseen data. Performance on testing data is the real test.
The transformation is treating the log(price) as an exponent to the base e.
e^log(price) = price
The last few posts have been quite a journey. Statistical learning laid the foundations. Hypothesis testing discussed the concept of NULL and alternate hypothesis. Simple linear regression models made regression simple. We then progressed into the world of multivariate regression models. Then discussed model selection methods. In this post, we discussed the log-log regression models.
So far the regression models built had only numeric independent variables. The next post we will deal with concepts of interactions and qualitative variables.