# The completeness of linear regression models is an often-discussed issue. Leaving out relevant variables can make the coefficient estimates inconsistent.

But why on earth?!

Assume a complete linear model of the form:

z = a + bx + cy + ε.

Here z is the dependent variable, x and y are the independent variables, and ε is the error term.

Now we drop y to check which terms are affected. By removing one dimension we reduce the regression plane to a regression line. In the initial three-dimensional space, this line (the incomplete model) is located at the center of y, more precisely at ȳ, the mean of y. Leaving out y therefore changes two terms: the intercept "a" and the error term ε.

In the initial estimated model (ignoring ε), "a" is the value of z at x = 0 and y = 0. To obtain the new intercept (α), "a" must be shifted from y = 0 to y = ȳ:

α = a + cȳ.

The explanatory contribution of y no longer enters the model directly, so the error term has to absorb it. This leads to a larger error term (u):

u = ε + c(y - ȳ).

So, the incomplete model

z = α + bx + u

can be written as

z = a + cȳ + bx + ε + c(y - ȳ)

Expanding the parentheses recovers the initial model for z.
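The decomposition above can be checked numerically. The following is a minimal sketch, assuming simulated data: the parameter values, sample size, and distributions are hypothetical choices for illustration and are not taken from the article. It builds α and u exactly as defined above and confirms that z = α + bx + u reproduces the complete model, and that u has a larger variance than ε.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical "true" parameters of the complete model z = a + b*x + c*y + eps
a, b, c = 1.0, 2.0, 3.0

x = rng.normal(size=n)
y = rng.normal(loc=5.0, size=n)   # drawn independently of x in this sketch
eps = rng.normal(size=n)
z = a + b * x + c * y + eps

y_bar = y.mean()
alpha = a + c * y_bar             # new intercept: "a" shifted from y = 0 to y = y_bar
u = eps + c * (y - y_bar)         # new error term: eps plus the centered part of c*y

# The incomplete model with alpha and u reproduces z exactly
identity_holds = np.allclose(z, alpha + b * x + u)

# The error term grows because it now also carries the variation of y
var_eps = eps.var()
var_u = u.var()

print("identity holds:", identity_holds)
print("var(eps):", var_eps, "var(u):", var_u)
```

Because ε and y are generated independently here, the variance of u is roughly the variance of ε plus c² times the variance of y.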

Now assume there is a correlation between x and y:

For the initial (complete) model, this is not a problem for consistency, although multicollinearity can inflate the variance of the estimates. In the incomplete model, however, the independent variable x is correlated with the error term u, because u contains c(y - ȳ). This can make the estimates inconsistent.
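A small simulation makes the effect visible. This is only a sketch under stated assumptions: the parameter values, the way y is generated from x, and the use of ordinary least squares via `np.linalg.lstsq` are illustrative choices, not part of the original article. It fits both the complete and the incomplete model and compares the slope on x. In-sample, the slope of the incomplete model equals the complete-model slope plus the estimated c times the slope of a regression of y on x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical "true" parameters of the complete model
a, b, c = 1.0, 2.0, 3.0

x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)   # y is correlated with x by construction
eps = rng.normal(size=n)
z = a + b * x + c * y + eps

# Complete model: regress z on a constant, x and y
X_full = np.column_stack([np.ones(n), x, y])
a_hat, b_hat, c_hat = np.linalg.lstsq(X_full, z, rcond=None)[0]

# Incomplete model: regress z on a constant and x only
X_short = np.column_stack([np.ones(n), x])
alpha_hat, b_short = np.linalg.lstsq(X_short, z, rcond=None)[0]

# Slope of y on x (sample covariance over sample variance, both with ddof=1)
delta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
predicted_short_slope = b_hat + c_hat * delta

# Correlation between x and the error term u = eps + c*(y - y_bar)
u = eps + c * (y - y.mean())
corr_xu = np.corrcoef(x, u)[0, 1]

print("slope on x, complete model:  ", b_hat)
print("slope on x, incomplete model:", b_short)
print("b_hat + c_hat * delta:       ", predicted_short_slope)
print("corr(x, u):                  ", corr_xu)
```

In this setup the complete model recovers a slope close to b, while the incomplete model attributes part of the effect of y to x.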

Thus, if there is no correlation between the omitted variable(s) and the variable(s) included in the model, consistency is not a problem, unless the endogeneity comes from measurement error or reverse causality. But that is another story…
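To illustrate how the problem depends on the correlation, the sketch below repeats the incomplete regression for several correlation levels between x and y. The helper `short_slope_bias`, its default parameters, and the chosen correlation values are hypothetical and serve only as an illustration. With x and y both scaled to unit variance, the deviation of the estimated slope from b should be close to c times the correlation, and close to zero when the correlation is zero.

```python
import numpy as np

def short_slope_bias(rho, n=100_000, b=2.0, c=3.0, seed=0):
    """Deviation of the slope on x from b when y is omitted and corr(x, y) = rho."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # y has unit variance and population correlation rho with x
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    z = 1.0 + b * x + c * y + rng.normal(size=n)
    # np.polyfit returns coefficients from highest degree down, so [0] is the slope
    slope = np.polyfit(x, z, deg=1)[0]
    return slope - b

biases = {rho: short_slope_bias(rho) for rho in (0.0, 0.3, 0.6, 0.9)}
for rho, bias in biases.items():
    print(f"rho = {rho:.1f}  deviation of slope from b = {bias:+.3f}")
```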
