Linear regression is one of the most widely used statistical models. If Y is a continuous variable (i.e. it can take decimal values) and is expected to have a linear relationship with a set of X variables, this relationship can be modeled as a linear regression. It is usually the **first** model we fit when we plan to forecast Y or to build hypotheses about the relationship between the Xs and Y.

The usual approach is to understand the theory through the principle of **minimum** squared error and to derive the solution by minimizing a function with calculus. However, the method has a nice geometric intuition if we use the tools for solving an over-determined system.

The objective of linear regression is to express a dependent variable as a linear function of independent variables. If we have one independent variable, we call it simple (some call it univariate) linear regression or single-variable linear regression; when we have many, we call it multiple linear regression. It is NOT **multivariate** regression: multivariate linear regression refers to the case where we have more than one dependent variable.

Without much mathematics or symbols, the idea of linear regression can be explored through a scatter plot. We generate a sample of X and Y from a normal distribution.
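As a minimal sketch in Python (the parameter values and variable names here are my own, for illustration), one way to generate such a sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw X from a normal distribution and build Y as a linear
# function of X plus normal noise.
x = rng.normal(loc=10.0, scale=2.0, size=100)
y = 3.0 + 1.5 * x + rng.normal(scale=1.0, size=100)

# A scatter plot of (x, y) would show points clustered around
# a straight line, e.g. with matplotlib:
#   import matplotlib.pyplot as plt
#   plt.scatter(x, y); plt.show()
```

With noise much smaller than the signal, the points hug the line and the linear pattern is visible by eye.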

Linear regression is about fitting a straight line to the scatter plot. The key challenge is **what constitutes a best-fit line**, in other words what the **best values** of the intercept and slope would be. The general idea is to find a line (i.e. its coefficients) such that the **total error** is at a minimum. The standard explanation is that we need to minimize the **total squared error**, which means we solve a minimization problem for the optimal values of the coefficients. That route involves quite a lot of calculus and provides little intuition or illustration; instead we will use a little vector algebra and the associated geometry to build intuition about the solution.

Let us assume we have 100 data points, and that our solution has to satisfy all of them, which means …

We can write the above system of equations in matrix notation:
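Assuming a simple regression with intercept b₀ and slope b₁ (the symbol names are my own), the system and its matrix form look like:

```latex
y_i = b_0 + b_1 x_i, \qquad i = 1, \dots, 100
\qquad\Longleftrightarrow\qquad
\underbrace{\begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{100} \end{bmatrix}}_{A}
\begin{bmatrix} b_0 \\ b_1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} y_1 \\ \vdots \\ y_{100} \end{bmatrix}}_{b}
```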

We have 100 equations and 2 unknowns; clearly we cannot solve this system for a **unique solution**. This type of system is called an **over-determined** system. Next, let us go deeper into **over-determined** systems and see what an approximate solution could look like, which is the key to solving the linear regression problem.

Let us consider two small over-determined systems and check whether we can get a unique solution.

If we look at the row-wise graph of the systems, we can conclude that over-determined systems generally do not have a solution; it is rare for such a system to be consistent. Therefore, we need to look for an approximate solution, and the process of approximation becomes clear if we explore the column-wise interpretation of the system.
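To make the row picture concrete, here is a hypothetical 3-equation, 2-unknown system (my own example): the first two equations pin down a unique (x, y), and the third equation then fails, so no exact solution exists.

```python
# Over-determined system:
#   x + y  = 2
#   x - y  = 0
#   x + 2y = 4
# Solving the first two equations: adding them gives 2x = 2.
x = (2 + 0) / 2          # x = 1.0
y = 2 - x                # y = 1.0

# The third equation is not satisfied by this point:
residual = (x + 2 * y) - 4
print(residual)          # -1.0, so the system has no exact solution
```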

3. Column Interpretation of an Over-Determined System

The column-wise view of the system is shown below.

We have plotted an over-determined system with no solution in the diagram above; clearly the vector **b** of the system Ax = b is not in the column space of A. Therefore, we cannot have a solution; to have one, b must lie in the column space of A. A way out is to approximate **b** within the column space, as in the diagram above, which we explain in the next figure.

As discussed earlier, we can approximate **b** in the column space of A, as per the diagram above: we make an orthogonal projection of **b** onto the plane that contains both columns of A. We can construct a new vector (marked by the blue line; call it B) in the diagram. Intuitively, the new system whose right-hand side B sits on the plane containing the columns of A can be thought of as an **approximate** version of the original system, and its solution as an approximate solution. This approximation method is known as the **least squares method** of solving an over-determined system. This idea is the key to solving the linear regression problem.
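A minimal numerical sketch of the projection idea, using a made-up 3×2 system: we project b onto the column space of A by solving A'Ax = A'b; the residual is then orthogonal to both columns of A.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Solve A'A x = A'b for the least-squares coefficients.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# B is the orthogonal projection of b onto the column space of A.
B = A @ x_hat
residual = b - B

# The residual is orthogonal to every column of A.
print(A.T @ residual)   # ~ [0, 0] up to floating-point error
```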

In the next section we extend the idea above to establish the following:

- How to obtain an approximate solution of the over-determined system, which is the solution of the linear regression problem.
- Why it is called least squares.
- What extra insight we get about the regression solution from the column space (observation space) visualization.

A typical linear regression problem amounts to solving an over-determined system of equations. To get a good geometric exposition, we have rewritten the original problem as a multiple regression in **mean-deviation form**. The first record of the changed system is shown below:

This change is not significant; it has been made to ensure our diagram conforms to the equations written above. With this formulation we depict the regression problem in the following diagram. The diagram would remain valid in a generic sense even if we did not write the system in mean-deviation form.

We have reproduced the regression problem in **observation space**, or **column space**; the detailed process of drawing this diagram is described separately. The most important point is that the Y vector has been decomposed into two orthogonal component **vectors**, so we can write:

Y = Ŷ + e

where Ŷ (the fitted vector) lies in the plane spanned by X1 and X2, and e is the residual vector.

It is obvious from the diagram that the **error** or **residual** vector e is orthogonal to both **X1 and X2**. Using the matrix-vector product notation we can write:

X’e = X’(Y − Xb) = 0

or

X’Xb = X’Y

If we analyze the above equations, we see that this is a 2×2 system; therefore we can solve it, and it has a **unique solution**. These equations are known as the **normal equations**, as the **residual** is normal to the regression plane. A more compact form would be:

b = (X’X)⁻¹X’Y

One point we still need to clarify is why the method described above is called the **least squares method**. Let us develop the concept of why it is called so.

In the expression above, we have developed a relation involving the **total sum of squares** of the residuals: we collect the residual from each observation, square it, and sum. Now, to recast this as a minimization problem, we ask which value of b would minimize the sum of squared residuals. A little calculus shows that the first-order condition of this minimization leads to the normal equations above, which means we arrive at the same solution. **Therefore** the method above can be called the **least squares solution** for over-determined systems of equations, which is also the solution approach for the linear regression problem.
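A quick numerical check of this claim, on my own toy data: the sum of squared residuals at the normal-equation solution is no larger than at any perturbed coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = 2.0 + 0.7 * X[:, 1] + rng.normal(scale=0.5, size=50)

# Normal-equation solution b = (X'X)^{-1} X'y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

def sse(b):
    """Sum of squared residuals for coefficient vector b."""
    r = y - X @ b
    return float(r @ r)

# Perturbing b_hat in any direction increases the sum of squares.
best = sse(b_hat)
worse = min(sse(b_hat + d) for d in ([0.1, 0.0], [0.0, 0.1], [-0.1, 0.1]))
print(best <= worse)   # True
```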

In addition to the solution approach, many other statistical relations, concepts and lemmas become obvious from the picture above, which otherwise require tedious symbol manipulation.

- Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares. (Apply Pythagoras' theorem to triangle OYD.)
- R², which measures the goodness of fit, varies between 0 and 1. (A little manipulation of the sum-of-squares relation gets us there.)
- Cov(Ŷ, e) = 0. We know the dot product between the vectors is zero, which means the covariance is zero.
- Residuals and independent variables are uncorrelated. We can extend the orthogonality relation from the picture.
- Effect of multiplying X1 by a factor k: from the diagram, the regression coefficient is the ratio OA : OX1; if we stretch X1, this ratio changes, so the effect depends on the value of k.
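The first three identities are easy to verify numerically. A sketch, using my own toy data in mean-deviation form (both variables centered):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.3 * x + rng.normal(size=200)

# Mean-deviation form: center both variables.
xc, yc = x - x.mean(), y - y.mean()

beta = (xc @ yc) / (xc @ xc)    # regression coefficient
y_fit = beta * xc               # fitted values (centered)
e = yc - y_fit                  # residuals

tss = yc @ yc                   # total sum of squares
rss = y_fit @ y_fit             # regression sum of squares
ess = e @ e                     # residual sum of squares

# Pythagoras: TSS = RSS + ESS, and the fitted values are
# orthogonal to (hence uncorrelated with) the residuals.
print(np.isclose(tss, rss + ess))   # True
print(np.isclose(y_fit @ e, 0.0))   # True
```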

…and many more follow if we extend this concept further, applying vector space ideas such as span and dimension to the diagram above, which will be the topic of another blog.

Another important point is that the **uniqueness** of the solution, and a few other insights, depend on the matrix X being healthy, i.e. its columns being independent. Independence of the columns ensures that X’X is invertible. Many scenarios can occur where the columns of X are not independent, and/or the matrix X lacks **good** numerical properties; these will be matters of discussion under least squares computation.
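A small illustration (my own example) of what goes wrong when the columns of X are dependent: X’X becomes singular, so the normal equations no longer have a unique solution.

```python
import numpy as np

# Second column is exactly 2x the first, so the columns are dependent.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])

XtX = X.T @ X
# Rank of X'X is 1, not 2, so it is not invertible.
print(np.linalg.matrix_rank(XtX))   # 1

try:
    np.linalg.inv(XtX)
    invertible = True
except np.linalg.LinAlgError:
    invertible = False
print(invertible)   # False
```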


© 2019 Data Science Central ®
