
The best kept secret about linear and logistic regression

Update: The most recent article on this topic can be found here

All the regression theory developed by statisticians over the last 200 years (related to the general linear model) is useless. Regression can be performed just as accurately without statistical models, including the computation of confidence intervals (for estimates, predicted values, or regression parameters). The non-statistical approach is also more robust than the theory described in statistics textbooks and taught in statistics courses. It does not require Map-Reduce when data is really big, nor any matrix inversion, maximum likelihood estimation, or mathematical optimization (Newton's algorithm). It is indeed incredibly simple, robust, easy to interpret, and easy to code (no statistical libraries required).

Here we address linear regression. Logistic regression can be turned into a linear regression problem using basic variable transformations, so the principles presented in this article also apply to logistic regression. In a nutshell, approximate solutions produce more robust predictions, and the loss in accuracy is so small that, in most cases, it is smaller than the noise present in the data. We hope that the solution described here will be of interest to all data scientists, software engineers, computer scientists, business analysts, and other professionals who do not have formal statistical training or lack the proper tools (software). In addition, the solution offered here does not require fine-tuning mysterious parameters.

1. My solution applied to beautiful data

Beautiful data is data that is clean and meets all the conditions needed for linear regression to work well: in particular, the independent variables (sometimes referred to as features) are not correlated. This rarely happens in practice, but it is our starting point. Realistic situations are described in the next section.

Without loss of generality, we assume that the variables are centered, that is, their means are equal to zero. As a result, we are discussing a regression where intercept = 0, that is

Y = a_1 * X_1 + ... + a_n * X_n + noise,

where

  • the a_i's (i = 1, ..., n) are the regression coefficients,
  • the X_i's are the independent variables (features),
  • noise is white noise,
  • Y is the response, and
  • the conditions required by traditional regression are satisfied.

For beautiful data, we also assume that the independent variables are not correlated, that is cov(X_i, X_j) = 0 if i < j.

Then the solution is straightforward:

a_i = cov(Y, X_i) / var(X_i), i = 1, ..., n.

Note that the regression coefficients have the same sign as the respective correlations between the associated independent variables and the response.
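
For illustration, here is a minimal Perl sketch of this formula for centered data, with cov and var computed from scratch (no statistical libraries). The helper functions and the toy data are made up for the example and are not part of the original spreadsheet or program.

#!/usr/bin/perl
# Sketch: a_i = cov(Y, X_i) / var(X_i) for centered, uncorrelated features.
# The toy data below is illustrative only.
use strict;
use warnings;

sub mean { my $s = 0; $s += $_ for @_; return $s / scalar(@_); }
sub cov {
  my ($p, $q) = @_;
  my ($mp, $mq) = (mean(@$p), mean(@$q));
  my $s = 0;
  $s += ($p->[$_] - $mp) * ($q->[$_] - $mq) for 0 .. $#$p;
  return $s / scalar(@$p);
}

# Toy centered data: @y is the response, @x1 and @x2 are uncorrelated features.
my @y  = (1.2, -0.7,  0.3, -0.8);
my @x1 = (1.0, -1.0,  0.5, -0.5);
my @x2 = (0.5,  0.5, -0.5, -0.5);

for my $x (\@x1, \@x2) {
  my $coef = cov(\@y, $x) / cov($x, $x);   # var(X_i) = cov(X_i, X_i)
  print "coefficient: $coef\n";
}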

2. My solution applied to ugly data

When independent variables are correlated (which is the case with ugly data, the typical kind we all deal with in practice), traditional linear regression does not work as well: it is subject to over-fitting, numerical instability, and the fact that many very different sets of regression coefficients can produce the same results.

My solution in this case is still the same:

a_i = M * cov(Y, X_i) / var(X_i), i = 1, ..., n,

where M is a multiplier (a real number) chosen to minimize the variance of Z = Y - (a_1 * X_1 + ... + a_n * X_n). Note that Z is the residual, which plays the role of the noise term.

Remember that since we are working with centered variables, the mean value of the response, of the noise, and of the independent variables is zero, so multiplying by M has no effect on means. In the previous section (beautiful data), the multiplier is actually M = 1. There is a simple formula for M (a short Perl sketch follows the definitions below):

M = [ Y' (SUM b_i * X_i) ] / [ (SUM b_i * X_i)' (SUM b_i * X_i) ],

where

  • Y and the X_i's  are assumed to be column vectors,
  • ' denotes the matrix transpose operator, and  
  • b_i = cov(Y, X_i) / var(X_i).
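
Here is a minimal Perl sketch that transcribes this formula: compute the univariate coefficients b_i, form S = SUM b_i * X_i, take the ratio M = (Y' S) / (S' S), and set a_i = M * b_i. The toy (centered, correlated) data is made up for the example.

#!/usr/bin/perl
# Sketch: b_i = cov(Y, X_i) / var(X_i), S = SUM b_i * X_i, M = (Y' S) / (S' S),
# and a_i = M * b_i, computed on toy centered data with correlated features.
use strict;
use warnings;

sub mean { my $s = 0; $s += $_ for @_; return $s / scalar(@_); }
sub cov {
  my ($p, $q) = @_;
  my ($mp, $mq) = (mean(@$p), mean(@$q));
  my $s = 0;
  $s += ($p->[$_] - $mp) * ($q->[$_] - $mq) for 0 .. $#$p;
  return $s / scalar(@$p);
}

# Toy centered data (means are zero); x1 and x2 are correlated on purpose.
my @y  = ( 1.0, -0.5,  0.8, -1.3);
my @x1 = ( 0.9, -0.4,  0.6, -1.1);
my @x2 = ( 0.7, -0.6,  0.5, -0.6);
my @features = (\@x1, \@x2);

# Step 1: univariate coefficients b_i = cov(Y, X_i) / var(X_i).
my @b = map { cov(\@y, $_) / cov($_, $_) } @features;

# Step 2: S = SUM b_i * X_i, evaluated observation by observation.
my @s_b = map {
  my $j = $_;
  my $sum = 0;
  $sum += $b[$_] * $features[$_][$j] for 0 .. $#features;
  $sum;
} 0 .. $#y;

# Step 3: M = (Y' S) / (S' S), then a_i = M * b_i.
my ($num, $den) = (0, 0);
for my $j (0 .. $#y) {
  $num += $y[$j]   * $s_b[$j];
  $den += $s_b[$j] * $s_b[$j];
}
my $M = $num / $den;
my @a = map { $M * $_ } @b;
print "M = $M, coefficients: @a\n";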

Dealing with non-centered data

If we are not dealing with centered variables, we must add an intercept to the model, and the final estimate becomes W* = c + W, where c = Mean(Y - W) and W = a_1 * X_1 + ... + a_n * X_n. This removes the bias: Mean(W*) = Mean(Y). The term c is called the intercept. Also, M must be computed on the centered data (after subtracting the mean from both the dependent and the independent variables). An example in Excel, with all the computations for non-centered data, is provided here
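
Below is a small Perl sketch of the intercept step only. The coefficients @a are assumed to have already been computed on centered copies of the data (for instance with the sketch in the previous section); their values here are hypothetical, as is the toy data.

#!/usr/bin/perl
# Sketch of the intercept step for non-centered data. The coefficients in @a
# are hypothetical, standing in for values obtained from a fit on centered data.
use strict;
use warnings;

sub mean { my $s = 0; $s += $_ for @_; return $s / scalar(@_); }

# Original (non-centered) observations; made-up numbers for illustration.
my @y  = (5.1, 4.2, 6.3, 5.6);
my @x1 = (2.0, 1.5, 2.8, 2.3);
my @x2 = (0.9, 1.2, 0.7, 1.0);
my @features = (\@x1, \@x2);
my @a = (1.1, -0.8);               # hypothetical coefficients from the centered fit

# W = a_1 * X_1 + ... + a_n * X_n, computed observation by observation.
my @w = map {
  my $j = $_;
  my $sum = 0;
  $sum += $a[$_] * $features[$_][$j] for 0 .. $#a;
  $sum;
} 0 .. $#y;

# Intercept c = Mean(Y - W); final estimate W* = c + W, so Mean(W*) = Mean(Y).
my $c = mean(map { $y[$_] - $w[$_] } 0 .. $#y);
my @w_star = map { $c + $_ } @w;
print "intercept = $c, fitted values = @w_star\n";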

Confidence intervals

Confidence intervals can be computed by introducing noise in the data and seeing how it impacts the estimated regression coefficients. Also note that even in the worst-case scenario, where all the independent variables are strongly correlated, my solution will work well and provide an optimum.
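
For illustration, here is a minimal Perl sketch of one possible interpretation of this idea: perturb the response many times, re-estimate a (single-variable) coefficient each time, and read off an empirical interval. The toy data, the noise scale of 0.1, and the percentile-based interval are illustrative choices, not prescriptions from the article.

#!/usr/bin/perl
# Sketch of perturbation-based confidence intervals: add small noise to the
# response, re-estimate the coefficient, repeat, and look at the spread.
# One feature only; the noise scale (0.1) is an arbitrary illustrative choice.
use strict;
use warnings;

sub mean { my $s = 0; $s += $_ for @_; return $s / scalar(@_); }
sub cov {
  my ($p, $q) = @_;
  my ($mp, $mq) = (mean(@$p), mean(@$q));
  my $s = 0;
  $s += ($p->[$_] - $mp) * ($q->[$_] - $mq) for 0 .. $#$p;
  return $s / scalar(@$p);
}

my @y = (1.2, -0.7, 0.3, -0.8);
my @x = (1.0, -1.0, 0.5, -0.5);

my @estimates;
for my $rep (1 .. 1000) {
  my @y_pert = map { $_ + 0.1 * (rand() - 0.5) } @y;    # perturbed response
  push @estimates, cov(\@y_pert, \@x) / cov(\@x, \@x);  # re-estimated coefficient
}

# Empirical 95% interval from the 2.5th and 97.5th percentiles.
my @sorted = sort { $a <=> $b } @estimates;
my ($lo, $hi) = ($sorted[24], $sorted[974]);
print "approximate 95% confidence interval: [$lo, $hi]\n";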

Goodness of fit

The quality of my approximate solution can be measured in different ways. Let's define

  • S = a_1 * X_1 + ... + a_n * X_n where the a_i's are estimated using my method, and
  • T = a_1 * X_1 + ... + a_n * X_n where the a_i's are estimated via traditional regression.

The following metrics measure how good my approximation is (a short Perl sketch computing them follows the list):

  1. Correlation r = correl(S, T): the closer to 1, the better. A value above 0.90 is pretty good.
  2. Correlation ratio c = correl(S, Y) / correl(T, Y): the closer to 1, the better. A value above 0.95 is good. I assume that correl(S, Y) is always less than correl(T, Y); otherwise it would mean that my approximate solution is better than the exact solution, in the sense that my predicted values are closer (more correlated) to the observed values (the response) than the predicted values obtained via traditional regression are.
  3. Variance ratio d = var(Y-T) / var(Y-S). The closer to 1, the better the approximation. Note that var(Y-T) is always less than var(Y-S), otherwise it would mean that my approximate solution is better than the exact solution, in the sense that it achieves more variance reduction than traditional regression. This is impossible.
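
Here is a minimal Perl sketch that computes the three metrics, given a response @y and two sets of fitted values, @s (approximation) and @t (traditional regression). The numbers are made up, just to exercise the formulas.

#!/usr/bin/perl
# Sketch of the three goodness-of-fit metrics: r = correl(S, T),
# c = correl(S, Y) / correl(T, Y), d = var(Y-T) / var(Y-S).
# The vectors below are made-up numbers for illustration.
use strict;
use warnings;

sub mean { my $s = 0; $s += $_ for @_; return $s / scalar(@_); }
sub cov {
  my ($p, $q) = @_;
  my ($mp, $mq) = (mean(@$p), mean(@$q));
  my $s = 0;
  $s += ($p->[$_] - $mp) * ($q->[$_] - $mq) for 0 .. $#$p;
  return $s / scalar(@$p);
}
sub correl { my ($p, $q) = @_; return cov($p, $q) / sqrt(cov($p, $p) * cov($q, $q)); }

my @y = ( 1.1, -0.6,  0.4, -0.9,  0.7, -0.7);
my @s = ( 0.9, -0.5,  0.2, -0.7,  0.6, -0.5);   # predictions from the approximation
my @t = ( 1.0, -0.6,  0.3, -0.8,  0.6, -0.5);   # predictions from traditional regression

my @y_minus_s = map { $y[$_] - $s[$_] } 0 .. $#y;
my @y_minus_t = map { $y[$_] - $t[$_] } 0 .. $#y;

my $r = correl(\@s, \@t);                                               # metric 1
my $c = correl(\@s, \@y) / correl(\@t, \@y);                            # metric 2
my $d = cov(\@y_minus_t, \@y_minus_t) / cov(\@y_minus_s, \@y_minus_s);  # metric 3
print "r = $r, c = $c, d = $d\n";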

3. Practical example

I generated a data set with 10,000 observations and four independent variables, denoted as x, y, z, u in my program and Excel spreadsheet. I introduced cross-correlations among the independent variables to make it a realistic example. Note that in this case, because the data is simulated, we know the exact theoretical solution. The exact theoretical solution and the one provided by traditional linear regression are almost equally good: their predictive power is almost identical, as measured using the three metrics described in the previous section.

My approximate solution is very good too, though not as good as traditional regression. I went one step further: not only did I use my approximate solution, but I also computed the regression coefficients using only the first 100 observations, that is, only 1% of the simulated data set. The result is still very good. Here's the summary (the details are in the attached spreadsheet):

  • True theoretical regression coefficients: 1.00, 0.50, -0.30, -2.00
  • Regression coefficients obtained via traditional regression: 0.43, -0.38, 0.08, -2.02
  • My approximation: 0.95, 0.05, -0.36, -1.46

Note that the three solutions (the three sets of regression coefficients) are very different, though the sum of the four regression coefficients is close to -0.85 in each case. But in each case, the estimated (predicted) values have a similar correlation to the response: 0.69 for my approximation, and 0.72 for both the theoretical and traditional solutions. So in this case c = 0.69 / 0.72 = 0.96, which is pretty good. We also have r = 0.94. Finally, d = 0.89; both the traditional regression and my approximation manage to reduce the variance by approximately a factor of 2.

My approximate solution has the advantage of being more stable (it can't overfit) and easier to interpret, since the regression coefficients have the same sign as the correlation between the response and the associated variable.

4. Program to simulate realistic data used in this article

#!/usr/bin/perl
# Simulate 10,000 observations with four cross-correlated independent variables
# (x, y, z, u), approximately Gaussian noise, and the response.
use strict;
use warnings;

open(my $out, '>', 'rand1.txt') or die "Cannot open rand1.txt: $!";
for (my $k = 0; $k < 10000; $k++) {
  my $x = 2 * (rand() - 0.5);                # uniform on [-1, 1]
  my $y = 2 * (rand() - 0.5);
  my $z = 2 * (rand() - 0.5);
  $z = $z + 0.3*$y - 0.1*$x;                 # add cross-correlations
  my $u = 2 * (rand() - 0.5);
  $u = $u + 0.2*$z - 0.4*$y - 0.3*$x;        # add more cross-correlations

  # Difference of two sums of 10 uniform deviates: approximately Gaussian.
  my ($v1, $v2) = (0, 0);
  for (my $m = 0; $m < 10; $m++) {
    $v1 += rand();
    $v2 += rand();
  }
  my $noise = $v1 - $v2;                     # simulate white, Gaussian noise

  my $response = (1*$x + 0.5*$y - 0.3*$z - 2*$u) + $noise;
  my $theoretical_estimate = 1*$x + 0.5*$y - 0.3*$z - 2*$u;

  print $out "$response\t$x\t$y\t$z\t$u\t$noise\t$theoretical_estimate\n";
}
close($out);

5. Next steps

More tests are needed to see when my solution really underperforms, especially in higher dimensions. Also, the traditional linear regression is performed in the Excel spreadsheet, using the LINEST function; this computation should be moved to the program. Note that even to simulate the white Gaussian noise in the data, I did not use any statistical libraries; see the program in the previous section for details.

Another research topic is to find out how sensitive the traditional regression coefficients are, compared with the coefficients obtained via my method. One way to compare sensitivity is to introduce noise in the data and see how it impacts the regression coefficients. Another way is to reshuffle the white noise 1,000 times and see the impact on the coefficients. Or you could slice the 10,000 observations into 10 samples of 1,000 observations each, compute the regression coefficients for both methods in each sample, and then use the Analyticbridge first theorem to compute confidence intervals.

Finally, a better solution when the number of variables is larger than 10 might be to have two (or more) parameters, M and M', where M is applied to a subset of the variables and M' to the remaining variables. This works if the independent variables are centered (if their means, after appropriate transformation, are zero). In our example with n = 4 variables, one parameter M (a real number, not a matrix) works well. I suspect that with n = 10 variables, two parameters work better. Such a solution might involve looking at all 2^n ways of splitting the variables into two groups (variables attached to M, and variables attached to M'), and, for each grouping, computing the optimum M and M' by setting to zero the first derivatives of var(Z) with respect to M and M'. This is just an easy generalization of what I did in the case of a single parameter M. More generally, I suspect that the optimum number of parameters might be of the order of log(n), and also dependent on the data, when we have a large number of variables. More on this later.

Comments

Comment by Vincent Granville on September 22, 2014 at 6:54am

Note: The purpose here is to get regression coefficients that provide similar predictions to the theoretical ones, however "wrong" or (more precisely) "different" these coefficients might be. In the extreme case where all variables are identical, there is an infinite number of theoretical solutions, yet my approach provides the same predictive power on past (training or control) data, and better predictions for future data (or in cross-validations) because it is far more stable. Also note that my approach leads to regression coefficients that are easy to interpret. Traditional regression produces regression coefficients that are very difficult to interpret (when variables are highly correlated), and in the case of identical variables, it produces an infinite number of conflicting sets of meaningless regression coefficients.

Comment by Vincent Granville on September 21, 2014 at 9:08pm

Lisa, the article in question is located at http://magazine.amstat.org/blog/2014/03/01/introductory-statistics/ and amstat.org is ASA. Anyway, ASA's perception of what statistical science is about is funny at best, sad at worst.

There are numerous notorious articles published about "the death of p-value", including this one.

Comment by Kevin Kautz on April 19, 2014 at 12:03pm

That's a thin line you're treading. I strongly concur with some of the favorable comments of others -- that if we teach statistics by providing regression as an algorithm that always gives you an answer, well, that's highly misleading and many have been misled. Certainly we need to emphasize the fundamental principles that even Pearson started with rather than the model that he ended up with, which was elegant for his moment and application. But your argument is weakened when you suggest that your model should supplant the old standby. Even if your model stands the test of time, it does not benefit to offer "black boxes" that always give answers. Hey... any wrong model, consistently applied, will always give an answer ... just not a useful one. What does indeed stand the test of time is that learning what you're actually doing in statistics is worth more than memorizing models and where to apply them.

Comment by Gennaro Anesi on April 8, 2014 at 3:36pm

Surely the OLS model is not "useless", and variance minimization has lots of practical uses... I think the L-2 norm is a nice all-around choice, assuming that the mean value of the set represents the entire dataset (an assumption that is met in lots of practical cases). With large data sets (especially customer ones), I would rather do some prior clustering and then apply a linear model to each cluster; this seems to work well in this area.

But I see a lot of statistics in your text, so I don't believe it's a non-statistical approach (maybe a less theoretical approach, but statistics is, fortunately, not only about theorems nowadays).

Comment by Vincent Granville on April 3, 2014 at 10:46am

Meag: Yes, OLS is always the best solution from a theoretical point of view only (minimizing variance). Variance is a poor criterion to start with; actually, in the contest I ask, if possible, to use an L-1 metric rather than the classical L-2 variance. But frequently, OLS is a bad choice in business / practical frameworks, because it is not robust, numerically unstable, and arbitrary (the regression coefficients are difficult to interpret).

For the contest, you are welcome to bring or create your own data sets, I think I mentioned that you need to test it on 20 data sets. The purpose of my Jackknife regression is to do good, robust regression even if your data is crappy, or if you are not a statistician. It's linear regression "for the masses". And because of this, it can be automated without having to worry about lack of fit and other issues. 

Comment by Meag on April 3, 2014 at 10:21am

I see this idea is now a contest. However, the data it is applied to is not a useful one to show the desired effects. When the data is pulled from a fixed linear relationship with cross-correlation between the explanatory variables, classical regression is guaranteed to be optimal in minimizing squared loss. See an econometrics textbook such as Judge et al. for the proofs that OLS offers the Best Linear Unbiased Estimator (BLUE) in such cases.

Other problems:

Can you fix the conjecture M = 1? Clearly wrong, unless N = 1.

How could one address what Steve points out: when the univariate b_i regression coefficient has a positive sign but the corresponding coefficient in the data-generating process is negative, your two-step estimation of a_i is biased, and then, in order to get unbiased predictions of Y, the other coefficients must be biased as well.

In sum, when the data is drawn according to this relationship 

1*$x + 0.5*$y - 0.3*$z - 2*$u

the two-step 'jackknife' is dominated by countless alternatives, irrespective of how much cross-correlation is added.

I think the basic problem is that this statement is misleading:

    When independent variables are correlated ... traditional linear regression does not work as well.

OLS is typically ideal for such cases as long as linearity is upheld and the independent variable candidates are known, which is exactly how your data set is constructed. In the pathological 99% correlation cases, Bayesian/ridge regression techniques, or simply throwing away the effectively redundant variables, are typically acceptable.

Finally, to echo Steve, we should never claim something is useless based on a single simulation test. In fact, your simulation is a case where the presumably useless statistical method wins the horse race. Theory actually predicted such a result. Notch one up for 200 years of statistical research.

Comment by Steve Simon on March 28, 2014 at 7:40am

So you've decided that a methodology that has worked well for 200 years is "useless"? And your proof that it is useless is a single simulation test of a competing methodology?

Sorry, but I'm not convinced yet. The problem that I see is that you can't accommodate the fairly common case where the univariate regression coefficient is of a different sign than the corresponding coefficient in a multivariate model. M multiplies all of the variables, so it can't flip the sign of just some of the coefficients. Splitting into M and M' doesn't help much, because looking at 2^n groupings is more computationally complex than matrix inversion.

It looks like M works somewhat like a shrinkage parameter, so you might consider how this compares to ridge regression. You might want to compare it to Diagonal Linear Discriminant Analysis, which also avoids the inversion of a large matrix.

I don't mean to sound so negative. There's way too much hype in your post, but your methodology may have merit, if only for very large problems where matrix inversion is impractical.

Steve Simon, www.pmean.com

P.S. Please use something other than Excel for your regression calculations.

http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction...

http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf

http://www.practicalstats.com/xlsstats/excelstats.html

http://homepage.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf

Some of these references are dated, but I have seen no evidence that Microsoft cares about or has tried to correct any of the problems noted here.

Comment by Sebastian Sohr on March 23, 2014 at 6:44am

Dear Vincent, I took your article as an incentive to refresh my knowledge on regression.

I came across the following points:

Part 1 (beautiful case).

I think the prerequisites are so strong that your simple solution is identical with the linear regression solution:

Let X = ( X_1 ... X_n ). E(X_i) = 0 (expectation), and cov(X_i, X_j) = 0 for i < j means that the X_i are orthogonal:

X' X = diag ( var(X_1), ..., var(X_n) ) where X' is the transposed matrix and diag(...) denotes the diagonal matrix.

Also, you have X'Y = ( cov(Y, X_1),...,cov(Y, X_n) ).

Taking the linear regression solution a = (X' X)^{-1} X'Y, where (...)^{-1} denotes the inverse matrix, you get exactly

a_i = cov(Y, X_i) / var(X_i).

Part 2.

I think correlated X_i's are not ugly. They contain observations in such a way that the entries X_{j,i} (j = 1, ..., number of observations) belong to the same entity (e.g., the same patient) for the same j. These almost always should be correlated. Concluding from this that linear regression does not work sounds a little bit strong to me.

Concerning the multiplier M: it is not 1 in the beautiful case. If you take n = number of observations here, X is invertible, and therefore, for the linear regression solution which minimizes Z, we have

Y - Xa = Y - X(X'X)^{-1}X'Y = Y - X X^{-1} X'^{-1} X'Y = Y - Y = 0.

Another remark:

If your method is correct and delivers the exact solution in the deterministic case (Z = 0), it should also apply to linear equation systems of the form y = Xa. Your formula also delivers the inverse of a matrix X.

All linear algebra textbooks have to be rewritten ;-) .

If I'm wrong somewhere or everywhere, please tell.

Comment by Jean-Marc Patenaude on March 17, 2014 at 11:13am

Vincent, thanks for sharing the link to your article on correlation and the use of the 1-norm to address some of its limitations. I found it interesting, and your point about how to use it in the analysis of data buckets or clusters is well made.

The common theme once again is to ensure that one fully understands the mathematical properties and limitations of the model or statistic being considered (such as correlation), and not apply it blindly.  In particular, Pearson's correlation is one of those that people grossly over-abuse without realizing its very important limitations (despite its usefulness, let's not forget it).  To illustrate this, let us all remember Anscombe's quartet...

http://en.wikipedia.org/wiki/Anscombe's_quartet

Comment by Vincent Granville on March 17, 2014 at 8:36am

Jean-Marc, I could not agree more with you. I have to admit that here I've used the lazy approach, which consists of minimizing variance (what statisticians have been doing for a few centuries), that is, using the 2-norm to quickly get an easy solution through simple mathematics / matrix algebra. But the 1-norm is far superior; see my article on 1-norm correlations.
