.

# The best kept secret about linear and logistic regression

Update: The most recent article on this topic can be found here

All the regression theory developed by statisticians over the last 200 years (related to the general linear model) is useless. Regression can be performed as accurately without statistical models, including the computation of confidence intervals (for estimates, predicted values or regression parameters). The non-statistical approach is also more robust than theory described in all statistics textbooks and taught in all statistical courses. It does not require Map-Reduce when data is really big, nor any matrix inversion, maximum likelihood estimation, or mathematical optimization (Newton algorithm). It is indeed incredibly simple, robust, easy to interpret, and easy to code (no statistical libraries required).

Here we will address linear regression. Logistic regression can be turned into a linear regression problem using basic variable transformations, so the principles presented in this article also apply to logistic regression. In a nutshell, approximate solutions produce more robust predictions, and the loss in accuracy is so small, it is smaller than the noise present in the data, in most cases. We hope that the solution described here will be of interest to all data scientists, software engineers, computer scientists, business analysts and other professionals who do not have a formal statistical training or lack proper tools (software). In addition, the solution offered here does not require fine-tuning mysterious parameters.

1. My solution applied to beautiful data

Beautiful data is when your data set is clean and meets all the conditions needed to make linear regression work well: in particular, the independent variables (sometimes referred to as features) are not correlated. This rarely happens in practice, but this is our starting point. Realistic situations are described in the next section.

Without loss of generality, we assume that the variables are centered, that is, their means are equal to zero. As a result, we are discussing a regression where intercept = 0, that is

Y = a_1 * X_1 + ... + a_n * X_n + noise,

where

• the a_i's (i=1,...,n) are the regression coefficients,
• the X_i's are the independent variables or features
• noise is a white noise
• Y is the response
• conditions required by traditional regression are satisfied

For beautiful data, we also assume that the independent variables are not correlated, that is cov(X_i, X_j) = 0 if i < j.

Then the solution is straightforward:

a_i = cov(Y, X_i) / var(X_i), i = 1, ..., n.

Note that the regression coefficients have the same sign as the respective correlations between the associated independent variables and the response.

2. My solution applied to ugly data

When independent variables are correlated - which is the case with ugly data, the typical type of data that we are all dealing with in practice - traditional linear regression does not work as well: it is subject to over-fitting, numerical instability and the fact that you can have many very different sets of regression coefficients that produce the same results.

My solution in this case is still the same:

a_i = M * cov(Y, X_i) / var(X_i), i =1, ..., n.

where M is a multiplier (a real number), chosen to minimize the variance of Z = Y - (a_1 * X_1 + ... + a_n * X_n). Note that Z = noise.

Remember that since we are working with centered variables, the mean value for the response, as well as for noise and independent variables, is zero. So multiplying by M has no effect on means. In the previous section (beautiful data), the multiplier is actually M = 1. There's a simple formula for M:

M = { Y' (SUM b_i* X_i) } / [ (SUM b_i * X'_i) * (SUM b_i * X_i) ],

where

• Y and the X_i's  are assumed to be column vectors,
• ' denotes the matrix transpose operator, and
• b_i = cov(Y, X_i) / var(X_i).

Dealing with non centered data

If we are not dealing with centered variables, we must add an intercept to the model, and the final estimate becomes W* = c + W, where c = Mean(Y - W) and W = a_1 * X_1 + ... + a_n * X_n.  This way, you remove the bias: Mean(W*) = Mean(Y). The term c is called the intercept. Also, M must be computed on the centered data (after subtracting the mean both from the dependent and independent variables.) An example in Excel, with all the computations, for non-centered data, is provided here

Confidence intervals

Confidence intervals can be computed by introducing noise in the data, and see how it impacts the estimated regression coefficients. Also note that even in the worst case scenario where all the independent variables are strongly correlated, my solution will work well and provide an optimum.

Goodness of fit

The quality of my approximate solution can be measured in different ways. Let's define

• S = a_1 * X_1 + ... + a_n * X_n where the a_i's are estimated using my method, and
• T = a_1 * X_1 + ... + a_n * X_n where the a_i's are estimated via traditional regression.

The following metrics measure how good my approximation is:

1. Correlation r = correl(S, T): the closer to 1, the better. A value above 0.90 is pretty good.
2. Correlation ratio c = Correl(S, Y) / correl(T, Y): the closer to 1, the better. A value above 0.95 is good. I assume that correl(S, Y) is always less than correl(T, Y), otherwise it would mean that my approximate solution is better than the exact solution, in the sense that my predicted values are closer (more correlated) to observed values (the response) than the predicted values obtained via traditional regression are.
3. Variance ratio d = var(Y-T) / var(Y-S). The closer to 1, the better the approximation. Note that var(Y-T) is always less than var(Y-S), otherwise it would mean that my approximate solution is better than the exact solution, in the sense that it achieves more variance reduction than traditional regression. This is impossible.

3. Practical example

I generated a data set with 10,000 observations and four independent variables denoted as x, y, z, v in my program and excel spreadsheet. I have introduced cross-correlations among the independent variables to make it a realistic example. Note that in this case, because the data is simulated, we know the exact theoretical solution. The exact theoretical solution and the one provided by traditional linear regression are almost as good, their predictive power is almost identical, as measured using the three metrics described in the previous section.

My approximate solution is very good too, though not as good as the traditional regression. I went one step further: not only I used my approximate solution, but I computed the regression coefficients using only the first 100 observations, that is, only 1% of the simulated data set. The result is still very good. Here's the summary (the details are found in the attached spreadsheet):

• True theoretical regression coefficients: 1.00, 0.50, -0.30, -2.00
• Regression coefficients obtained via traditional regression: 0.43, -0.38, 0.08, -2.02
• My approximation: 0.95, 0.05, -0.36, -1.46

Note that the three solutions (the three sets of regression coefficients) are very different, though the sum of the four regression coefficients is close to -0.85 in each case. But in each case, the estimated (predicted) values have similar correlation to the response: 0.69 for my approximation, and 0.72 for both theoretical and traditional solution. So in this case c = 0.69 / 0.72 = 0.96, pretty good. We also have r = 0.94. Finally, d = 0.89; both the traditional regression and my approximation manage to reduce the variance by approximately a factor 2.

My approximate solution has the advantage to be more stable (can't overfit) and easier to interpret, since the regression coefficients have the same sign as the correlation between response and associated variable.

open(OUT,">rand1.txt");
for (\$k=0; \$k<10000; \$k++) {
\$x=2*(rand()-0.5);
\$y=2*(rand()-0.5);
\$z=2*(rand()-0.5);
\$u=2*(rand()-0.5);

\$v1=0; \$v2=0;
for (\$m=0; \$m<10; \$m++) {
\$v1+=rand();
\$v2+=rand();
}

\$noise=\$v1-\$v2; # simulate white, Gaussian noise

\$response=(1*\$x + 0.5*\$y - 0.3*\$z - 2*\$u) + \$noise;
\$theoretical_estimate= 1*\$x + 0.5*\$y - 0.3*\$z - 2*\$u;

print OUT "\$response\t\$x\t\$y\t\$z\t\$u\t\$noise\t\$theoretical_estimate\n";
}
close(OUT);

5. Next steps

More tests are needed to see when my solution really underperforms, especially in higher dimensions. Also, the traditional linear regression is performed in the Excel spreadsheet, using the Linest function. This computation should be moved to the program. Note that even to simulate the white Gaussian noise in the data, I did not use any statistical libraries. See program for details, in previous section.

Another research topic is to find out how sensitive the traditional regression coefficient are, compared to the coefficients obtained via ny method. One way to compare sensitivity is to introduce noise in the data and see how it impacts the regression coefficients. Another way is to reshuffle the white noise 1,000 times and see the impact on coefficients. Or you could also slice the 10,000 observations into 10 samples of 1,000 observations each, and compute the regression coefficients for both methods, in each sample, then use the Analyticbridge first theorem to compute confidence intervals.

Finally, a better solution, when the number of variables is larger than 10, might be to have two (or more) parameters: M, and M', where M is applied to a subset of variables, and M' to the remaining variables. This works if the independent variables are centered (if their means, after appropriate transformation, are zero). In our example with n=4 variables, one parameter M (a real number, not a matrix) works well. I suspect that with n=10 variables, two parameters work better. Such a solution might involve looking at all 2^n groupings of two variables (variables attached to M, and variables attached to M') for optimization, and for each grouping, compute the optimum M and M' by vanishing the first derivatives of var(Z) with respect to M and M'. This is just an easy generalization of what I did in the case where we use only one parameter M. More generally, I suspect that the optimum number of parameters might be of the order log(n), and also dependent on the data, when we have a large number of variables. More on this later.

Views: 51508

Tags: logistic regression

Comment

Join Data Science Central

Comment by Jean-Marc Patenaude on March 17, 2014 at 7:56am

There are many ways to model a data set, and of course, no model fits the data perfectly by the very definition of the word "model".

The real question should not be whether regression models are good or bad, as they can be either depending on the modeling objectives.  Rather,the question should be what mathematical properties and limitations does our model exhibit, and is that okay for the application?

Specifically, it is fundamental to understand the mathematical properties that make up the model's behavior such as a traditional regression analysis.  Traditional regression is a linear optimization exercise that uses a cost function (an objective function in optimization jargon) that is a 2-norm (the mean square errors in statistics jargon).  Minimizing a 2-norm cost function exhibits certain (often misunderstood) behaviors due to the squaring of the errors, namely:

1)  It tries really hard to fit the outliers to the model because the error terms are squared, exacerbating the outlier effect on the regression slope.  This makes the model highly sensitive to outlier data points thus creating a form of numerical instability.  For example, add an outlier point to your data set and your slope is guaranteed to change by a non trivial amount.

2) On the other hand, it also under emphasizes the small errors (squaring an error of 0.1 = 0.01 for example).  So we're left with a regression line that is often overly fitted by the outlier data points relative to the points close to the regression line which are given little relative weight.

I want to emphasize that this isn't necessarily a bad thing for many (most?) applications.  But it IS absolutely the mathematical behavior of any least squares fit / optimization problem by virtue of its properties.

Other cost functions can also be used in a regression model that are not as sensitive to outliers.  For example, one often used is the 1-norm  e.g. the absolute value of the errors rather than the square of the errors.  This creates a linear relationship for each data point and therefore greatly reduces the issue of outlier instability mentioned above.

Once again, the real question is what kind of model is appropriate for our application, and what mathematical properties & limitations does the model have and can we live with. The issue I often see is that people blindly use traditional least-squares regression without appreciating its mathematical properties and limitations, then wonder why it behaves funny when new data comes in.  As is always the case with any modeling exercise, user beware and know your maths...

Comment by Vincent Granville on March 16, 2014 at 10:45am

Scott Nickleach

Statistical Consultant at Equifax

The magnitude of your largest pairwise correlation is 0.3. The average pairwise correlation is about 0.15. I'd pretty much consider such a structure negligible. Try your approach when the average pairwise correlation is about 0.75. You also seem to imply that for logistic regression, you one can just fit the portion of the model that is linear (i.e. log-odds ratio as a linear function of X), and then do basic transformations. I guess by basic transformations, you are implying use of the inverse-logit transformation to obtain the estimated probabilities. Of course one cannot do that. Among other standard results, Jensen's Inequality can help to remind us why.

Vincent Granville

Co-Founder, Big Data Scientist, DSC Network

Regarding pair-wise correlations in my example, sure they are a little weak. If you boost them to 0.75, then classical regression is terrible, my approximation won't be as good, and my recommended course of action consists in using my solution outlined in section 5, that is using M and M', which essentially consists in clustering the variables into two clusters and apply a different M to each cluster. Within a cluster, variables are highly correlated. Indeed, in this case, decision trees is a better approach than regression, be it classical or my style.

Regarding logistic regression, I know that the transformation that I suggested does not produce the same results as MLE. They should be close enough though, I have to look at that in more details. Not that ML estimates are better than non ML estimates anyway, it depends of course how you define "better".

Comment by Vincent Granville on March 15, 2014 at 8:56pm

At least even ASA (the American Statistical Association ) recognizes that the FNP paradigm (Fisher-Neyman-Pearson), even the Maximum Likelihood princples, are deeply wrong.

Comment by Gary D. Miner, Ph.D. on March 15, 2014 at 11:12am

SPOT ON, Vincent ........the "old school statisticians" just will not give up!!!!! .......I was reading a FED GRANT PROPOSAL recently where "statisticians" wanted MILLIONS OF DOLLARS to buy a few laptops, and then "pay their salaries" for several years to DEVELOP A COMPLETELY NEW STATISTIC (but they were so "out to lunch" probably being "cloistered in their TOWER" -- Is it not reputed to be made out of "ivory", that commodity that is on the BLACK LIST in worldwide trade? -- But anyway, what they were proposing as "new' was a 'RE-INVENTING OF THE WHEEL'. All they needed to do was "wake up", see what the REAL WORLD / REAL SOCIETY is doing, and use the already developed modern MACHINE LEARNING / PREDICTIVE ANALYTICS to very rapidly answer their "use case" questions. NO NEW STATISTIC was needed !!! .... Only EDUCATION of the ACADEMICS !!!!!........... This is a crazy time we live in, when TECHNOLOYG is waiting for us to use it, but the "users" are decades behind in understanding ............a lot of "catching up is needed, I believe ...... (Or maybe a "new type of education system" ?) ..........

Comment by Vincent Granville on March 14, 2014 at 9:15am

Mike, the counters v1 and v2 are reset to 0 at each iteration. The noise should have the same pattern at the beginning, middle or end of the main iteration (to produce the 10,000 obs).

Comment by Mike Olson on March 14, 2014 at 9:00am

The way you generate the Gaussian noise means that by using the first 100 samples in the data, you've likely minimized the amount of noise in the subset as opposed to taking 100 samples from the larger set using a uniform distribution.  The reason is that the maximum possible drift between the Brownian motion variables v1 and v2 increases with each sample.  So, the magnitude of the Gaussian noise should actually be correlated with the sample number.  I'm curious if that has skewed the results any.

Comment by Neep Hazarika on March 14, 2014 at 1:57am

Statistics is NOT science, no matter how hard they try, using buzzwords such as Bayesian, Frequentist, etc.

Comment by Sune Karlsson on March 14, 2014 at 12:14am

This attempts to answer a different question. Regression is designed to answer the question "What happens to y if I change one x-variable, holding the other x-variables constant". Your approach attempts to answer the question "What happens to y, on average, if I change one x-variable, averaging over the x-variables".

My guess that introducing the constant M leads to a bias in the answer to the second question. A simple regression with just one x-variable would probably work better.

If a method produces a good fit or not is important but secondary to if it produces an answer to the right question. Both questions can be right, depending on context.