Comments - The best kept secret about linear and logistic regression - Data Science Central
<p><em>Vincent Granville, 2014-09-22:</em></p>
<p><span><strong>Note</strong>: The purpose here is to get regression coefficients that provide predictions similar to the theoretical ones, however "wrong" or (more precisely) "different" these coefficients might be. In the extreme case where all variables are identical, there is an infinite number of theoretical solutions, yet my approach provides the same predictive power on past (training or control) data, and better predictions on future data (or in cross-validations) because it is far more stable. Also note that my approach leads to regression coefficients that are easy to interpret. Traditional regression produces coefficients that are very difficult to interpret when variables are highly correlated, and in the case of identical variables it produces an infinite number of conflicting sets of meaningless regression coefficients.</span></p>
<p><em>Vincent Granville, 2014-09-22:</em></p>
<p>Lisa, the article in question is located at <a href="http://magazine.amstat.org/blog/2014/03/01/introductory-statistics/" target="_blank">http://magazine.amstat.org/blog/2014/03/01/introductory-statistics/</a> and amstat.org is the ASA. Anyway, the ASA's perception of what statistical science is about is funny at best, sad at worst.</p>
<p>Numerous notorious articles have been published about "the death of the p-value", <a href="http://www.datasciencecentral.com/profiles/blogs/p-values-the-gold-standard-of-statistical-validity-are-not-as" target="_blank">including this one</a>.</p>
<p><em>Kevin Kautz, 2014-04-19:</em></p>
<p>That's a thin line you're treading. I strongly concur with some of the favorable comments of others -- that if we teach statistics by presenting regression as an algorithm that always gives you an answer, well, that's highly misleading, and many have been misled. Certainly we need to emphasize the fundamental principles that even Pearson started with rather than the model he ended up with, which was elegant for his moment and application. But your argument is weakened when you suggest that your model should supplant the old standby. Even if your model stands the test of time, it does no good to offer "black boxes" that always give answers. Hey... any wrong model, consistently applied, will always give an answer... just not a useful one. What does indeed stand the test of time is that learning what you're actually doing in statistics is worth more than memorizing models and where to apply them.</p>
<p><em>Gennaro Anesi, 2014-04-09:</em></p>
<p>Surely the OLS model is not "useless", and variance minimization has lots of practical uses. I think the L2 norm is a nice all-around choice, assuming that the mean value represents the entire dataset (an assumption that holds in many practical settings). With large data sets (especially customer data) I would rather do some prior clustering and then apply a linear model to each cluster; this seems to work well in that area.</p>
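The cluster-then-regress recipe described above can be sketched as follows. This is a minimal numpy-only illustration on synthetic two-segment data (all names and numbers are invented for the example), not a production clustering pipeline:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data: two "customer" segments with different linear responses.
x_a = rng.uniform(0, 1, 200)
y_a = 2.0 * x_a + 0.1 * rng.standard_normal(200)
x_b = rng.uniform(4, 5, 200)
y_b = -1.0 * x_b + 8.0 + 0.1 * rng.standard_normal(200)
x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])

# Step 1: crude one-dimensional 2-means clustering on x.
centers = np.array([x.min(), x.max()])
for _ in range(20):
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    centers = np.array([x[labels == k].mean() for k in (0, 1)])

# Step 2: fit a separate linear model (slope, intercept) inside each cluster.
fits = {k: np.polyfit(x[labels == k], y[labels == k], 1) for k in (0, 1)}
print({k: np.round(v, 2) for k, v in fits.items()})
```

Each cluster recovers its own slope (near 2 and near -1 here), whereas a single global linear fit would average the two regimes away.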
<p>But I see a lot of statistics in your text, so I don't believe it's a non-statistical approach (maybe a less theoretical approach, but statistics is - fortunately - not only about theorems nowadays).</p>
<p><em>Vincent Granville, 2014-04-03:</em></p>
<p>Meag: Yes, OLS is always the best solution from a <em>theoretical</em> point of view only (minimizing variance). Variance is a poor criterion to start with - in fact, in the contest I ask, if possible, to use an L1 metric rather than the classical L2 variance. But frequently OLS is a bad choice in business / practical frameworks, because it is not robust, numerically unstable, and arbitrary (the regression coefficients are difficult to interpret).</p>
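The robustness argument for the L1 metric can be illustrated with a toy one-parameter fit: a slope through the origin, minimized by brute-force grid search under each loss. This is a generic illustration, not the contest's actual criterion:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 50)
y = 2.0 * x + 0.1 * rng.standard_normal(50)
y[-1] += 100.0  # a single gross outlier

# Fit y ~ a*x by scanning candidate slopes under each loss.
grid = np.linspace(0.0, 5.0, 5001)
a_l2 = grid[np.argmin([np.sum((y - a * x) ** 2) for a in grid])]
a_l1 = grid[np.argmin([np.sum(np.abs(y - a * x)) for a in grid])]
print(round(a_l1, 2), round(a_l2, 2))  # the L1 fit stays near the true slope 2
```

The single outlier drags the least-squares slope well away from 2, while the least-absolute-deviations slope barely moves.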
<p>For the contest, you are welcome to bring or create your own data sets; I think I mentioned that you need to test on 20 data sets. The purpose of my Jackknife regression is to do good, robust regression even if your data is crappy, or if you are not a statistician. It's linear regression "for the masses". And because of this, it can be automated without having to worry about lack of fit and other issues.</p>
<p><em>Meag, 2014-04-03:</em></p>
<p>I see this idea is now a contest. However, the data it is applied to is not a useful one to show the desired effects. When the data is pulled from a fixed linear relationship with cross-correlation between the explanatory variables, classical regression is guaranteed to be optimal in minimizing squared loss. See an econometrics textbook such as Judge et al. for the proofs that OLS is the Best Linear Unbiased Estimator (BLUE) in such cases.</p>
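The claim that OLS minimizes in-sample squared loss (by construction, regardless of cross-correlation among predictors) is easy to check on synthetic data; the coefficients and noise level below are arbitrary choices for the illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Cross-correlated explanatory variables driven by a common factor z.
z = rng.standard_normal(n)
X = np.column_stack([z + 0.3 * rng.standard_normal(n) for _ in range(3)])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.standard_normal(n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
sse_ols = np.sum((y - X @ beta_ols) ** 2)

# Randomly perturbed coefficient vectors always do worse in-sample:
# OLS is the least-squares minimizer by definition.
sse_other = [np.sum((y - X @ (beta_ols + 0.1 * rng.standard_normal(3))) ** 2)
             for _ in range(100)]
print(min(sse_other) > sse_ols)  # True
```

This only demonstrates in-sample optimality; it says nothing about out-of-sample stability, which is where the debate in this thread actually lives.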
<p></p>
<p>Other problems:</p>
<p>Can you fix the conjecture M = 1? It is clearly wrong unless N = 1.</p>
<p>How could one address what Steve points out: when the univariate b_i regression coefficient has a positive sign but the corresponding coefficient in the data-generating process is negative, your two-step estimation of a_i is biased, and then, in order to get unbiased predictions of Y, the other coefficients must be biased as well.</p>
<p>In sum, when the data is drawn according to this relationship </p>
<p><span>1*$x + 0.5*$y - 0.3*$z - 2*$u</span></p>
<p>the two-step 'jackknife' is dominated by countless alternatives, irrespective of how much cross-correlation is added.</p>
<p>I think the basic problem is that this statement is misleading:</p>
<p><span> When independent variables are correlated ... traditional linear regression does not work as well.</span></p>
<p>OLS is typically ideal for such cases as long as linearity is upheld and the independent variable candidates are known -- exactly how your data set is constructed. In the pathological 99%-correlation cases, Bayesian/ridge regression techniques, or simply throwing away the effectively redundant variables, are typically acceptable.</p>
<p>Finally, to echo Steve, we should never claim something is useless based on a single simulation test. In fact, your simulation is a case where the presumed-useless statistical method wins the horse race. Theory actually predicted such a result. Notch one up for 200 years of statistical research.</p>
<p><em>Steve Simon, 2014-03-28:</em></p>
<p>So you've decided that a methodology that has worked well for 200 years is "useless"? And your proof that it is useless is a single simulation test of a competing methodology?</p>
<p></p>
<p>Sorry, but I'm not convinced yet. The problem that I see is that you can't accommodate the fairly common case where the univariate regression coefficient is of a different sign than the corresponding coefficient in a multivariate model. M multiplies all of the variables, so it can't flip the sign of just some of the coefficients. Splitting into M and M' doesn't help much, because looking at 2^n groupings is more computationally complex than matrix inversion.</p>
<p></p>
<p>It looks like M works somewhat like a shrinkage parameter, so you might consider how this compares to ridge regression. You might want to compare it to Diagonal Linear Discriminant Analysis, which also avoids the inversion of a large matrix.</p>
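The ridge comparison suggested above can be sketched as follows: a textbook ridge estimator on near-collinear synthetic data, whose coefficient vector shrinks smoothly as the penalty grows. This is generic ridge regression, not the M of the post:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Two nearly identical predictors: the case where plain OLS is unstable.
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
y = x1 + rng.standard_normal(n)

def ridge(X, y, lam):
    """Ridge solution beta = (X'X + lam*I)^{-1} X'y (via solve, no explicit inverse)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# The coefficient norm shrinks monotonically as the penalty increases.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)
```

If M behaves like a single scalar shrinkage factor, plotting it against the ridge path for matched penalties would make the comparison concrete.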
<p></p>
<p>I don't mean to sound so negative. There's way too much hype in your post, but your methodology may have merit, if only for very large problems where matrix inversion is impractical.</p>
<p></p>
<p>Steve Simon, <a href="http://www.pmean.com" target="_blank">www.pmean.com</a></p>
<p></p>
<p>P.S. Please use something other than Excel for your regression calculations.</p>
<p></p>
<p><a href="http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction/#excel" target="_blank">http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction/#excel</a></p>
<p><a href="http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf" target="_blank">http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf</a></p>
<p><a href="http://www.practicalstats.com/xlsstats/excelstats.html" target="_blank">http://www.practicalstats.com/xlsstats/excelstats.html</a></p>
<p><a href="http://homepage.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf" target="_blank">http://homepage.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf</a></p>
<p></p>
<p>Some of these references are dated, but I have seen no evidence that Microsoft cares about or has tried to correct any of the problems noted here.</p>
<p><em>Sebastian Sohr, 2014-03-23:</em></p>
<p>Dear Vincent, I took your article as an incentive to refresh my knowledge on regression.</p>
<p>I came across the following points:</p>
<p>Part 1 (beautiful case).</p>
<p>I think the prerequisites are so strong that your simple solution is identical with the linear regression solution:</p>
<p>Let X = (X_1, ..., X_n), E(X_i) = 0 (zero expectation), and cov(X_i, X_j) = 0 for i < j, which means that the X_i are orthogonal:</p>
<p>X' X = diag ( var(X_1), ..., var(X_n) ) where X' is the transposed matrix and diag(...) denotes the diagonal matrix.</p>
<p>Also, you have X'Y = ( cov(Y, X_1),...,cov(Y, X_n) ).</p>
<p>Taking the linear regression solution a = (X' X)^{-1} X'Y, where (...)^{-1} denotes the inverse matrix, you get exactly</p>
<p>a_i = cov(Y, X_i) / var(X_i).</p>
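This identity is easy to verify numerically. Here is a sketch with exactly orthogonalized zero-mean columns; the dimensions and coefficients are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Zero-mean predictors, then orthogonalize the columns so X'X is diagonal
# (the "beautiful case": E(X_i) = 0 and cov(X_i, X_j) = 0 for i != j).
X = rng.standard_normal((n, 3))
X -= X.mean(axis=0)
Q, _ = np.linalg.qr(X)
X = Q * np.sqrt(n)  # rescale so each column has variance ~ 1
y = X @ np.array([1.0, 0.5, -0.3]) + 0.1 * rng.standard_normal(n)

# Full linear regression solution a = (X'X)^{-1} X'y ...
a_ols = np.linalg.solve(X.T @ X, X.T @ y)

# ... coincides with the componentwise formula a_i = cov(y, X_i) / var(X_i).
a_simple = np.array([np.cov(y, X[:, i])[0, 1] / np.var(X[:, i], ddof=1)
                     for i in range(3)])
print(np.max(np.abs(a_ols - a_simple)))  # ~ 0, up to floating-point noise
```

With correlated columns the two formulas diverge, which is exactly the point of the discussion below.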
<p></p>
<p>Part 2.</p>
<p>I think correlated X_i's are not ugly. They contain observations in such a way that the entries X_j,i (j = 1, ..., number of observations) belong to the same entity (e.g. the same patient) for the same j. These should almost always be correlated. Concluding from this that linear regression does not work sounds a little bit too strong to me.</p>
<p>Concerning the multiplier M: it is not 1 in the beautiful case. If you take n = number of observations here, X is invertible; therefore, for the linear regression solution, which minimizes Z, we have</p>
<p>Y - Xa = Y - X(X'X)^{-1}X'Y = Y - X X^{-1} X'^{-1} X' Y = Y - Y = 0.</p>
<p>Another remark:</p>
<p>If your method is correct and delivers the exact solution in the deterministic case (Z = 0), it should also apply to linear equation systems of the form Y = Xa. Your formula would then also deliver the inverse of a matrix X.</p>
<p>All linear algebra textbooks have to be rewritten ;-) .</p>
<p>If I'm wrong somewhere or everywhere, please tell.</p>
<p><em>Jean-Marc Patenaude, 2014-03-17:</em></p>
<p>Vincent, thanks for sharing the link on your article on correlation and the use of the 1-norm to address some of its limitations. I found it interesting and your point well said regarding how to use it in the analysis of data buckets or clusters. </p>
<p></p>
<p>The common theme once again is to ensure that one fully understands the mathematical properties and limitations of the model or statistic being considered (such as correlation), and does not apply it blindly. In particular, Pearson's correlation is one that people grossly overuse without realizing its very important limitations (despite its usefulness - let's not forget that). To illustrate this, let us all remember Anscombe's quartet...</p>
<p><a href="http://en.wikipedia.org/wiki/Anscombe%27s_quartet" target="_blank">http://en.wikipedia.org/wiki/Anscombe's_quartet</a></p>
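Anscombe's quartet is worth verifying by hand: the four classic datasets below share virtually identical Pearson correlations (about 0.816) despite wildly different shapes, which only a scatter plot reveals:

```python
import numpy as np

# Anscombe's quartet: four datasets with (nearly) identical summary
# statistics but radically different structure.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

# Pearson's r rounds to 0.82 for all four datasets.
corrs = [round(np.corrcoef(x, y)[0, 1], 2) for x, y in quartet]
print(corrs)  # [0.82, 0.82, 0.82, 0.82]
```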
<p><em>Vincent Granville, 2014-03-17:</em></p>
<p>Jean-Marc, I could not agree more with you. I have to admit, here I've used the lazy approach, which consists of minimizing variance - what statisticians have been doing for a few centuries - that is, using the 2-norm to quickly get an easy solution through simple mathematics / matrix algebra. But the 1-norm is far superior, see my article <a href="http://www.analyticbridge.com/profiles/blogs/correlation-and-r-squared-for-big-data" target="_blank">on 1-norm correlations</a>.</p>