Subscribe to DSC Newsletter

Originally posted on DataSciebceCentral, by Dr. GranvilleClick here to read original article and comments.

This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with highly correlated independent variables.

Our goal is to produce a regression tool that can be used as a black box, be very robust and parameter-free, and usable and easy-to-interpret by non-statisticians. It is part of a bigger project: automating many fundamental data science tasks, to make it easy, scalable and cheap for data consumers, not just for data experts. Our previous attempts at automation include

Readers are invited to further formalize the technology outlined here, and challenge my proposed methodology.

1. Introduction

As in our previous paper, without loss of generality, we focus on linear regression with centered variables (with zero mean), and no intercept. Generalization to logistic or non-centered variables is straightforward.

Thus we are still dealing with the following regression framework:

Y = a_1 * X_1 + ... + a_n * X_n + noise

Remember that the solution proposed in our previous paper was

  • b_i = cov(Y, X_i) / var(X_i), i = 1, ..., n
  • a_i = M * b_i, i = 1, ..., n
  • M (a real number, not a matrix) is chosen to minimize var(Z), with Z = Y - a_1 * X_1 + ... + a_n * X_n

When cov(X_i, X_j) = 0 for i < j, my regression and the classical regression produce identical regression coefficients, and M = 1.

Terminology: Z is the noise, Y is the (observed) response, the a_i's are the regression coefficients, and and S = a_1 * X_1 + ... + a_n * X_n is the estimated or predicted response. The X_i's are the independent variables or features.

2. Re-visiting our previous data set

I have added more cross-correlations to the previous simulated dataset consisting of 4 independent variables, still denoted as x, y, z, u in the new, updated attached spreadsheet. Now corr(x, y) = 0.99.

Read full article.

Views: 1082

Follow Us

Videos

  • Add Videos
  • View All

Resources

© 2017   Data Science Central   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service