*Originally posted on DataScienceCentral, by Dr. Granville. Click here to read the original article and comments.*

This article discusses a far more general version of the technique described in our article *The best kept secret about regression*. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular those with highly correlated independent variables.

Our goal is to produce a regression tool that can be used as a black box: very robust, parameter-free, and easy for non-statisticians to use and interpret. It is part of a bigger project: automating many fundamental data science tasks to make them easy, scalable and cheap for data consumers, not just for data experts. Our previous attempts at automation include:

- Data driven confidence intervals
- Hidden decision trees (HDT)
- Fast Combinatorial Feature Selection
- Map-Reduce applied to log file processing
- Building giant taxonomies

Readers are invited to further formalize the technology outlined here, and challenge my proposed methodology.

**1. Introduction**

As in our previous paper, without loss of generality, we focus on linear regression with centered variables (with zero mean), and no intercept. Generalization to logistic regression or to non-centered variables is straightforward.

Thus we are still dealing with the following regression framework:

Y = a_1 * X_1 + ... + a_n * X_n + noise

Remember that the solution proposed in our previous paper was

- b_i = cov(Y, X_i) / var(X_i), i = 1, ..., n
- a_i = M * b_i, i = 1, ..., n
- M (a real number, not a matrix) is chosen to minimize var(Z), with Z = Y - (a_1 * X_1 + ... + a_n * X_n)
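
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `scaled_marginal_regression` and the use of NumPy are choices made here. Since var(Y - M * T) is quadratic in M, with T = b_1 * X_1 + ... + b_n * X_n, the sketch uses the closed-form minimizer M = cov(Y, T) / var(T).

```python
import numpy as np

def scaled_marginal_regression(X, y):
    """Return (a, b, M) for the single-multiplier regression.

    X: (n_samples, n_features) array; y: (n_samples,) array.
    The name and signature are hypothetical, chosen for this sketch.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Center the variables, matching the zero-mean, no-intercept setup.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    # b_i = cov(Y, X_i) / var(X_i); the 1/n factors cancel in the ratio.
    b = (Xc.T @ yc) / (Xc ** 2).sum(axis=0)

    # T = b_1 * X_1 + ... + b_n * X_n, so that Z = Y - M * T.
    t = Xc @ b

    # var(Y - M * T) is quadratic in M; its minimum is at cov(Y, T) / var(T).
    M = (t @ yc) / (t @ t)

    # a_i = M * b_i
    return M * b, b, M

# Illustrative usage on synthetic data (not the article's spreadsheet).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)
a_demo, b_demo, M_demo = scaled_marginal_regression(X_demo, y_demo)
```

Only one scalar, M, is estimated jointly; each b_i depends on a single feature, so no matrix inversion is needed.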

When cov(X_i, X_j) = 0 for i < j, my regression and the classical regression produce identical regression coefficients, and M = 1.
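
A short derivation shows why M = 1 in the uncorrelated case. It uses only the definitions above; the symbol T = b_1 * X_1 + ... + b_n * X_n is introduced here for convenience, so that Z = Y - M * T.

```latex
\begin{aligned}
\operatorname{var}(Z) &= \operatorname{var}(Y) - 2M\,\operatorname{cov}(Y,T) + M^2\,\operatorname{var}(T)
\;\Longrightarrow\;
M = \frac{\operatorname{cov}(Y,T)}{\operatorname{var}(T)}, \\[4pt]
\operatorname{cov}(Y,T) &= \sum_{i=1}^{n} b_i\,\operatorname{cov}(Y,X_i)
= \sum_{i=1}^{n} \frac{\operatorname{cov}(Y,X_i)^2}{\operatorname{var}(X_i)}, \\[4pt]
\operatorname{var}(T) &= \sum_{i=1}^{n} b_i^2\,\operatorname{var}(X_i)
= \sum_{i=1}^{n} \frac{\operatorname{cov}(Y,X_i)^2}{\operatorname{var}(X_i)}
\quad \text{when } \operatorname{cov}(X_i,X_j)=0 \text{ for } i<j .
\end{aligned}
```

The numerator and denominator coincide, so M = 1. In the same case the normal equations of the classical regression decouple into one equation per variable, each with solution cov(Y, X_i) / var(X_i) = b_i, which is why the two sets of coefficients agree.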

Terminology: Z is the noise, Y is the (observed) response, the a_i's are the regression coefficients, and S = a_1 * X_1 + ... + a_n * X_n is the estimated or predicted response. The X_i's are the independent variables or features.

**2. Re-visiting our previous data set**

I have added more cross-correlations to the previous simulated dataset consisting of 4 independent variables, still denoted as x, y, z, u in the new, updated attached spreadsheet. Now corr(x, y) = 0.99.
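
For readers without the spreadsheet, a pair of variables with a correlation close to 0.99 can be simulated as below. This is a hypothetical sketch, not the procedure used to build the attached data set: the function name `correlated_pair`, the normal distribution, the sample size and the seed are all assumptions made here.

```python
import numpy as np

def correlated_pair(n, rho, seed=0):
    """Simulate two standard normal variables with population correlation rho."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    e = rng.normal(size=n)  # independent of x
    # var(y) = rho^2 + (1 - rho^2) = 1 and cov(x, y) = rho, so corr(x, y) = rho.
    y = rho * x + np.sqrt(1.0 - rho ** 2) * e
    return x, y

x, y = correlated_pair(10_000, 0.99)
sample_corr = np.corrcoef(x, y)[0, 1]  # close to 0.99, up to sampling error
```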
