*Originally posted on DataSciebceCentral, by Dr. Granville. Click here to read original article and comments.*

This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with highly correlated independent variables.

Our goal is to produce a regression tool that can be used as a black box, be very robust and parameter-free, and usable and easy-to-interpret by non-statisticians. It is part of a bigger project: automating many fundamental data science tasks, to make it easy, scalable and cheap for data consumers, not just for data experts. Our previous attempts at automation include

- Data driven confidence intervals
- Hidden decision trees (HDT)
- Fast Combinatorial Feature Selection
- Map-Reduce applied to log file processing
- Building giant taxonomies

Readers are invited to further formalize the technology outlined here, and challenge my proposed methodology.

**1. Introduction**

As in our previous paper, without loss of generality, we focus on linear regression with centered variables (with zero mean), and no intercept. Generalization to logistic or non-centered variables is straightforward.

Thus we are still dealing with the following regression framework:

Y = a_1 * X_1 + ... + a_n * X_n + noise

Remember that the solution proposed in our previous paper was

- b_i = cov(Y, X_i) / var(X_i), i = 1, ..., n
- a_i = M * b_i, i = 1, ..., n
- M (a real number, not a matrix) is chosen to minimize var(Z), with Z = Y - a_1 * X_1 + ... + a_n * X_n

When cov(X_i, X_j) = 0 for i < j, my regression and the classical regression produce identical regression coefficients, and M = 1.

Terminology: Z is the noise, Y is the (observed) response, the a_i's are the regression coefficients, and and S = a_1 * X_1 + ... + a_n * X_n is the estimated or predicted response. The X_i's are the independent variables or features.

**2. Re-visiting our previous data set**

I have added more cross-correlations to the previous simulated dataset consisting of 4 independent variables, still denoted as x, y, z, u in the new, updated attached spreadsheet. Now corr(x, y) = 0.99.

Tags:

© 2020 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions