In this post I will sometimes use the term “variable” for a “feature” (“predictor”) or an “outcome” (“predicted value”).

The question of variable dependencies in a particular data set is important because the answer can help reduce the number of predictors used in a model. It can also tell us which feature is not helpful for building the model, even though that feature may still be useful for engineering another predictor. For example, sometimes it is better to compute speed than to use raw distance values. In addition, some standard algorithms assume that features are independent, and it is useful to know how close to reality that assumption is.

The standard way to check dependencies between variables is to compute their covariance matrix. But it captures only linear dependencies: if a dependency is not linear, the covariance matrix may not pick it up. There are numerous well-known examples, so I will not repeat them here.

Let us take a different approach. The definition of independent events is the following equality:

**Pr**(A and B)=**Pr**(A)**Pr**(B).

Hence for dependent events we should have an inequality. A simple measure of this disparity is the absolute value of the difference between the left-hand side and the right-hand side:

|**Pr**(A and B)−**Pr**(A)**Pr**(B)|.
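
As a rough illustration, the quantity above can be estimated from data by replacing each probability with an observed frequency. The sketch below is only one way to do it: the function name `dependence_gap`, the synthetic data, and the choice of events (thresholds at 1) are assumptions made for the example, not part of the original method.

```python
import numpy as np

def dependence_gap(a, b):
    """Estimate |Pr(A and B) - Pr(A) Pr(B)| from two boolean indicator arrays.

    a[i] and b[i] say whether events A and B occurred for observation i.
    """
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    p_a = a.mean()          # frequency estimate of Pr(A)
    p_b = b.mean()          # frequency estimate of Pr(B)
    p_ab = (a & b).mean()   # frequency estimate of Pr(A and B)
    return abs(p_ab - p_a * p_b)

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x ** 2                    # a nonlinear function of x
z = rng.normal(size=10_000)   # generated independently of x

event_a = np.abs(x) > 1
gap_dependent = dependence_gap(event_a, y > 1)    # events tied to the same x
gap_independent = dependence_gap(event_a, z > 1)  # events from independent draws
```

Here `gap_dependent` should come out clearly away from zero, while `gap_independent` should be small but, being an estimate, not exactly zero.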

Since in Data Science we work with estimates of probabilities, exact equality in the first formula is unlikely anyway. The question is: how far from zero must the difference in the second formula be for us to believe that the variables under consideration are dependent?

Well, in Data Science we can estimate bounds for a particular value with confidence intervals computed from the given data. For example, in R this can be done with the package “boot”, and in Python with “scikits.bootstrap”. Thus confidence intervals for **Pr**(A and B), **Pr**(A) and **Pr**(B) can be estimated at a desired confidence level. What is left to work out is a confidence interval for the product **Pr**(A)**Pr**(B).
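
To make the bootstrap step concrete, here is a minimal sketch of a percentile bootstrap written with plain NumPy rather than the packages named above. The helper `bootstrap_ci`, its parameters, and the synthetic indicator arrays are hypothetical names and data chosen for illustration. A dedicated package may use a more refined interval (for example a bias-corrected one), so treat this as an approximation of the idea rather than a substitute.

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values).

    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(values)
    # Each row holds the indices of one resample drawn with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    stats = np.array([stat(values[row]) for row in idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Synthetic indicator arrays for events A and B (illustrative assumption).
rng = np.random.default_rng(1)
a = rng.random(1000) < 0.4
b = rng.random(1000) < 0.5

ci_a = bootstrap_ci(a)        # interval for Pr(A)
ci_b = bootstrap_ci(b)        # interval for Pr(B)
ci_ab = bootstrap_ci(a & b)   # interval for Pr(A and B)
```

The mean of a 0/1 indicator array is the observed frequency of the event, which is why `np.mean` serves as the statistic here.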

To estimate bounds for the product we can use a standard approach from Numerical Analysis, the one used to compute the accumulated error of a calculation caused by truncation errors.
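
One simple way to bound a product, sketched here as an assumption about what such an approach could look like, is interval arithmetic. If **Pr**(A) lies in [a_lo, a_hi] and **Pr**(B) lies in [b_lo, b_hi], and all bounds are non-negative, then the product lies in [a_lo·b_lo, a_hi·b_hi]. The function names and the numeric intervals below are made up for illustration. Note that if each input interval has confidence level 1 − α, the product interval is only guaranteed a level of at least 1 − 2α.

```python
def product_interval(a_ci, b_ci):
    """Bounds for a*b given bounds (lo, hi) for a and for b.

    Valid for non-negative bounds, which always holds for probabilities.
    """
    a_lo, a_hi = a_ci
    b_lo, b_hi = b_ci
    if min(a_lo, b_lo) < 0:
        raise ValueError("bounds must be non-negative")
    return a_lo * b_lo, a_hi * b_hi

def intervals_overlap(u, v):
    """True if the closed intervals u and v share at least one point."""
    return u[0] <= v[1] and v[0] <= u[1]

# Made-up intervals for Pr(A) and Pr(B), for illustration only.
prod_ci = product_interval((0.38, 0.42), (0.48, 0.52))

# Compare with two made-up intervals for Pr(A and B).
consistent = intervals_overlap(prod_ci, (0.19, 0.23))   # overlap: no evidence of dependence
separated = intervals_overlap(prod_ci, (0.30, 0.34))    # no overlap: suggests dependence
```

If the interval for **Pr**(A and B) does not overlap the interval for the product, the data suggest that the events are dependent. An overlap does not prove independence.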

© 2020 Data Science Central ® Powered by
