This article attempts to detect dependent variables using a non-linear method.

I'm going to apply the method for checking variable dependency that was introduced in my previous post. Because the "dependency" I get with this rule is not true dependence as defined in probability theory, I will call the variables *practically dependent at a confidence level "alpha"*, where "alpha" is the confidence level of the bootstrapped confidence intervals.

I will modify the idea slightly: I won’t compute means with interval lengths, because it is sufficient to verify that the confidence intervals for Pr(A and B) and Pr(A)Pr(B) do not intersect. For this I only need the endpoints of the confidence intervals. In addition, I’ve noted that if a variable has only two values, it is enough to check practical dependency for just one of them, because the relative frequencies of the two values are complementary (they sum to 1).
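The check described above can be sketched as follows. This is only an illustration in Python, not the R code used in this series; the percentile-style intervals, the function names (`percentile_interval`, `intervals_intersect`, `practically_dependent`), and the choice to compute both quantities from the same resample are all assumptions made for the sketch.

```python
import random

def percentile_interval(values, alpha):
    """Return (lower, upper) endpoints of a percentile interval at level alpha."""
    ordered = sorted(values)
    k = int((1.0 - alpha) / 2.0 * (len(ordered) - 1))  # observations cut from each tail
    return ordered[k], ordered[len(ordered) - 1 - k]

def intervals_intersect(first, second):
    """True if the closed intervals (lo, hi) overlap."""
    return first[0] <= second[1] and second[0] <= first[1]

def practically_dependent(x, y, a_val, b_val, alpha=0.95, n_boot=1000, seed=0):
    """Hypothetical helper: compare bootstrapped intervals for
    Pr(A and B) and Pr(A)Pr(B), where A is x == a_val and B is y == b_val."""
    rng = random.Random(seed)
    n = len(x)
    joint, product = [], []
    for _ in range(n_boot):
        n_a = n_b = n_ab = 0
        for _ in range(n):
            i = rng.randrange(n)      # draw one row index with replacement
            is_a = x[i] == a_val
            is_b = y[i] == b_val
            n_a += is_a
            n_b += is_b
            n_ab += is_a and is_b
        joint.append(n_ab / n)
        product.append((n_a / n) * (n_b / n))
    # Only the interval endpoints are needed for the comparison.
    return not intervals_intersect(
        percentile_interval(joint, alpha),
        percentile_interval(product, alpha),
    )
```

For a two-valued variable, one call with a single value of that variable is enough, since the relative frequency of the other value is just the complement.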

I tried the “boot” package mentioned in the previous post and found that it is not convenient for really big data. It generates a huge matrix and then calculates a statistic for each column, an approach that requires a lot of memory. It is more prudent to generate one vector, calculate the statistic, and then generate the next vector, replacing the previous one.
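The one-vector-at-a-time idea might look like the sketch below. It is written in Python rather than R purely for illustration, and the names `bootstrap_statistics` and `relative_frequency` are hypothetical. Only a single resample exists in memory at any moment; what accumulates is one number per replicate.

```python
import random

def bootstrap_statistics(data, statistic, n_boot, seed=0):
    """Yield one bootstrap statistic at a time.

    Each pass builds a single resample, reduces it to a number, and then
    lets the next pass overwrite it, so no n-by-n_boot matrix is stored.
    """
    rng = random.Random(seed)
    n = len(data)
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        yield statistic(resample)

def relative_frequency(value):
    """Build a statistic: the share of observations equal to `value`."""
    def stat(sample):
        return sum(1 for v in sample if v == value) / len(sample)
    return stat

# Usage: collect the replicates for the relative frequency of the value 1.
replicates = list(bootstrap_statistics([0, 1] * 50, relative_frequency(1), n_boot=200))
```

The memory cost then grows with the number of replicates rather than with the number of replicates times the sample size.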

I’m going to use data from the KDD Cup 1998, from here. There is a training data set in text format, a data dictionary, and some other files.

I will load the data set, which is already in my working directory. Then we can look at our data set and compare it with the data dictionary, as usual.

