This article attempts to detect dependent variables with a non-linear method.
I'm going to apply a method for checking variable dependency that I introduced in my previous post. Because the "dependency" I get with this rule is not true dependence as defined in probability theory, I will call variables practically dependent at a confidence level "alpha", where "alpha" is the confidence level of the bootstrapped confidence intervals.
I will modify the idea slightly: I won’t compute means and interval lengths, because it is sufficient to verify that the confidence intervals for Pr(A and B) and Pr(A)Pr(B) do not intersect. For this I only need the confidence interval endpoints. In addition, I’ve noticed that if a variable has only two values, it is enough to check one of them for practical dependency, because the relative frequencies of the two values are complementary (they sum to 1).
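The endpoint check can be sketched in a few lines. This is a minimal illustration in Python rather than code from this post; the function names and the interval endpoints are made up for the example, and the intervals are assumed to be closed (low, high) pairs.

```python
def intervals_disjoint(ci_a, ci_b):
    """Return True when two closed intervals (low, high) share no point."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    return high_a < low_b or high_b < low_a


def practically_dependent(ci_joint, ci_product):
    """Flag practical dependency: the interval for Pr(A and B) and the
    interval for Pr(A)Pr(B) do not intersect."""
    return intervals_disjoint(ci_joint, ci_product)


# Hypothetical endpoints, for illustration only.
print(practically_dependent((0.30, 0.34), (0.22, 0.27)))  # True: no overlap
print(practically_dependent((0.24, 0.30), (0.22, 0.27)))  # False: they overlap
```

Only the four endpoints enter the decision, so nothing else about the bootstrap distribution has to be kept.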
I tried the “boot” package mentioned in the previous post and found that it is not convenient for really big data. It generates a huge matrix and then calculates a statistic for each column, which requires a lot of memory. It is more economical to generate one vector, calculate the statistic, and then generate the next vector, replacing the previous one.
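The one-vector-at-a-time idea might look like the sketch below. It is written in Python with only the standard library, not with the "boot" package, and it makes several assumptions that are not in the post: the function names, the 0/1 indicator encoding of the events A and B, the default number of replicates, and the use of a simple percentile interval.

```python
import random


def percentile_interval(stats, alpha):
    """Simple percentile interval (an assumed choice) from bootstrap statistics."""
    ordered = sorted(stats)
    last = len(ordered) - 1
    cut = int((1.0 - alpha) / 2.0 * last)  # number of points trimmed per tail
    return ordered[cut], ordered[last - cut]


def bootstrap_intervals(a, b, n_boot=1000, alpha=0.95, seed=0):
    """Bootstrap intervals for Pr(A and B) and Pr(A)Pr(B).

    a, b: equal-length lists of 0/1 indicators for the events A and B.
    Only one resample index vector exists at a time; per replicate we keep
    just two numbers instead of a full matrix of resamples.
    """
    rng = random.Random(seed)
    n = len(a)
    joint_stats, product_stats = [], []
    for _ in range(n_boot):
        # One resample, overwritten on the next pass through the loop.
        idx = [rng.randrange(n) for _ in range(n)]
        n_a = sum(a[i] for i in idx)
        n_b = sum(b[i] for i in idx)
        n_ab = sum(a[i] * b[i] for i in idx)
        joint_stats.append(n_ab / n)
        product_stats.append((n_a / n) * (n_b / n))
    return (percentile_interval(joint_stats, alpha),
            percentile_interval(product_stats, alpha))


# Toy data, for illustration only: B is an exact copy of A.
a = [i % 2 for i in range(400)]
b = list(a)
ci_joint, ci_product = bootstrap_intervals(a, b, n_boot=200)
print(ci_joint, ci_product)
```

Memory use grows with the sample size plus the number of replicates, rather than with their product, which is the point of replacing each resample before drawing the next.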
I’m going to use data from the KDD Cup 1998, available here. It includes a training data set in text format, a data dictionary and some other files.
I will load the data set, which is already in my working directory. Then we can look at our data set and compare it with the data dictionary, as usual.