In this article, I proposes a simple metric to measure predictive power. It is used for combinatorial feature selection, where a large number of feature combinations need to be ranked automatically and very fast, for instance in the context of transaction scoring, in order to optimize predictive models. This is about rather big data, and we would like to see an Hadoop methodology for the technology proposed here. It can easily be implemented in a Map Reduce framework. It was developed by the author in the context of credit card fraud detection, and click/keyword scoring. This material will be part of our data science apprenticeship, and included in our Wiley book.
Feature selection is a methodology used to detect the best subset of features, out of dozens or hundreds of features (also called variables or rules). By “best”, we mean with highest predictive power, a concept defined in the following subsection. In short, we want to remove duplicate features, simplify a bit the correlation structure (among features) and remove features that bring no value, such as a features taking on random values, thus lacking predictive power, or features (rules) that are almost never triggered (except if they are perfect fraud indicators when triggered).
The problem is combinatorial in nature. You want a manageable, small set of features (say 20 features) selected from (say) a set of 500 features, to run our hidden decision trees (or some other classification / scoring technique) in a way that is statistically robust. But there are 2.7 * 10^{35} combinations of 20 features out of 500, and you need to compute all of them to find the one with maximum predictive power. This problem is computationally intractable, and you need to find an alternate solution. The good thing is that you don’t need to find the absolute maximum; you just need to find a subset of 20 features that is good enough.
One way to proceed is to compute the predictive power of each feature. Then, add one feature at a time to the subset (starting with 0 feature) until you reach either
At each iteration, choose the feature to be added among the two remaining features with the highest predictive power: you will choose (among these two features) the one that increases the overall predictive power (of the subset under construction) most. Now you have reduced your computations from 2.7 * 10^{35} to 40 = 2 * 20.
Technical note: Additional step to boost predictive power. Remove one feature at a time from the subset, and replace it with a feature randomly selected from the remaining features (from outside the subset). If this new feature boosts overall predictive power of subset, keep it, and otherwise switch back to old subset. Repeat this step 10,000 times or until no more gain is achieved (whichever comes first).
Finally, you can add two or three features at a time, rather than one. Sometimes, combined features have far better predictive power than isolated features. For instance if feature A = country, with values in {USA, UK} and feature B = hour of the day, with values in {“day - Pacific Time”, “night - Pacific Time”}, both features separately have little if any predictive power. But when you combine both of them, you have a much more powerful feature: UK/night is good, USA/night is bad, UK/day is bad, and USA/day is good. Using this blended feature also reduces the risk of false positives / false negatives.
Also, in order to avoid highly granular features, use lists. So instead of having feature A = country (with 200 potential values), and feature B = IP address (with billions of potential values), use:
Predictive Power of a Feature, Cross-Validation
Here we illustrate the concept of predictive power on a subset of 2 features, to be a bit more general. Let’s say wie have two binary features A and B taking two possible values 0 or 1. Also, in the context of fraud detection, we assume that each observation in the training set is either Good (no fraud) or Bad (fraud). The fraud status (G or B) is called the response or dependent variable in statistics. The features A and B are also called rules or independent variables.
Cross validation
First, split your training set (the data where the response - B or G - is known) into two parts: control and test. Make sure that both parts are data-rich: if the test set is big (millions of observations) but contain only one or two clients (out of 200), it is data-poor and your statistical inference will be negatively impacted (low robustness) when dealing with data outside the training set. It is a good idea to use two different time periods for control and test. You are going to compute the predictive power (including rule selection) on the control data. When you have decided on a final, optimum subset of features, you will then compute the predictive power on the test data. If the drop in predictive power is significant in the test data (compared with control), something is wrong with your analysis: detect the problem, fix it, start over again. You can use multiple control and test sets: this will give you an idea of how the predictive power varies from one control to another control set. Too much variance is an issue that should be addressed.
Predictive power
Using our above example with two binary features A, B taking on two values 0, 1, we can break the observations from the control data set into 8 categories
Let denote as n_{1}, n_{2} … n_{8} the number of observations in each of these 8 categories, and let introduce the following quantities:
P_{00} = n_{5} / (n_{1} + n_{5}), P_{01} = n_{6} / (n_{2} + n_{6}), P_{10} = n_{7} / (n_{3} + n_{7}), P_{11} = n_{8} / (n_{4} + n_{8})
p = (n_{5} + n_{6} + n_{7} + n_{8}) / (n_{1} + n_{2} + … + n_{8})
Let’s assume that p, measuring the overall proportion of fraud, is less than 50% (that is, p<0.5, otherwise we can swap between fraud and non-fraud). For any r between 0 and 1, define the W function (shaped like a W), based on a parameter a (0 < a < 1, and my recommend is a = 0.5-p) as follows:
Typically, r = P_{00}, P_{01}, P_{10} or P_{11}. The W function has the following properties:
Now let’s define the predictive power:
H = P_{00} W(P_{00}) + P_{01} W(P_{01}) + P_{10} W(P_{10}) + P_{11} W(P_{11})
The function H is the predictive power for the feature subset {A, B} having four bins 00, 01, 10, and 11, corresponding to (A=0, B=0), (A=0, B=1), (A=1, B=0) and (A=1, B=1). Although H appears to be remotely related to entropy, our H was designed to satisfy nice properties, and to be parameter-driven, thanks to a. Unlike entropy, our H is not based on physical concepts or models; it is actually a synthetic (though useful) metric.
Note that the weights P_{00}… P_{11} in H guarantee that bins with low frequency (that is, low triggering rate) have low impact on H. Indeed, I recommend setting W(r) to 0 for any bin that has less than 20 observations. For instance, the triggering rate for bin 00 is (n_{1} + n_{5}) / (n_{1} + … + n_{8}), observations count is n_{1} + n_{5}, and r = P_{00} = n_{5} / (n_{1} + n_{5}) for this bin. If n_{1} + n_{5} = 0, set P_{00} to 0 and W(P_{00}) to 0. I actually recommend to do this not just if n_{1} + n_{5} = 0, but also whenever n_{1} + n_{5} < 20, especially if p is low (if p is very low, say p < 0.01, you need to over-sample bad transactions when building your training set, and weight the counts accordingly). Of course, the same rule applies to P_{01}, P_{10}, and P_{11}. Note that you should avoid feature subsets that have a large proportion of observations spread across a large number of almost empty bins, as well as feature subsets that produce a large number of empty bins: observations outside the training set are likely to belong to an empty or almost empty bin, and it leads to high-variance predictions. To avoid this drawback, stick to binary features, select up to 20 features, and use our (hybrid) hidden decision tree methodology for scoring transactions. Finally, P_{ij} is the naive estimator of the probability P(A=i, B=j) for i,j = 0,1.
The predictive power H has interesting properties:
Data structure, computations
You can pre-compute all the bin counts n_{i}’s for the top 20 features (that is, features with highest predictive power) and store them in a small hash table with at most 2 * 2^{20} entries (approx. 2 million; the factor two is because you need two measurements per bin: number of B’s, and number of G’s). An entry in this hash table would like
$Hash{01101001010110100100_G} = 56,
meaning that Bin # 01101001010110100100 has 56 good (G) observations.
The hash table is produced by parsing your training set one time, sequentially: for each observation, compute the flag vector (which rules are triggered, that is the 01101001010110100100 vector in this example), check if it’s good or bad, and update (increase count by 1) the associated hash table entry accordingly, with the following instruction:
$Hash{01101001010110100100_G}++
Then whenever you need to measure the predictive power of a subset of these 20 features, you don’t need to parse your big data set again (potentially billion of observations), but instead, just access this small hash table: this table contains all you need to build your flag vectors and compute scores, for any combination of features that is a subset of the top 20.
You can even do better than top 20, maybe top 30. While this would create a hash table with 2 billion entries, most of these entries would correspond to empty bins and thus would not be in the hash table. Your hash table might contain only 200,000,000 entries, maybe too big to fit in memory, and requiring a Map Reduce / Hadoop implementation.
Even better: build this hash table for the top 40 features. Then it will fully solve your feature selection problem described earlier. However now, your hash table could have up to 2 trillion entries. But if your dataset only has 100 billion observations, then of course your hash table cannot have more than 100 billion entries. In this case, I suggest that you create a training set with 20 million observations, so that your hash table will have at most 20 million entries (probably less than 10 million with all the empty bins), and thus can fit in memory.
You can compute the predictive power of a large number (say 100) of feature subsets by parsing the big 40-feature input hash table obtained in the previous step, then for each flag vector and G/B entry in the input hash table, loop over the 100 target feature subsets to update counts (the n_{i}’s) for these 100 feature subsets: these counts are stored / updated in an output hash table. The key in the output hash table has two components: feature ID and flag vector. You then loop over the output hash table to compute the predictive power for each feature subset. This step can be further optimized.
Related article
Comment
Thank you, Dr. Granville!
You replied to a different comment the other day, I believe.
Thank you for clarifying this.
-- Dean
I thought I corrected the typo and replied to you a few days ago. For whatever reasons, my comment and correction are gone. I corrected it one more time: it's 0 < a < 1. Hopefully the correction will stick this time.
Please correct me if I err, but I believe that there is a minor typographical error above:
"parameter a (0 b>a < 1, and my recommend is a = 0.5-p)"
Did you intend to write, "0 >a < 1"?
Please advise, at your earliest convenience, sir.
@ Vincent. I am glad that we are on the same page on this. I was wrong to make this suggestion and I am sorry about it. Once you provided the HDT reference upon which you have based your method, it became clear to everyone that we came up with very similar methods independently. Let's put all this behind us now and carry on with other, more interesting, matters.
@Dr. Z: You were the one suggesting that I am a plagiarist. Thus my reply. Of course, in reality, for the reasons that you mentioned, none of us are plagiarists.
Unless attempting to advance a technology, scientific research usually references other scientific publications and is not required to look into patents. If that were the case, all electrical and electronic engineers that do research in the field of AC-based system are plaigarising Nikola Tesla who came up with the patents related to AC power. Therefore the point is moot.
@Dr. Z: I read your dissertation. It's dated 2010, while my feature selection methodology (associated with my patented Hidden Decision Trees technology) was published and discussed in conferences several years earlier, as far back as 2003. I could not see any references to my earlier work in your dissertation, so if there is a plagiarist, it must be you.
Additionally, your methodology does not mention the computational aspects when dealing with billions of observations, it does not use my predictive power metric which is very different from your discernibly criteria, and your approach is not combinatorial. It seems that the only things that that we have in common are:
Don't get me wrong. I really would like to see this method in your book. I'm only asking for a reference to my work, which I have to say, took a lot of time and effort to do, considering my programming skills at that time.
http://www.dcs.bbk.ac.uk/research/recentphds/voulgaris.pdf
Chapter 5, section 5.1
Is that good enough a reference or should I point out the particular MATLAB code?
Hi Dr. Z,
You are welcome to provide references. I haven't read and I am not aware of your work.
Vincent
© 2016 Data Science Central Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central