Hidden decision trees (HDT) is a technique patented by Dr. Granville, to score large volumes of transaction data. It blends robust logistic regression with hundreds small decision trees (each one representing for instance a specific type of fraudulent transaction) and offers significant advantages over both logistic regression and decision trees: robustness, ease of interpretation, and no tree pruning, no node splitting criteria. It makes this methodology powerful and easy to implement even for someone with no statistical background.
Hidden Decision Trees is a statistical and data mining methodology (just like logistic regression, SVM, neural networks or decision trees) to handle problems with large amounts of data, non-linearity and strongly correlated independent variables.
The technique is easy to implement in any programming language. It is more robust than decision trees or logistic regression, and helps detect natural final nodes. Implementations typically rely heavily on large, granular hash tables.
No decision tree is actually built (thus the name hidden decision trees), but the final output of a hidden decision tree procedure consists of a few hundred nodes from multiple non-overlapping small decision trees. Each of these parent (invisible) decision trees corresponds e.g. to a particular type of fraud, in fraud detection models. Interpretation is straightforward, in contrast with traditional decision trees.
The methodology was first invented in the context of credit card fraud detection, back in 2003. It is not implemented in any statistical package at this time. Frequently, hidden decision trees are combined with logistic regression in an hybrid scoring algorithm, where 80% of the transactions are scored via hidden decision trees, while the remaining 20% are scored using a compatible logistic regression type of scoring.
Hidden decision trees take advantage of the structure of large multivariate features typically observed when scoring a large number of transactions, e.g. for fraud detection. The technique is not connected with hidden Markov fields.
The model presented here is used in the context of click scoring. The purpose is to create predictive scores, where score = f(response), that is, score is a function of the response. The response is sometimes referred to as the dependent variable in statistical and predictive models.
Traditional models to be compared with hidden decision trees include logistic regression, decision trees, naïve Bayes.
Hidden decision trees (HDT) use a one-to-one mapping between scores and multivariate features. A multivariate feature is a rule combination attached to a particular transaction (that is, a vector specifying which rules are triggered, which ones are not), and is sometimes referred to as flag vector or node.
HDT fundamentals, based on typical data set:
Each top node (or multivariate feature) is a final node from a hidden decision tree. There is no need for tree pruning / splitting algorithms and criteria: HDT is straightforward, fast, and can rely on efficient hash tables (where key=feature, value=score). The top 500 nodes, used to classify (that is, score) 80% of transactions, come from multiple hidden decision trees - hidden because you never used a decision tree algorithm to produce them.
The remaining 20% transactions scored using alternate methodology (typically, logistic regression). Thus HDT is a hybrid algorithm, blending multiple, small, easy-to-interpret, invisible decision trees (final nodes only) with logistic regression.
Note that in the logistic regression, we use constrained regression coefficients. These coefficients depend on 2 or 3 top parameters and have the same sign as the correlation between the rule they represent, and the response or score. This make the regression non-sensitive to high cross correlations among the “independent” variables (rules) which are indeed not independent in this case. This approach is similar to ridge regression, logic regression or Lasso regression. The regression is used to fine tune the top parameters associated with regression coefficients. I will later in this book show that approximate solutions (we are doing approximate logistic regression here) are - if well designed - almost as accurate as exact solutions, but can be far more robust.
We are dealing with two types of scores:
To blend the scores,
HDT nodes provide an alternate segmentation of the data. One large, medium-score segment corresponds to neutral transactions (triggering no rule). Segments with very low scores correspond to specific fraud cases. Within each segment, all transactions have the same score. HDT’s provide a different type of segmentation than PCA (principal component analysis) and other analyses.
Example: Scoring Internet Traffic
The figure below shows the score distribution with a system based on 20 rules, each one having a low triggering frequency. It has the following features:
Figure 5.1: Example of score distribition based on HDT’s
The figure below compares scores with conversion rates (CR). HDT’s were applied to Internet data, scoring clicks with a score used to predict chances of conversion (a conversion being a purchase, a click out, sign-up on some landing page). Overall, we have a rather good fit.
Peaks in the figure below could mean:
Valleys could mean:
Peaks and valleys can also be cause if you blend together multiple types of conversions: traffic with 0.5% CTR together with traffic with 10% CTR.
Figure 5.2: HDT scores to predict conversions
HDT is a fast algorithm, easy to implement, can handle large data sets efficiently, and the output is easy to interpret.
It is non parametric and robust. The risk of over-fitting is small if no more than top 500 nodes are selected and ad-hoc cross validation techniques used to remove unstable nodes. It offers built-in, simple mechanism to compute confidence intervals for scores. See also next section.
HDT is hybrid algorithm to detect multiple types of structures: linear structures via the regression, and non linear structures via the top nodes.