One of the most intuitive and popular methods of data mining that provides explicit rules for classification and copes well with heterogeneous data, missing data, and nonlinear effects is decision tree. It predicts the target value of an item by mapping observations about the item.
You can perform either classification or regression tasks here. For example, identifying fraudulent transactions using credit cards would be a classification task while forecasting prices of stock would be regression task.
Following are the common areas for applying decision trees:
Decision tree technique is used to detect the criteria for dividing individual items of a group into n predetermined classes (Often, n=2 represents a balanced tree, which means a largest of two child nodes for each parent node.)
Firstly, a variable is taken as the root node. This variable should best separate the classes. Then, each individual variable is divided into the given classes such that subpopulations called as nodes are generated. Same operation repeats on each new node obtained until no further separation of individuals are possible. Construction is such that each terminal node consists of individuals of a single class.
A tree with each node having no more than two child nodes is called binary tree. The first node is called root and the terminal nodes are known as leaves.
To create a decision tree, you need to follow certain steps:
i) Choosing a Variable – Choice of the variable that best separates the individuals of each class depends on the type of decision tree. The Precise criterion for choice of variable and separation condition on a variable depends on the type of tree.
A binary variable allows single separation condition while for a continuous variable having n separate values, there are n-1 possible separation conditions.
The condition for separation is:
X <= mean(xk, xk+1)
When best separation is found, it is applied and then the operation is repeated on each node to increase the discrimination.
The density of node is the ratio of its number of individuals to the total number of population.
ii) Once the best separation is found, classes are then split to create child nodes. A variable is taken out at this step also. To choose the best separation, used criteria are:
iii) The X2 Test – To test the independence of two variables X and Y, X2 test is used. If
Oij gives the term on the left-hand side of the equality symbol and Tij gives the term on the right, the test of independence of X and Y is X2, which is calculated by using the following formula:
The number of degrees of freedom can be calculated as follows:
p = (number of rows – 1) X (number of columns – 1)
i) The Gini Index – It is a measurement of the purity of nodes. It is used for all types of dependent variables and is calculated by using the following formula:
In the preceding formula: fi, i=1, . ., p, are the relative frequencies in the node of the p classes to be predicted.
More evenly distributed the classes are in the node, higher the Gini index will be. As the purity of node increases, Gini index decreases. Gini index measures the probability that 2 individuals picked at random with replacement from a node belong to two different classes.
ii) Assigning Data to Nodes – Once the tree is constructed and division criterion has been established, each individual can be assigned to exactly one leaf. This is determined by the values of the independent variables for the individual.
A leaf is assigned to a class if the cost of assigning that leaf to any other class is higher than assigning the leaf to the current class.
The cost is calculated as:
Starting with an error rate of each leaf, error rate of the tree also called as the total cost of tree or risk of the tree is calculated.
iii) Pruning the Tree: When you have deep decision trees, you may need to do pruning because they may contain some irrelevant nodes in the leaves. An algorithm is said to be good if it creates a largest-sized tree and automatically prunes it after detecting the optimal pruning threshold. It is necessary to use Cross-validation and combine error rates found for all possible subtrees to choose the best subtree.
It is important to shorten the branches of very deep trees to avoid creating very small nodes with no real statistical significance.
Decision trees belong to a class of recursive partitioning algorithms that are simple to describe and implement. For each decision tree algorithms described earlier, the algorithm steps are as follows:
The stop condition may be as complicated as a statistical significance test or as simple as a least record count.
Decision trees are nonlinear predictors. This means that the decision boundary between target variable classes is nonlinear. The extent of nonlinearities depends on a number of splits in the tree.
As tree becomes more complex by increasing its depth, more piecewise constant separators are built into the decision boundary to provide nonlinear separation.
Decision trees are based on forwarding selection mechanism; so you cannot visit a split once it is created.
Certain guidelines should be followed for creating decision trees:
Decision Tree Options
There are three most Common Decision Tree Algorithms:
CART is invented in 1984 by L Breiman, JH Friedman, RA Olshen and CJ Stone and is one of the most effective and widely used decision trees. It uses the Gini index to find the best separation of each node.
CART has two primary benefits; generality and performance:
i) Generality – It means many categories of the dependent variable can be finite or unlimited, and CART can be used for both classification and prediction – there is an appropriate node splitting criterion for each type of problem.
Generality is enhanced by the capacity to process missing values by replacing each variable with an equally splitting variable(Provide approximately same purity of nodes as an original variable) or an equally reducing variable(Distribute individuals in an approximately same way as an original variable).
ii) Performance – The performance of CART is primarily due to its pruning mechanism. Continuing node splitting process constructs Largest trees as far as possible. The algorithm then deduces many nested subtrees by successive pruning operations.
CART has the following drawbacks as well:
The Australian researcher J. Ross Quinlan develope C5.0 as an improvement of his earlier trees ID3 and C4.5 but it is less popular than CART tree.
It includes a procedure for converting trees into sets of rules and aims to generalize each rule by eliminating the conditions which do not decrease the error rate.
A special feature of C5.0 is its ability to separate the population into more than two subpopulations at each step; it is not binary. This is because of its treatment of the qualitative variables which give rise to a child node for each category.
The drawback of this tree is that the frequencies of the nodes decrease more rapidly, together with their statistical reliability and their generalizing capacity.
The principle of CHAID tree was stated in 1975 by JA Hartigan and algorithm was devised in 1980 by GV Kass.
CHAID is a product of the first decision tree, the AID tree (1963) of Morgan and Sonquist, but the latter was based on the principle of analysis of variance for handling continuous dependent variables while producing binary trees.
Unlike CART, CHAID tree does not replace missing values with equally splitting or equally reducing values. It handles all the missing values as a single class which it may merge with another class if appropriate.
To find the most significant variable of each node, the CHAID tree uses the X2test. It is applicable only with discrete or qualitative independent variables. Creating a CHAID tree includes the following steps:
As CHAID is not binary, it produces trees that tend to be wider rather than deeper. It has no pruning function. So the construction stops on the creation of the largest tree and thus reaches the stop criteria.
CHAID is useful for transforming quantitative data into qualitative data of continuous variables. For it, you need to do only 1-3 steps discussed earlier:
A comparison of the three mechanisms is given in the table:
|Tree Algorithm||Splitting Criterion||Input Variables||Target Variable||Binary or Multi-Way Splits||Complexity Regularization||Handling of Class Imbalance|
|CART||Gini Index||Categorical or Continuous||Categorical or Continuous||Binary||Pruning||Priors or Costs|
|C5||Gain Ratio, based on Entropy||Categorical or Continuous||Categorical||Binary or Multi-Way||Pruning||Misclassification Costs|
|CHAID||Chi-square Test||Categorical||Categorical||Binary or Multi-Way||Significance Test||Misclassification Costs|
Hence, CART and CH5.0 decision trees are similar in several ways. They both build trees to full size, deliberately over fitting the tree and then they prune the tree back to the depth that trades off error with complexity in a suitable manner. They both can use continuous and categorical input variables.
CHAID splits only if 2 or more subgroups split creates are statistically significant via the chi-square test. It is must to group continuous inputs and output into categorical values. This is done automatically by the software during training.
Decision trees that are for classification can be used for prediction. It achieves this by changing the node splitting criterion. The aim of doing this is that:
This means you must choose child nodes that cut intra-class variance and maximize interclass variance.
To explain the above points, let us take an example of a CHAID tree. It divides 163 countries into five groups on the basis of the difference in GNP per citizen. The most discriminating criterion, here, is the energy consumption. The group with medium GNP split again by life expectancy.
You need a package for building decision trees in R. This Is called rpart. It is used for classification by generating a decision and regression trees. This depends on recursive partitioning algorithm. It helps the user identify the structure of data; develop decision rules for decision trees. R builds decision trees as a two stage process as follows:
Mcycle dataset shows the relationship between acceleration and time. The aim is to predict acceleration as a function of time. Data is as below:
Perform the following steps for building an R decision trees for the given motorcycle data.
Loading the Data of mcycle as below:
Creating Scatterplot for acceleration and times as below:
Generating the Tree and Storing Result in mct object as below:
Plot the Graph for the Tree
Decision trees are one of the most popular statistical techniques. The advantages of using R decision trees are as follows:
The disadvantages of using R decision trees are as follows: