This article was written by Stacey Ronaghan.
This post attempts to consolidate information on tree algorithms and their implementations in Scikit-learn and Spark. In particular, it was written to provide clarification on how feature importance is calculated.
There are many great resources online discussing how decision trees and random forests are created and this post is not intended to be that. Although it includes short definitions for context, it assumes the reader has a grasp on these concepts and wishes to know how the algorithms are implemented in Scikit-learn and Spark.
So, let’s start with…
Decision trees learn how to best split the dataset into smaller and smaller subsets to predict the target value. The condition is represented as the “leaf” (node) and the decision as “branches” (edges). This splitting process continues until no further gain can be made or a preset rule is met, e.g. the maximum depth of the tree is reached.
Decision Tree Algorithms
There are multiple algorithms and the scikit-learn documentation provides an overview of a few of these.
So what do Scikit-learn and Spark use?
Scikit-learn documentation states it is using “an optimized version of the CART algorithm”. Whilst not explicitly mentioned in the documentation, is has been inferred that Spark is using ID3 with CART.
So let’s focus on these two — ID3 and CART.
The algorithm creates a multi-way tree — each node can have two or more edges — finding the categorical feature that will maximize the information gain using the impurity criterion entropy. Not only can it not handle numerical features, it is only appropriate for classification problems.
To read the rest of the article, click here.