This article explains how to select important variables using boruta package in R. Variable Selection is an important step in a predictive modeling project. It is also called ‘Feature Selection’. Every private and public agency has started tracking data and collecting information of various attributes. It results to access to too many predictors for a predictive model. But not every variable is important for prediction of a particular task. Hence it is essential to identify important variables and remove redundant variables. Before building a predictive model, it is generally not know the exact list of important variable which returns accurate and robust model.
- Removing a redundant variable helps to improve accuracy. Similarly, inclusion of a relevant variable has a positive effect on model accuracy.
- Too many variables might result to overfitting which means model is not able to generalize pattern
- Too many variables leads to slow computation which in turns requires more memory and hardware.
Why Boruta Package?
There are a lot of packages for feature selection in R. The question arises ” What makes boruta package so special”. See the following reasons to use boruta package for feature selection.
- It works well for both classification and regression problem.
- It takes into account multi-variable relationships.
- It is an improvement on random forest variable importance measure which is a very popular method for variable selection.
- It follows an all-relevant variable selection method in which it considers all features which are relevant to the outcome variable. Whereas, most of the other variable selection algorithms follow a minimal optimal method where they rely on a small subset of features which yields a minimal error on a chosen classifier.
- It can handle interactions between variables
- It can deal with fluctuating nature of random a random forest importance measure
Perform shuffling of predictors’ values and join them with the original predictors and then build random forest on the merged dataset. Then make comparison of original variables with the randomised variables to measure variable importance. Only variables having higher importance than that of the randomised variables are considered important.
- Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using existing variables.
- Shuffle the values of added duplicate copies to remove their correlations with the target variable. It is called shadow features or permuted copies.
- Combine the original ones with shuffled copies
- Run a random forest classifier on the combined dataset and performs a variable importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each variable where higher means more important.
- Then Z score is computed. It means mean of accuracy loss divided by standard deviation of accuracy loss.
- Find the maximum Z score among shadow attributes (MZSA)
- Tag the variables as ‘unimportant’ when they have importance significantly lower than MZSA. Then we permanently remove them from the process.
- Tag the variables as ‘important’ when they have importance significantly higher than MZSA.
- Repeat the above steps for predefined number of iterations (random forest runs), or until all attributes are either tagged ‘unimportant’ or ‘important’, whichever comes first.
Difference between Boruta and Random Forest Importance Measure
When i first learnt this algorithm, this question ‘RF importance measure vs. Boruta’ made me puzzled for hours. After reading a lot about it, I figured out the exact difference between these two variable selection algorithms.
In random forest, the Z score is computed by dividing the average accuracy loss by its standard deviation. It is used as the importance measure for all the variables. But we cannot use Z Score which is calculated in random forest, as a measure for finding variable importance as this Z score is not directly related to the statistical significance of the variable importance. To workaround this problem, boruta package runs random forest on both original and random attributes and compute the importance of all variables. Since the whole process is dependent on permuted copies, we repeat random permutation procedure to get statistically robust results.
Is Boruta a solution for all?
Answer is NO. You need to test other algorithms. It is not possible to judge the best algorithm without knowing data and assumptions. Since it is an improvement on random forest variable importance measure, it should work well on most of the times.