Hi everybody, here's a summary of my study, followed by a few questions on random forests.
Population: 3300 observations, minority class 150 observations (~4.5%)
Predictors: ~70, just 1 numerical, all others boolean
I use feature selection to reduce the number of predictors.
I remove the predictors with the lowest variance and the lowest correlation with my target variable, and I also use a t-test (mean difference between the 2 classes).
I keep around 20 predictors for the 150 observations in my signal.
NB: I haven't yet used chi2 evaluation, entropy or RFE to reduce the number of predictors.
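The filtering described above could be sketched roughly as follows. This is only an illustration on synthetic data: the helper name filter_features, the variance cut-off of 0.01 and the fake boolean matrix are assumptions, not the pipeline actually used here, and only the variance filter and the t-test ranking are shown. To avoid leaking information from the target, a filter like this would normally be fitted inside each cross-validation fold rather than once on the full data.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.feature_selection import VarianceThreshold


def filter_features(X, y, min_variance=0.01, n_keep=20):
    """Hypothetical helper: drop near-constant columns, then keep the
    n_keep columns with the largest absolute Welch t-statistic."""
    # Step 1: remove columns whose variance is at or below the cut-off.
    vt = VarianceThreshold(threshold=min_variance).fit(X)
    kept = np.flatnonzero(vt.get_support())
    # Step 2: t-test on the class means, one column at a time.
    t_stat, _ = ttest_ind(X[y == 1][:, kept], X[y == 0][:, kept],
                          axis=0, equal_var=False)
    order = np.argsort(-np.abs(t_stat))
    return kept[order[:n_keep]]


# Synthetic stand-in for the real data: 3300 rows, 70 boolean columns.
rng = np.random.default_rng(0)
y = np.zeros(3300, dtype=int)
y[:150] = 1                                            # 150 minority rows
X = (rng.random((3300, 70)) < 0.3).astype(float)
X[:, 1] = 0.0                                          # a constant column
X[y == 1, 0] = (rng.random(150) < 0.8).astype(float)   # an informative column

selected = filter_features(X, y)
print(selected)
```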
The metric I use is the F1 score, in order to catch the minority class (otherwise the model classifies most observations in the majority class).
It's really difficult to get stable results on a 60/40 or 50/50 train/test split during parameter tuning.
I use cross-validation with RandomForest, with 20 stratified folds, and I tune the parameters of the model.
For the following parameters: n_estimators=120, max_depth=10, max_features='sqrt', bootstrap=False
I found 50 TP, 100 FN, 2950 TN, 200 FP
Summary: I catch 33% of the signal (minority class) and 94% of the majority class.
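For reference, the usual metrics can be recomputed directly from these four counts. The short sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
tp, fn, tn, fp = 50, 100, 2950, 200

recall = tp / (tp + fn)         # share of the minority class that is caught
specificity = tn / (tn + fp)    # share of the majority class that is kept
precision = tp / (tp + fp)      # share of predicted positives that are real
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} f1={f1:.3f}")
# prints: recall=0.333 specificity=0.937 precision=0.200 f1=0.250
```

So the 33% and 94% figures correspond to recall and specificity, while the F1 score for this confusion matrix is 0.25 because only one predicted positive in five is a true positive.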
My questions:
max_depth: higher values seem to give better results, but what does a depth of 30 mean with 20 predictors? It seems strange to me! I've read that a high depth can lead to overfitting. How can I manage this problem?
n_estimators: it's the same, higher values seem better. Can I trust the results if the value is higher than 300?
bootstrap: I sometimes get better results with bootstrap=True, but with 150 observations in my minority class, bootstrapping (resampling with replacement) doesn't seem realistic to me.
I haven't tried setting the class_weight parameter to 'balanced'.
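One way to check the bootstrap and class_weight questions empirically is to cross-validate each combination and compare the mean F1 score. The sketch below is an assumption-laden illustration: the data come from make_classification rather than the real study, 5 folds and 100 trees are used only to keep it quick, and the F1 score is computed at the default 0.5 threshold.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 3300 rows, 20 features, about 4.5% minority class.
X, y = make_classification(n_samples=3300, n_features=20, n_informative=5,
                           weights=[0.955], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for bootstrap, class_weight in product((False, True), (None, "balanced")):
    rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                                max_features="sqrt", bootstrap=bootstrap,
                                class_weight=class_weight,
                                random_state=0, n_jobs=-1)
    # Mean F1 over the folds, at the default 0.5 decision threshold.
    scores[(bootstrap, class_weight)] = cross_val_score(
        rf, X, y, cv=cv, scoring="f1").mean()

for key, value in scores.items():
    print(key, round(value, 3))
```

Differences between settings that are smaller than the fold-to-fold spread are probably noise, so it may be worth looking at the per-fold scores as well as the mean.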
I'm sorry for my English, I hope people can help me improve my results.
These are some questions that, hopefully, may help you move on:
- for the F1 score, what is the probability threshold for the classification? Is it the standard 0.5, or did you determine it from ROC curves?
- since there is one continuous feature, the trees could be quite deep, because they can use the continuous feature at many levels and split on various values. The conceptual essence of Random Forest is to construct trees that overfit individually and to combine them to reduce variance. It is better to increase max_depth and n_estimators simultaneously. In your particular case, my concern would be 3300 samples for max_depth = 20 (which allows up to 2**20 leaves, far more than there are samples). Did you try to visualize a couple of deep trees and examine whether they make sense (whether they use the continuous feature, whether the splits make sense, etc.)?
- why do you reduce the number of features? 70 is not a big number, and in theory RandomForest can deal with a large number of features. Do you get better performance with the reduced number of features?
Here are my comments:
I determine the F1 score during the parameter tuning of RandomForest. For each set of parameters (a few hundred) I determine the threshold
that gives me the best F1 score (mostly between 0.09 and 0.13). So I don't display the ROC curve, I just do the calculation.
The predicted likelihood for the majority class is concentrated at 0 with a long tail; for the minority class it is non-zero, with a peak at 0.07 and a longer tail than the majority class.
If I use the classic threshold of 0.5, I get only a few true positives and all the others are predicted as majority class.
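The threshold search described above could look roughly like this with scikit-learn. It is a sketch under assumptions: synthetic data, 5 folds instead of 20, and out-of-fold probabilities from cross_val_predict as the basis for the search. Note that choosing the threshold on the same predictions that are then scored gives a slightly optimistic F1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=3300, n_features=20, n_informative=5,
                           weights=[0.955], random_state=0)
rf = RandomForestClassifier(n_estimators=120, max_depth=10,
                            max_features="sqrt", bootstrap=False,
                            random_state=0, n_jobs=-1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probability of the minority class for every row.
proba = cross_val_predict(rf, X, y, cv=cv, method="predict_proba")[:, 1]

# precision and recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y, proba)
denom = precision[:-1] + recall[:-1]
f1 = np.divide(2 * precision[:-1] * recall[:-1], denom,
               out=np.zeros_like(denom), where=denom > 0)

best = int(np.argmax(f1))
best_threshold = thresholds[best]
f1_at_best = f1[best]
f1_at_half = f1_score(y, proba >= 0.5)

print(f"best threshold={best_threshold:.3f} F1={f1_at_best:.3f} "
      f"(F1 at 0.5: {f1_at_half:.3f})")
```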
"Did you try to visualize a couple of deep trees?"
I don't know how to visualize an inner tree (among more than 100!), I just use the 'feature_importances_' attribute of RandomForest after fitting the data.
The continuous variable always comes first, it is the most discriminant one.
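For what it's worth, the individual trees of a fitted forest are available in its estimators_ attribute, and sklearn.tree.export_text can print one of them as text. The sketch below uses synthetic data and made-up feature names (x0, x1, ...), so the printed splits are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

X, y = make_classification(n_samples=3300, n_features=20, n_informative=5,
                           weights=[0.955], random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=120, max_depth=10,
                            max_features="sqrt", bootstrap=False,
                            random_state=0).fit(X, y)

# Each fitted tree is a DecisionTreeClassifier stored in estimators_.
first_tree = rf.estimators_[0]
print(export_text(first_tree, feature_names=feature_names, max_depth=3))

# Depth and leaf count actually reached by each tree in the forest.
depths = [tree.get_depth() for tree in rf.estimators_]
leaves = [tree.get_n_leaves() for tree in rf.estimators_]
print("max depth reached:", max(depths), "| max leaves:", max(leaves))
```

Looking at the depth actually reached is a quick way to see whether raising max_depth changes anything: a tree stops growing once its leaves are pure, whatever the limit. sklearn.tree.plot_tree draws the same tree graphically if matplotlib is installed.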
"Why do you reduce the number of features?"
Two reasons:
- I just want to use relevant predictors. For example, if the mean difference between the two classes is not significant, I remove the predictor (~20 of them);
My choice is empirical, based on a ranking of the t-test values.
- I want to avoid overfitting, i.e. the risk of having results that look good but get worse when generalizing.
Two techniques to reduce overfitting: cross-validation of course, but also reducing the complexity of the model by reducing the number of features (lowest variance and lowest correlation with the target variable).
I've read that the number of observations per predictor must be greater than 10 (a rule of thumb), but I don't know whether this applies to the whole population or just to the minority class to detect.
So, 150 / 10 = 15 (I'm still keeping 20 predictors).
Other questions:
n_estimators / max_depth=20
Do you think I can use this process to determine the best values:
- For a fixed max_depth (=20), I'll plot the best F1 score (at a fixed threshold of 0.10) against n_estimators to see if it stabilizes around 200.
- For a fixed n_estimators (=200), I'll plot the best F1 score against max_depth (but with the best threshold, because it seems that the likelihood spreads out).
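The two sweeps proposed above might be sketched as below. Everything here is an assumption made for illustration: the helper cv_f1, the synthetic data, the 3 folds and the small grids. For simplicity both sweeps use the fixed 0.10 threshold; a per-setting best threshold could be swapped in for the max_depth sweep.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=3300, n_features=20, n_informative=5,
                           weights=[0.955], random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)


def cv_f1(threshold=0.10, **params):
    """Hypothetical helper: F1 of out-of-fold predictions at a fixed threshold."""
    rf = RandomForestClassifier(max_features="sqrt", bootstrap=False,
                                random_state=0, n_jobs=-1, **params)
    proba = cross_val_predict(rf, X, y, cv=cv, method="predict_proba")[:, 1]
    return f1_score(y, proba >= threshold)


# Sweep 1: fixed max_depth, varying n_estimators.
by_n = {n: cv_f1(n_estimators=n, max_depth=20) for n in (50, 100, 200)}
# Sweep 2: fixed n_estimators, varying max_depth.
by_depth = {d: cv_f1(n_estimators=200, max_depth=d) for d in (5, 10, 20)}

print("F1 by n_estimators:", {k: round(v, 3) for k, v in by_n.items()})
print("F1 by max_depth:   ", {k: round(v, 3) for k, v in by_depth.items()})
```

Repeating each setting with several values of random_state would show how much of the variation between settings is just noise, which matters with only 150 minority observations.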
Number of predictors
Do you think I can keep more than 20 predictors (for example 50) and then try to remove predictors based on 'feature_importances_' (attribute of RandomForest)?
I use scikit-learn / Python to do all this tricky (interesting) stuff.
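Ranking predictors after fitting on a larger set could be sketched like this. The data are synthetic (50 columns, of which the first 5 are informative by construction), and the choice of average_precision as the scorer for permutation importance is an assumption. The scikit-learn documentation notes that impurity-based importances can be misleading for features with many unique values, which may be one reason a single continuous predictor always ranks first; permutation importance on held-out data is shown as a cross-check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 5 informative columns at positions 0 to 4.
X, y = make_classification(n_samples=3300, n_features=50, n_informative=5,
                           n_redundant=0, weights=[0.955], shuffle=False,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                            random_state=0, n_jobs=-1).fit(X_train, y_train)

# Impurity-based ranking (what feature_importances_ reports).
impurity_rank = np.argsort(-rf.feature_importances_)
top20 = impurity_rank[:20]

# Permutation importance on held-out data, with a threshold-free scorer.
perm = permutation_importance(rf, X_test, y_test, scoring="average_precision",
                              n_repeats=3, random_state=0, n_jobs=-1)
perm_rank = np.argsort(-perm.importances_mean)

print("top 10 by impurity:   ", impurity_rank[:10])
print("top 10 by permutation:", perm_rank[:10])
```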