Hi everybody, here's a summary of my study, followed by a few questions on random forests.
Population: 3300 observations; the minority class has 150 observations (~4.5%)
Predictors: ~70, only 1 numerical; all the others are boolean
I use feature selection to reduce the number of predictors.
I remove the predictors with the lowest variance and the lowest correlation with my target variable; I also use a t-test (difference in means between the 2 classes).
I keep around 20 predictors for the 150 observations in my signal.
NB: I haven't yet used chi2 evaluation, entropy or RFE to reduce the number of predictors.
The metric I use is the F1 score, in order to catch the minority class (otherwise the model classifies most observations into the majority class).
It's really difficult to get stable results on a 60/40 or 50/50 train/test split during parameter tuning.
I use cross-validation with RandomForest, with 20 stratified folds, and I tune the parameters of the model.
For the following parameters: n_estimators=120, max_depth=10, max_features='sqrt', bootstrap=False
I found 50 TP, 100 FN, 2950 TN, 200 FP.
Summary: I catch 33% of the signal (minority class) and 94% of the majority class.
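As a quick check of the numbers above, here is a minimal sketch that derives recall, specificity, precision and F1 from the four confusion-matrix counts. The function name `confusion_metrics` is only illustrative; the counts are the ones quoted in the post.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return recall, specificity, precision and F1 for the positive (minority) class."""
    recall = tp / (tp + fn)            # share of the minority class that is caught
    specificity = tn / (tn + fp)       # share of the majority class that is caught
    precision = tp / (tp + fp)         # share of predicted positives that are real
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


metrics = confusion_metrics(tp=50, fn=100, tn=2950, fp=200)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# recall: 0.333
# specificity: 0.937
# precision: 0.200
# f1: 0.250
```

With these counts only 1 in 5 predicted positives is a true positive, so the F1 score for the minority class is 0.25 even though 94% of the majority class is classified correctly.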
My questions:
max_depth: higher values seem to give better results, but what does a depth of 30 mean with 20 predictors? It seems strange to me! I've read that a high depth can lead to overfitting. How can I manage this problem?
n_estimators: it's the same, higher values seem better. Can I trust the results if the value is higher than 300?
bootstrap: I sometimes get better results with bootstrap=True, but with only 150 observations in my minority class, bootstrapping (resampling with replacement) doesn't seem realistic to me.
class_weight: I haven't tried setting this parameter to "balanced" yet.
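For reference on the class_weight question, scikit-learn documents its "balanced" mode as n_samples / (n_classes * count of each class). The sketch below reproduces that formula in plain Python for the class sizes in this post. The helper name `balanced_class_weights` is made up for the illustration and is not part of any library.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# 150 minority observations (label 1) and 3150 majority observations (label 0)
labels = [1] * 150 + [0] * 3150
weights = balanced_class_weights(labels)
print(weights[1])            # 11.0
print(round(weights[0], 4))  # 0.5238
```

Under this weighting, an error on a minority observation counts 21 times as much as an error on a majority one (11.0 / 0.5238), which is the majority/minority ratio 3150/150.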
I'm sorry for my English; I hope people can help me improve my results.
Your data set is a bit small. The classic solution is to over-sample under-represented classes. I've been doing it routinely but on data sets with 50+ million observations, where the class "fraud" (versus "non fraud") represented only 4 out of 10,000 observations. If you can get a much bigger data set, that would help.
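A minimal sketch of random over-sampling, assuming a binary problem and plain Python lists; the function name `random_oversample` and the toy data are illustrative only. It duplicates randomly chosen minority rows until both classes have the same size. To avoid copies of the same observation appearing on both sides of a split, this should be applied to the training part only, never before the train/test split or across cross-validation folds.

```python
import random


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]  # sampled with replacement
    indices = majority + minority + extra
    rng.shuffle(indices)
    return [rows[i] for i in indices], [labels[i] for i in indices]


# Toy data with the class sizes from the question: 150 positives, 3150 negatives
rows = [[i] for i in range(3300)]
labels = [1] * 150 + [0] * 3150
rows_over, labels_over = random_oversample(rows, labels, minority_label=1)
print(len(labels_over), labels_over.count(1), labels_over.count(0))
# 6300 3150 3150
```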
Also, with such a small yet unbalanced data set, I would use less than 5 predictors. Or maybe use various ratios of these predictors, which may carry more information than the original features. Anyway, cross-validation should help you test how well past data predict future outcomes, if you can break your training set into two parts: past (the control group) and future (the test group).
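The past/future split can be sketched as follows, assuming each observation carries some timestamp. The `date` field and the function name `chronological_split` are hypothetical, since the thread does not say whether the data has a time column.

```python
def chronological_split(records, time_key, train_fraction=0.6):
    """Sort records by time and cut once: earlier part = past, later part = future."""
    ordered = sorted(records, key=lambda r: r[time_key])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy records with a hypothetical integer "date" field, deliberately out of order
records = [{"date": d, "label": d % 2} for d in (7, 2, 9, 4, 1, 8, 3, 10, 6, 5)]
past, future = chronological_split(records, time_key="date", train_fraction=0.6)
print([r["date"] for r in past])    # [1, 2, 3, 4, 5, 6]
print([r["date"] for r in future])  # [7, 8, 9, 10]
```

The model is fitted on `past` and evaluated on `future`, so no information from later observations is used to predict earlier ones.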
I understand about over-sampling; I'll see what I can do.
Why did you say "I would use less than 5 predictors"? Is it in relation to the 150 observations in the minority class?
Let's say 30 observations per predictor? I read a long time ago that one can keep one predictor per 10 observations (in my case, 15 predictors).
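The arithmetic behind these two rules of thumb, as a small sketch; the function name `max_predictors` is illustrative, and the thresholds of 10 and 30 minority observations per predictor are the ones mentioned in this thread, not fixed rules.

```python
def max_predictors(minority_count, observations_per_predictor):
    """Largest number of predictors allowed by an observations-per-predictor rule."""
    return minority_count // observations_per_predictor


for per_predictor in (10, 30):
    print(per_predictor, max_predictors(150, per_predictor))
# 10 15
# 30 5
```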
Do you have any rules about the depth of trees in a random forest for my use case? And about the number of trees?
I've read that using deep trees (20, 30, 50, etc.) can lead to overfitting. A data scientist explained to me that cross-validation with RandomForest can avoid this risk, so I could use deeper trees, but I don't know what the limit is.
I'll work on gathering more data/information about my observations.
Yes, 10 predictors are OK, but the data set seems a bit small, so the risk of over-fitting is higher than with (say) 50,000 observations.