
How to check/optimize cross-validation with random forest on imbalanced classes?

Hi everybody, here's a summary of my study, followed by a few questions on random forest.

Population: 3,300 observations, with a minority class of 150 observations (~4%)

Predictors: ~70, just 1 numerical; all the others are boolean

I use feature selection to reduce the number of predictors.

I remove predictors with the lowest variance and the lowest correlation with my target variable, and I also use a t-test (mean difference between the 2 classes).

I keep around 20 predictors for the 150 observations in my signal.

NB: I haven't yet used chi2 evaluation, entropy, or RFE to reduce the number of predictors.
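A minimal sketch of those filtering steps (variance, correlation with the target, and a two-sample t-test), assuming a pandas DataFrame X of predictors and a binary target y; the function name and thresholds are only illustrative:

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.feature_selection import VarianceThreshold

def filter_predictors(X: pd.DataFrame, y: pd.Series,
                      var_min=0.01, corr_min=0.05, p_max=0.05):
    # 1) drop near-constant predictors
    keep = X.columns[VarianceThreshold(threshold=var_min).fit(X).get_support()]
    # 2) drop predictors whose absolute correlation with the target is very low
    corr = X[keep].apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    keep = corr[corr >= corr_min].index
    # 3) keep predictors whose class means differ significantly (Welch t-test)
    pvals = {c: stats.ttest_ind(X.loc[y == 1, c], X.loc[y == 0, c],
                                equal_var=False).pvalue
             for c in keep}
    return [c for c, p in pvals.items() if p <= p_max]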

The metric I use is F1 score, in order to catch the minority class (otherwise the model classifies most observations as the majority class).

It's really difficult to get stable results on 60/40 or 50/50 train/test splits during parameter tuning.

I use cross-validation with RandomForest, with 20 stratified folds, and I tune the parameters of the model.

For the following parameters: n_estimators=120, max_depth=10, max_features='sqrt', bootstrap=False

I found 50 TP, 100 FN, 2,950 TN, 200 FP.

Summary: I catch 33% of the signal (minority class) and 94% of the majority class.
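A hedged sketch of that evaluation setup (20 stratified folds, the parameters quoted above, F1 on the minority class, and a pooled confusion matrix), assuming X and y are the filtered predictors and the binary target:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import f1_score, confusion_matrix

rf = RandomForestClassifier(n_estimators=120, max_depth=10,
                            max_features="sqrt", bootstrap=False,
                            random_state=0)
cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)

# out-of-fold predictions: every observation is predicted exactly once
y_pred = cross_val_predict(rf, X, y, cv=cv)
print("F1 (minority class):", f1_score(y, y_pred))
print(confusion_matrix(y, y_pred))  # rows = true class, columns = predicted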

My questions:

max_depth: it seems that higher values work better, but what does a depth of 30 mean with only 20 predictors? That seems strange to me! I've read that high depth can lead to overfitting. How can I manage this problem?

For n_estimators it's the same: higher values seem better. Can I trust the results if the value is higher than 300?

bootstrap: I sometimes get better results with bootstrap=True, but with only 150 observations in my minority class, bootstrapping (resampling with replacement) doesn't seem realistic to me.

I haven't yet tried setting the parameter class_weight to 'balanced'.
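One way to explore these parameter questions empirically is a grid search with F1 scoring over stratified folds; a minimal sketch, assuming the same X and y as above (the grid values are only examples):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [120, 300, 600],
    "max_depth": [5, 10, 20, None],
    "bootstrap": [True, False],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(
    RandomForestClassifier(max_features="sqrt", random_state=0),
    param_grid,
    scoring="f1",  # favours catching the minority class
    cv=StratifiedKFold(n_splits=20, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))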

I'm sorry for my English; I hope people can help me improve my results.


Replies to This Discussion

Your data set is a bit small. The classic solution is to over-sample under-represented classes. I've been doing it routinely but on data sets with 50+ million observations, where the class "fraud" (versus "non fraud") represented only 4 out of 10,000 observations. If you can get a much bigger data set, that would help.
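A minimal sketch of that over-sampling idea using plain resampling with replacement (imbalanced-learn's SMOTE is another common option); the function and variable names are illustrative, and over-sampling should be applied to the training fold only, never the test fold:

import pandas as pd
from sklearn.utils import resample

def oversample_minority(X_train: pd.DataFrame, y_train: pd.Series, seed=0):
    X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
    X_maj, y_maj = X_train[y_train == 0], y_train[y_train == 0]
    # resample the minority class with replacement up to the majority size
    X_up, y_up = resample(X_min, y_min, replace=True,
                          n_samples=len(X_maj), random_state=seed)
    return pd.concat([X_maj, X_up]), pd.concat([y_maj, y_up])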

Also, with such a small yet unbalanced data set, I would use fewer than 5 predictors. Or maybe use various ratios of these predictors, which may carry more information than the original features. Anyway, cross-validation should help you test how well past data predicts future outcomes, if you can break your training set into two parts: past (the control group) and future (the test group).
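A small sketch of that past/future split, assuming the observations carry some time stamp (the column name "date" is hypothetical):

import pandas as pd

def past_future_split(df: pd.DataFrame, date_col: str = "date", frac_past: float = 0.7):
    # train on earlier observations ("past"), test on later ones ("future")
    df = df.sort_values(date_col)
    cut = int(len(df) * frac_past)
    return df.iloc[:cut], df.iloc[cut:]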

Thanks, Vincent,

I understand about over-sampling; I'll see what to do.

Why did you say "I would use fewer than 5 predictors"? Is that in relation to the 150 observations in my minority class?

Let's say 30 observations per predictor? I read a long time ago that one can keep one predictor per 10 observations (in my case, 15 predictors).

Do you have any rules on the depth of trees in a random forest for my use case? Also on the number of trees?

I've read that using deep trees (depth 20, 30, 50, etc.) can lead to overfitting. A data scientist explained to me that cross-validation with random forest can avoid this risk, so I can use deeper trees, but I don't know the limit.

I'll work on aggregating more data/information about my observations.

Fabrice

Yes, 10 predictors are OK, but the data set seems a bit small, so the risk of over-fitting is higher than with (say) 50,000 observations.
