How oversampling yielded great results for classifying cases of Sexual Harassment.

When it comes to data science, sexual harassment is an imbalanced data problem, meaning there are few (known) instances of harassment in the entire dataset.

An imbalanced problem is defined as a dataset which has disproportional class counts. Oversampling is one way to combat this by creating synthetic minority samples.

SMOTE — Synthetic Minority Over-sampling Technique — is a common oversampling method widely used in machine learning with imbalanced high-dimensional datasets using Oversampling. The SMOTE technique generates randomly new examples or instances of the minority class from the nearest neighbors of a line joining the minority class sample to increase the number of instances. SMOTE creates synthetic minority samples using the popular K nearest neighbor algorithm.

K nearest neighbors draw a line between the minority points and generate points in the middle of the line. It is a technique that was experimented on, nowadays one can find many different versions of SMOTE which build upon the classic formula. Let’s visualize how oversampling effects the data in general.

For visualization’s sake, two features are picked and from their distribution, it’s clearly seen that the minority samples match the majority sample count.

Let’s compare the predictive power of oversampling vs. not oversampling. Random Forest is used as the predictor in both cases. The ProWSyn version of oversampling is selected as the highest performing oversampling method after all the methods are compared using this Python package.

Let’s check the performance of models pre and post oversampling.

With ProWSyn oversampling implemented, we can see a 13% increase in the ROCAUC score, which is the Area Under the Receiver Operating Characteristics curve, from 84% to 97%. I was also able to decrease the Brier Score, which is a metric for probability prediction, by 5%.

As you can see from the results, oversampling can significantly boost your model performance when you have to deal with imbalanced datasets using oversampling. In my case, the ProWSyn version of SMOTE performed the best but this depends always on the data and you should try different versions to see which one works the best for you.

Most Data Science: oversampling methods lack a proper process of assigning correct weights for minority samples, in this case regarding the classification of Sexual Harassment cases. This results in a poor distribution of generated synthetic samples. Proximity Weighted Synthetic Oversampling Technique (ProWSyn) generates effective weight values for the minority data samples based on the sample’s proximity information, i.e., distance from the boundary which results in proper distribution of generated synthetic samples across the minority data set.

After the prediction, the histogram of predicted probabilities looks like the image above. The distribution turned out the be the way I imagined. The model has learned from the many features and it turns out there is a correlation within the feature space which at the end creates such a distinct difference between classes 0 and 1. In simpler terms, there is a pattern within 0 and 1 classes’ features.

More care has to be put into probabilities really close to 1 (100% probability). From the histogram plot above, we can see that the number of points near 100% probability is quite high. It is normal to dismiss someone as a non-predatory but much harder to accuse someone, therefore that number should be lower.

Original post can be viewed here.

© 2020 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Upcoming DSC Webinar**

- Data Science Leadership Exchange: Best Practices for Driving Outcomes

Despite an increasing awareness of the role data science plays in successful business outcomes, data science leaders still struggle to organize, implement and communicate effective data science initiatives.

Join this latest DSC webinar and gain advice on optimizing your data management strategies. Some of the industry’s best and brightest from Bayer, S&P Global and Transamerica will be presenting their insights and experiences. Register today.

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Upcoming DSC Webinar**

- Data Science Leadership Exchange: Best Practices for Driving Outcomes

Despite an increasing awareness of the role data science plays in successful business outcomes, data science leaders still struggle to organize, implement and communicate effective data science initiatives.

Join this latest DSC webinar and gain advice on optimizing your data management strategies. Some of the industry’s best and brightest from Bayer, S&P Global and Transamerica will be presenting their insights and experiences. Register today.

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central