Here's an interesting project for future data scientists: design an algorithm that will:

- Correctly identify the hidden code 90% of the time
- Make recommendations to improve captcha systems

*Source: Microsoft's CAPTCHA successfully broken*

Reverse-engineering a captcha system requires six steps:

- **Collect** a large number of images so that you have at least 20 representations of each character. You'll probably need to gather more than 1,000 captcha images.
- **Filter** noise in each image. A simple filter could work as follows: (1) replace each pixel with the median color among the neighboring pixels and (2) reduce color depth from 24-bit to 8-bit. Typically, you want to use filters that remove isolated specks and enhance brightness and contrast.
- Perform **image segmentation** to identify contours, binarize the image (reduce depth from 8-bit to 1-bit, that is, to black and white), vectorize the image, and simplify the vector structure (a list of nodes and edges saved as a graph structure) by re-attaching segments (edges) that appear to be broken.
- **Unsupervised clustering** step. Extract each connected component from the previous segmentation: each of them should represent a character. Hopefully, you've collected more than a thousand sample characters, with multiple versions of each character of the alphabet. Now have a human being attach a label (a letter) to each of these connected components. At this point you have decoded all the captchas in your training set, hopefully with a 90% success rate or better.
- **Machine learning** step. Continue to harvest captchas every day, apply the previous steps, and add new versions of each character to your training set. Your training set gets bigger and better every day. Identify pairs of characters that are difficult to distinguish from each other, and remove the sample characters that cause confusion from the training set.
- **Use your captcha decoder.** For each captcha, extract the characters using the filtering and segmentation steps. Then perform **supervised clustering** (that is, classification) to identify which symbol each character represents, based on your training set. This operation should take less than one minute per captcha.
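The filtering, binarization, and component-extraction steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not a production decoder: the image is assumed to be already reduced to 8-bit grayscale and stored as a list of rows, the threshold of 128 is arbitrary, and the function names (`median_filter`, `binarize`, `connected_components`) are hypothetical. Vectorization and re-attaching broken segments are not covered.

```python
from collections import deque
from statistics import median_high


def median_filter(img):
    """Replace each pixel by the median of its 3x3 neighborhood.

    Pixels on the border use whatever neighbors exist. median_high avoids
    averaging two middle values, so the result is always a real pixel value.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median_high(window)
    return out


def binarize(img, threshold=128):
    """Reduce 8-bit grayscale to 1-bit: 1 = ink (dark), 0 = background."""
    return [[1 if p < threshold else 0 for p in row] for row in img]


def connected_components(binary):
    """Return one sorted list of (y, x) ink pixels per 8-connected component."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(sorted(pixels))
    return components


# Toy image: two 3x3 "characters" and one isolated speck between them.
WHITE, BLACK = 255, 0
image = [[WHITE] * 13 for _ in range(7)]
for y in range(2, 5):
    for x0 in (2, 8):
        for x in range(x0, x0 + 3):
            image[y][x] = BLACK
image[3][6] = BLACK  # noise speck

raw = connected_components(binarize(image))
clean = connected_components(binarize(median_filter(image)))
print(len(raw), len(clean))  # 3 2
```

Without filtering, the speck is extracted as a third "character"; after the median filter only the two real blobs remain. The filter also rounds off the corners of each blob, which is the usual price of this kind of smoothing.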

Your universal captcha decoder will probably work well with some types of captchas (blurred letters), and maybe not as well with others (where letters are criss-crossed by a network of random lines).
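The final decoding step, matching an extracted character against the labeled training set, could look like the sketch below. It swaps in a plain 1-nearest-neighbor classifier with Hamming distance, which is one simple choice and not necessarily what a real decoder would use. The 3x3 glyphs, the `training_set` contents, and the function names are made up for illustration, and all glyphs are assumed to be already scaled to the same size.

```python
def hamming(a, b):
    """Count the pixels that differ between two equally sized binary glyphs."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(glyph, training_set):
    """Return the label of the training glyph closest to `glyph`."""
    label, _ = min(training_set, key=lambda item: hamming(glyph, item[1]))
    return label


# Hypothetical labeled training set: (label, glyph) pairs, '1' = ink.
training_set = [
    ("I", ["010", "010", "010"]),
    ("L", ["100", "100", "111"]),
    ("T", ["111", "010", "010"]),
]

# An "I" with one extra noisy pixel in the bottom-right corner.
noisy = ["010", "010", "011"]
print(classify(noisy, training_set))  # I
```

A glyph whose two nearest labels are almost equally close is a natural candidate for the "difficult pairs" review mentioned in the machine learning step.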

Note that some attackers have designed technology that bypasses the captcha entirely. Their system does not even "read" the image, yet it gets the right answer each time: it accesses the server at a deeper level, reads what the correct answer should be, and feeds that answer into the web form. We've seen spam technology that can bypass the most challenging questions in sign-up forms, such as factoring a product of two very large primes (more than 2,000 digits each), in less than one second. Of course, these systems don't extract the prime factors; instead, they read the correct answer straight out of compromised servers, JavaScript code, or web pages.

Anyway, this exercise will teach you a bit about image processing and clustering. At the end, you should be able to identify features that would make captchas more robust:

- Use broken letters, e.g., a letter C split into three or four separate pieces.
- Use multiple captcha algorithms, and change the algorithm each day.
- Use special characters in captchas (parentheses, commas).
- Create holes in letters.
- Encode two-letter combinations (e.g., ab, ac, ba, ...) rather than isolated letters. The attacker then has to decode hundreds of possible symbols (26 × 26 = 676 for letters alone) rather than just 26 or 36, and thus needs a much bigger sample.
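
To see how quickly the attacker's workload grows with the last suggestion, here is a small back-of-the-envelope calculation. It reuses the figure of 20 samples per symbol from the collection step; the function name `samples_needed` is hypothetical, and the numbers are a rough lower bound rather than a measured requirement.

```python
def samples_needed(alphabet_size, group_len, samples_per_symbol=20):
    """Return (number of distinct symbols, minimum labeled samples to collect)."""
    symbols = alphabet_size ** group_len
    return symbols, symbols * samples_per_symbol


print(samples_needed(26, 1))  # (26, 520)
print(samples_needed(26, 2))  # (676, 13520)
print(samples_needed(36, 2))  # (1296, 25920)
```

Moving from single letters to letter pairs multiplies the minimum training set by 26, and by 36 if digits are included.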
