Here's an interesting project for future data scientists: designing an algorithm that will
- Correctly identify the hidden code at least 90% of the time
- Make recommendations to improve captcha systems
Source: Microsoft's CAPTCHA successfully broken
Reverse-engineering a captcha system requires six steps:
- Collect a large number of images so that you have at least 20 representations of each character. You'll probably need to gather more than 1,000 captcha images.
- Filter noise in each image. A simple filter could work as follows: (1) replace each pixel with the median color of its neighboring pixels and (2) reduce color depth from 24-bit to 8-bit. Typically, you want to use filters that remove isolated specks and enhance brightness and contrast.
- Perform image segmentation to identify contours, binarize the image (reduce depth from 8-bit to 1-bit, that is, to black and white), vectorize the image, and simplify the vector structure (a list of nodes and edges saved as a graph structure) by re-attaching segments (edges) that appear to be broken.
- Unsupervised clustering step. Extract each connected component from the previous segmentation; each one should represent a character. Hopefully, you've collected more than a thousand sample characters, with multiple versions of each character of the alphabet. Now have a human attach a label (a letter) to each of these connected components. You have now decoded all the captchas in your training set, hopefully with a 90% success rate or better.
- Machine learning step. Continue to harvest captchas every day, apply the previous steps, and add new versions of each character to your training set. Your training set gets bigger and better every day. Identify pairs of characters that are difficult to distinguish from each other, and remove the sample characters that cause confusion from the training set.
- Use your captcha decoder. For each captcha, extract the characters using steps 2 and 3. Then perform supervised clustering (that is, classification) to identify which symbol each one represents, based on your training set. This operation should take less than one minute per captcha.
Your universal captcha decoder will probably work well with some types of captchas (blurred letters), and maybe not as well with others (where letters are criss-crossed by a network of random lines).
Anyway, this interesting exercise will teach you a bit about image processing and clustering. At the end, you should be able to identify features that would make captchas more robust:
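As a rough illustration of the final decoding step, matching an extracted character against the labeled training set could be as simple as a nearest-neighbor lookup. The sketch below is one possible approach, not the method used in any actual attack: it assumes each character is a list of (y, x) ink pixels, rescales it to a small fixed grid, and picks the training example with the smallest Hamming distance. The names to_grid, hamming, and classify and the 8x8 grid size are hypothetical choices for the example.

```python
def to_grid(component, size=8):
    """Rescale a character's bounding box to a size x size binary grid
    using nearest-neighbor sampling, so position and scale no longer matter."""
    pixels = set(component)
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    y0, x0 = min(ys), min(xs)
    h = max(ys) - y0 + 1
    w = max(xs) - x0 + 1
    return tuple(
        tuple(1 if (y0 + gy * h // size, x0 + gx * w // size) in pixels else 0
              for gx in range(size))
        for gy in range(size)
    )


def hamming(a, b):
    """Count the grid cells where two equally sized grids differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(component, training_set, size=8):
    """Return the label of the closest training grid.

    training_set is a list of (label, grid) pairs built with to_grid.
    """
    grid = to_grid(component, size)
    label, _ = min(training_set, key=lambda item: hamming(grid, item[1]))
    return label


# Two hand-labeled toy characters drawn in a 4x4 box.
letter_l = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
letter_t = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 1), (2, 1), (3, 1)]
training_set = [("L", to_grid(letter_l)), ("T", to_grid(letter_t))]

# The same L drawn twice as large and shifted elsewhere in the image.
big_l = [(2 * y + i + 10, 2 * x + j + 20)
         for y, x in letter_l for i in range(2) for j in range(2)]
print(classify(big_l, training_set))  # L
```

Counting which labels get confused with each other on held-out samples would be one way to find the hard-to-distinguish pairs mentioned in the machine learning step.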
- Use broken letters, e.g. a letter C split into 3 or 4 separate pieces
- Use multiple captcha algorithms, and change the algorithm each day
- Use special characters in captchas (parentheses, commas)
- Create holes in letters
- Encode 2-letter combinations (e.g. ab, ac, ba ...) rather than isolated letters: the attacker will then have to decode hundreds of possible symbols (26 × 26 = 676 ordered pairs of letters), rather than just 26 or 36, and thus will need a much bigger sample.
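To put rough numbers on the last recommendation, the snippet below counts the symbols an attacker would need to learn and the labeled samples this implies. It is back-of-the-envelope arithmetic under two assumptions: every ordered pair is allowed (including doubled letters such as aa), and the attacker wants the 20 representations per symbol suggested in the collection step. The function name training_size is made up for the example.

```python
MIN_SAMPLES_PER_SYMBOL = 20  # figure taken from the collection step above


def training_size(alphabet_size, group_length=1,
                  per_symbol=MIN_SAMPLES_PER_SYMBOL):
    """Return (number of distinct symbols, labeled samples needed)."""
    symbols = alphabet_size ** group_length
    return symbols, symbols * per_symbol


for alphabet_size in (26, 36):
    for group_length in (1, 2):
        symbols, samples = training_size(alphabet_size, group_length)
        print(alphabet_size, group_length, symbols, samples)
# 26 1 26 520
# 26 2 676 13520
# 36 1 36 720
# 36 2 1296 25920
```

Under these assumptions, moving from single letters to pairs multiplies the attacker's labeling effort by the size of the alphabet.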