Here's an interesting project for future data scientists: designing an algorithm that will

  • Correctly identify the hidden code 90% of the time
  • Make recommendations to improve captcha systems

Source: Microsoft's CAPTCHA successfully broken

Reverse-engineering a captcha system requires six steps:

  1. Collect a large number of images so that you have at least 20 representations of each character. You'll probably need to gather more than 1,000 captcha images.
  2. Filter noise in each image. A simple filter could work as follows: (1) replace each pixel by the median color among the neighboring pixels and (2) reduce color depth from 24-bit to 8-bit. Typically, you want to use filters that remove isolated specks and enhance brightness and contrast.
  3. Perform image segmentation to identify contours, binarize the image (reduce depth from 8-bit to 1-bit, that is, to black and white), vectorize the image, and simplify the vector structure (a list of nodes and edges saved as a graph structure) by re-attaching segments (edges) that appear to be broken.
  4. Extraction and labeling step. Extract each connected component from the previous segmentation: each of them should represent a character. Hopefully, you've collected more than a thousand sample characters, with multiple versions of each character of the alphabet. Now have a human being attach a label (the letter it represents) to each of these connected components. At this point you have decoded all the captchas in your training set, hopefully with a 90% success rate or better.
  5. Machine learning step. Continue to harvest captchas every day, apply the previous steps, and add new versions of each character to your training set. Your training set gets bigger and better every day. Identify pairs of characters that are difficult to distinguish from each other, and remove sample characters that cause confusion from the training set.
  6. Use your captcha decoder. For each captcha, extract the characters using steps 2 and 3. Then perform supervised classification to identify which symbol each one represents, based on your training set. This operation should take less than one minute per captcha.
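
As a rough illustration of steps 2 to 4, here is a minimal sketch in plain Python that treats an image as a list of rows of 8-bit gray values. The function names, the 3x3 median window, the threshold of 128, and the use of 4-connectivity are all assumptions made for this example, not part of the method described above. Vectorization and the re-attachment of broken segments are not covered.

```python
from collections import deque

def median_filter(img):
    """Replace each pixel by the median of its 3x3 neighborhood (clipped at borders)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = sorted(img[j][i]
                            for j in range(max(0, y - 1), min(h, y + 2))
                            for i in range(max(0, x - 1), min(w, x + 2)))
            out[y][x] = window[len(window) // 2]
    return out

def binarize(img, threshold=128):
    """Reduce 8-bit gray to 1-bit: 1 = ink (dark pixel), 0 = background."""
    return [[1 if p < threshold else 0 for p in row] for row in img]

def connected_components(bw):
    """Return the ink components as lists of (y, x) pixels, ordered left to right."""
    h, w = len(bw), len(bw[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if not bw[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue, comp = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                comp.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and bw[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(comp)
    components.sort(key=lambda c: min(px for _, px in c))
    return components
```

Calling `connected_components(binarize(median_filter(img)))` yields one pixel list per candidate character, which is what a human would then label in step 4. Note that a median filter also erodes the corners of thin strokes, so the window size is a trade-off between removing specks and preserving letter shapes.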

Your universal captcha decoder will probably work well with some types of captchas (blurred letters), and maybe not as well with others (where letters are criss-crossed by a network of random lines).
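
To make the decoding in step 6 more concrete, the following sketch classifies an extracted character by nearest-neighbor matching against labeled samples. This is only one possible choice of classifier, swapped in here for illustration; the 8x8 grid size, the Hamming distance, and the names `to_grid`, `hamming`, and `classify` are assumptions, not part of the original description.

```python
def to_grid(component, size=8):
    """Rescale a component (list of (y, x) pixels) onto a flat size x size binary grid."""
    ys = [y for y, _ in component]
    xs = [x for _, x in component]
    y0, x0 = min(ys), min(xs)
    height = max(ys) - y0 + 1
    width = max(xs) - x0 + 1
    grid = [0] * (size * size)
    for y, x in component:
        gy = (y - y0) * size // height
        gx = (x - x0) * size // width
        grid[gy * size + gx] = 1
    return grid

def hamming(a, b):
    """Number of grid cells in which two binary grids differ."""
    return sum(p != q for p, q in zip(a, b))

def classify(component, training):
    """Return the label of the closest training sample.

    `training` is a list of (label, grid) pairs built from human-labeled characters.
    """
    grid = to_grid(component)
    label, _ = min(training, key=lambda sample: hamming(grid, sample[1]))
    return label
```

A distance like this tolerates changes in size and position, but it is likely to degrade when random lines cross the letters, since those lines end up merged into the extracted components.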

Note that some attackers have designed technology that bypasses the captcha entirely. Their system does not even "read" the image, yet it gets the right answer each time: it accesses the server at a deeper level, reads what the correct answer should be, and feeds that answer into the web form. We've seen spam technology that can bypass the most challenging questions in sign-up forms, such as factoring a product of two very large primes (more than 2,000 digits each), in less than one second. Of course, the attackers don't extract the prime factors; instead, they read the correct answer straight out of compromised servers, JavaScript code, or web pages.

Anyway, this interesting exercise will teach you a bit about image processing and clustering. At the end, you should be able to identify features that would make captchas more robust:

  • Use broken letters, e.g. a letter C split into 3 or 4 separate pieces
  • Use multiple captcha algorithms, and change the algorithm each day
  • Use special characters in captchas (parentheses, commas)
  • Create holes in letters
  • Encode 2-letter combinations (e.g. ab, ac, ba ...) rather than isolated letters: the attacker will then have to decode 676 or 1,296 possible symbols (26 × 26 or 36 × 36), rather than just 26 or 36, and thus will need a much bigger sample.
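
The effect of the last recommendation on the attacker's workload can be estimated with a short calculation. The sketch below assumes the attacker still wants at least 20 samples per symbol, as in step 1; the function names are hypothetical.

```python
def symbol_count(alphabet_size, group_len):
    """Number of distinct symbols when characters are encoded in groups of group_len."""
    return alphabet_size ** group_len

def min_labeled_samples(alphabet_size, group_len, samples_per_symbol=20):
    """Lower bound on labeled samples needed to see every symbol samples_per_symbol times."""
    return symbol_count(alphabet_size, group_len) * samples_per_symbol

for alphabet_size in (26, 36):
    for group_len in (1, 2):
        print(alphabet_size, group_len,
              symbol_count(alphabet_size, group_len),
              min_labeled_samples(alphabet_size, group_len))
```

With 26 letters, moving from single letters to pairs raises the lower bound from 520 to 13,520 labeled samples, and this assumes every symbol shows up equally often in the harvested captchas.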


Comment by Dr. Z on November 4, 2013 at 6:21am

Very interesting. For a moment I thought this was going to be an article about hacking, but it also entails counter-hacking strategies.
