To some, a web site like Craigslist asking you to verify that you are indeed a human by retyping distorted, nonsensical words is irritating. But the next time you do it, you could be helping to fill in some historical blanks.

NPR ran a story yesterday on Luis von Ahn, assistant professor of computer science at Carnegie Mellon University and one of the guys who helped develop the CAPTCHA technology. The short version: Efforts to digitize (really) old books and newspapers were being hampered by faded ink that confounded OCR software. The solution von Ahn came up with was to take the words that the software couldn’t recognize, insert them into these so-called reCAPTCHAs, and use the power of human brains to decipher them. Each reCAPTCHA serves up two words: one is the security word, and the other goes toward the book digitization effort. It sounded interesting, so I called von Ahn to find out more.

Here’s how it works. The New York Times is working to digitize all of its issues starting way back in 1851. It starts by scanning every single page as an image. That’s where reCAPTCHA comes in. It runs two optical character recognition (OCR) programs to turn all of those images of pages into text. Different OCR programs tend to make different mistakes. When the two programs disagree on a word, that word is plucked out and distributed among CAPTCHA security programs spread out across 45,000 web sites like Craigslist and TicketMaster.
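To make the "two programs disagree" step concrete, here is a minimal sketch in Python. It is not reCAPTCHA's actual code: the function name `find_disputed_words` is made up for illustration, and it assumes both OCR programs return the same number of whitespace-separated words for a line, which real OCR output often won't.

```python
def find_disputed_words(ocr_a, ocr_b):
    """Return (position, word_a, word_b) for every word the two OCR outputs disagree on.

    Assumption: both outputs split into the same number of words, so they can
    be compared position by position. Real OCR output would need alignment first.
    """
    words_a = ocr_a.split()
    words_b = ocr_b.split()
    if len(words_a) != len(words_b):
        raise ValueError("OCR outputs have different word counts; alignment needed")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(words_a, words_b))
        if a != b
    ]


# Hypothetical outputs from two OCR programs reading the same faded line.
disputed = find_disputed_words("the quick brown fox", "the qu1ck brown fax")
print(disputed)
```

Words the two programs agree on are kept as-is; only the disputed ones would be handed off to humans.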

Human beings then look at the words as part of the CAPTCHA security measure and do the deciphering by retyping what they think the mangled word is. Depending on the word, as few as two or three people agreeing on a reading is enough to settle it. The word is then sent back to the New York Times to be reinserted into the text version of the image.
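The agreement step might look something like the sketch below. Again, this is an illustration rather than how reCAPTCHA really does it: `resolve_word` and its `min_agreement` parameter are hypothetical names, the threshold of three is just one reading of "two or three people," and lowercasing the guesses is a simplification that throws away capitalization.

```python
from collections import Counter


def resolve_word(guesses, min_agreement=3):
    """Return the most common guess if enough people agree on it, else None.

    Assumption: guesses are compared case-insensitively after trimming
    whitespace. min_agreement is a hypothetical threshold.
    """
    normalized = [g.strip().lower() for g in guesses if g.strip()]
    if not normalized:
        return None
    word, count = Counter(normalized).most_common(1)[0]
    return word if count >= min_agreement else None


# Three of four people read the smudged word the same way.
print(resolve_word(["morning", "Morning", "mourning", "morning"]))
# Not enough agreement yet, so the word would go back out to more people.
print(resolve_word(["cat", "kat"]))
```

A word that comes back as `None` simply hasn't been settled yet and would need more human eyes before being sent back.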

Initially, this project was part of Carnegie Mellon, but von Ahn said that they are spinning out reCAPTCHA as its own company. While The New York Times is paying to use the service, reCAPTCHA is also doing work free of charge for the Internet Archive’s project to digitize every book published before 1980.

But von Ahn is looking beyond just re-typing words as security measures. He says that his team has tried showing images and having people type what they see. The problem, von Ahn says, is that people don’t spell very well, so even though the image is of a “cat,” people might type “kat” and fail the test. ReCAPTCHA is also expanding into audio, using the audio version of CAPTCHAs to have people listen to and decipher words from garbled old recordings or closed captioning transcriptions.

The idea of taking a necessary evil like spam prevention and turning it into something useful is a good one. Who knew selling my old digital camera on Craigslist was actually an act of historical preservation?



@Ranjit only the first of the two words is a security word – the second word is not a security word – there is no predefined definitive answer there – just the one that most people think is the right interpretation – this has some problems IMO – the OCR could be garbled enough that a majority of the people interpret it as something other than what it is meant to be.

This is a wee bit curious.

When you “decode” and submit a captcha, it is compared against an authoritative string to determine if the data was properly entered.

If this is an illegible captcha, how is that authoritative string generated in the first place? Is there a set of admissible values? Doesn’t that reduce the security value of the captcha?

I have a no-brainer solution to the misspelling. They’ve probably thought of it already, but why don’t they take all the user inputs for a single term/image scan and batch them up? Then, based on the percentage of duplicates, they can determine what the greater number of people see and type, and thus get higher accuracy on what is actually displayed to them.

