ReCaptcha: digitizing books word by word

28 07 2009

Raise your hand if you are familiar with the work of Luis von Ahn, Assistant Professor of Computer Science at Carnegie Mellon University. Luis is one of the masterminds behind ‘Captcha’, the program that protects websites from automated spamming. Captcha is nerdspeak for ‘Completely Automated Public Turing Test To Tell Computers and Humans Apart’ and usually takes the form of an image of distorted letters that you have to type out.

Luis has recently come up with ReCaptcha, a two-birds-with-one-stone program that not only stops spam but also helps to archive human knowledge by digitizing “books, newspapers and old radio shows”. Oh, the brilliance. Here’s how it works. ReCaptcha takes words that Optical Character Recognition (OCR, the standard way of turning scanned print into digital text) cannot read correctly and uses them as Captchas.

Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
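For the nerds among us, here is a minimal Python sketch of that two-word trick. It is only my own illustration of the idea described above, not ReCaptcha's actual code: the names (`UnknownWord`, `submit`, `resolved`) and the thresholds (at least 3 votes, 75% agreement) are assumptions I made up for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UnknownWord:
    """A word image that OCR could not read, plus the guesses collected so far."""
    image_id: str
    votes: Counter = field(default_factory=Counter)


def normalize(text: str) -> str:
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


def submit(control_answer: str, control_guess: str,
           unknown: UnknownWord, unknown_guess: str) -> bool:
    """Check one challenge. The user passes only if the known word is right.

    A passing user's reading of the unknown word is recorded as one vote.
    """
    if normalize(control_guess) != normalize(control_answer):
        return False  # failed the known word: discard the other guess too
    unknown.votes[normalize(unknown_guess)] += 1
    return True


def resolved(unknown: UnknownWord, min_votes: int = 3, min_share: float = 0.75):
    """Return the agreed reading once enough users agree, else None.

    min_votes and min_share are hypothetical thresholds for this sketch.
    """
    total = sum(unknown.votes.values())
    if total < min_votes:
        return None
    answer, count = unknown.votes.most_common(1)[0]
    return answer if count / total >= min_share else None
```

The point of the design is that a guess for the unknown word is never trusted on its own: it only counts when the same user also got the known word right, and it only becomes the accepted reading once several such users agree.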

To give you an idea of how fast ReCaptcha works: about 20 years of the New York Times archive were digitized in only a few months. 20 years. Read more about it here or watch Luis do some splainin’. I don’t think I have been this excited about a computer application since Tetris came out.

[H/T: Judy Sims]