ReCaptcha: digitizing books word by word

28 07 2009

Raise your hand if you are familiar with the work of Luis von Ahn, Assistant Professor of Computer Science at Carnegie Mellon University. Luis is one of the masterminds behind ‘Captcha’, the program that protects websites from automated spamming. Captcha is nerdspeak for ‘Completely Automated Public Turing Test To Tell Computers and Humans Apart’ and looks like this:

Luis has recently come up with ReCaptcha, a two-birds-with-one-stone program that not only stops spam but also helps to archive human knowledge by digitizing “books, newspapers and old radio shows”. Oh, the brilliance. Here’s how it works. ReCaptcha takes words that Optical Character Recognition (OCR, the standard way of digitizing printed text) cannot read correctly and uses them as Captchas.

Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
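For the programmers in the room, the scheme described above can be sketched in a few lines of Python. This is only a rough illustration of the idea, not ReCaptcha's actual code: the names `UnknownWord`, `submit` and `min_votes` are made up for this example, and the agreement threshold of three matching answers is an assumption, since the description only says "a number of other people".

```python
from collections import Counter
from typing import Optional


def normalize(word: str) -> str:
    """Compare answers case-insensitively, ignoring surrounding whitespace."""
    return word.strip().lower()


class UnknownWord:
    """Collects human guesses for one word that OCR could not read."""

    def __init__(self, min_votes: int = 3) -> None:
        # Hypothetical threshold: how many matching answers we want
        # before trusting a reading. Not a figure from the source.
        self.min_votes = min_votes
        self.votes: Counter = Counter()

    def add_vote(self, guess: str) -> None:
        self.votes[normalize(guess)] += 1

    def accepted_reading(self) -> Optional[str]:
        """Return the most popular reading once enough people agree on it."""
        if not self.votes:
            return None
        guess, count = self.votes.most_common(1)[0]
        return guess if count >= self.min_votes else None


def submit(known_answer: str, known_guess: str,
           unknown: UnknownWord, unknown_guess: str) -> bool:
    """Let the user through only if they solved the known word.

    When they did, their answer for the unknown word counts as one vote.
    """
    if normalize(known_guess) != normalize(known_answer):
        return False  # failed the control word, so ignore the other answer
    unknown.add_vote(unknown_guess)
    return True
```

In this sketch, a word only counts as digitized once `accepted_reading()` returns something other than `None`. A user who gets the known word wrong is turned away and never influences the result.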

To give you an idea of how fast ReCaptcha works: about 20 years of the New York Times archive were digitized in only a few months. 20 years. Read more about it here or watch Luis do some splainin’. I don’t think I have been this excited about a computer application since Tetris came out.

[H/T: Judy Sims]

