Friday, October 31, 2008

Google is doing OCR on PDF-wrapped document scans

The Google blog struggles to explain why their latest technical achievement is important...
Official Google Blog: A picture of a thousand words?

... We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format... This is a small but important step forward in our mission of making all the world's information accessible and useful...
So why is this important, yet hard to appreciate?

The first problem is that most people think of PDF as a text container. Indexing a text container is nothing special. What's less appreciated is that PDF is the de facto standard way to package a scanned document [1].

So what's novel about doing character recognition on a scan? OCR on 600 dpi B&W document scans is no great trick. Adobe's PDF client has more or less done that for about 10 years [2], and the ancient, under-appreciated document imaging tool in Windows (formerly Xerox) has had this ability since the dawn of time.

The trick is implementing this affordably across the millions, perhaps billions, of PDFs indexed on Google's servers.

That's impressive, and it is going to open up a vast amount of knowledge.

Good for you, Google!

[1] Back in the 90s, before there was any clear answer to the scanned document representation question, I figured this would happen. There are bizarre technical issues with scanning into PDF, but it's a great format overall. (Hint: Ancient fax-style lossless compression of a "B&W" document scan is much more efficient than readable JPEG compression of a grayscale scan.)
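
To put rough numbers on the hint in [1], here is a minimal Python sketch. It assumes a US Letter page scanned at 600 dpi (the page size is my assumption, not something from the Google post) and compares raw sizes before any compression. The run_lengths helper is a hypothetical, simplified stand-in for fax-style coding. Real fax schemes such as CCITT Group 4 are more elaborate, but they exploit the same long runs of white and black pixels.

```python
# Back-of-envelope sketch: raw scan sizes, plus a toy run-length pass.
# Assumptions: US Letter page, 600 dpi, synthetic scanline data.
DPI = 600
WIDTH_IN, HEIGHT_IN = 8.5, 11.0  # assumed page size in inches


def raw_scan_bytes(bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one page at the given bit depth."""
    width_px = int(WIDTH_IN * DPI)
    height_px = int(HEIGHT_IN * DPI)
    return width_px * height_px * bits_per_pixel // 8


def run_lengths(row):
    """Collapse a row of 0/1 pixels into [value, count] runs.

    This is only the basic idea behind fax-style coding, not an
    implementation of any real fax standard.
    """
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1
        else:
            runs.append([pixel, 1])
    return runs


bilevel_bytes = raw_scan_bytes(1)  # "B&W": 1 bit per pixel
gray_bytes = raw_scan_bytes(8)     # grayscale: 8 bits per pixel

# A made-up scanline: white margin, two pen strokes, white to the edge.
row = [0] * 600 + [1] * 12 + [0] * 30 + [1] * 9 + [0] * 4449

print(f"1-bit raw page: {bilevel_bytes:,} bytes")
print(f"8-bit raw page: {gray_bytes:,} bytes")
print(f"{len(row)} pixels in the scanline collapse to {len(run_lengths(row))} runs")
```

The B&W scan starts out eight times smaller than the grayscale one, and a mostly white scanline of 5,100 pixels reduces to a handful of runs. JPEG has to spend bits on 8-bit samples and tends to blur sharp text edges at high compression, which is roughly why the fax-style approach wins for text pages.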

[2] They sometimes hide or remove this feature depending on what else they're selling.

