Official Google Blog: A picture of a thousand words?
So why is this important, yet hard to appreciate?
... We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format... This is a small but important step forward in our mission of making all the world's information accessible and useful...
The first problem is that most people think of PDF as a text container. Indexing a text container is nothing special. What's less appreciated is that PDF is the de facto standard way to package a scanned document [1].
So what's novel about doing character recognition on a scan? OCR on 600 dpi B&W document scans is no great trick. Adobe's PDF client has more or less done that for about 10 years [2], and Windows' (formerly Xerox) ancient and under-appreciated document imaging has had this ability since the dawn of time.
The trick is implementing this affordably on millions and billions of PDFs indexed on Google's servers.
That's impressive, and it is going to open up a vast amount of knowledge.
Good for you, Google!
[1] Back in the 90s, before there was any clear answer to the scanned document representation question, I figured this would happen. There are bizarre technical issues with scanning into PDF, but it's a great format overall. (Hint: Ancient fax-style lossless compression of a "B&W" document scan is much more efficient than readable JPEG compression of a grayscale scan.)
[2] They sometimes hide or remove this feature depending on what else they're selling.
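For the curious, here is a rough sketch of why the hint in [1] holds. Fax-style compression is built on run lengths: a "B&W" scan line is mostly long stretches of white, so it can be stored as a handful of run counts instead of thousands of pixels. This is only a toy run-length encoder, not the actual fax (CCITT) scheme that PDF uses, and it makes no attempt to compare against JPEG. The scan line, its width, and the stroke positions are all made up for illustration.

```python
from itertools import groupby

def run_lengths(row):
    """Collapse a row of 0/1 pixels into (pixel_value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(row)]

def decode(runs):
    """Expand (pixel_value, run_length) pairs back into the original row."""
    row = []
    for value, length in runs:
        row.extend([value] * length)
    return row

# Hypothetical scan line: 8.5 inches at 600 dpi = 5100 pixels,
# white (0) with four 30-pixel black (1) strokes. Positions are invented.
WIDTH = 5100
row = [0] * WIDTH
for start in (400, 1200, 2500, 4100):
    row[start:start + 30] = [1] * 30

runs = run_lengths(row)

# Lossless: decoding the runs gives back exactly the same pixels.
assert decode(runs) == row

raw_bilevel_bytes = (WIDTH + 7) // 8   # 1 bit per pixel, rounded up
raw_grayscale_bytes = WIDTH            # 8 bits per pixel

print("runs needed for this line:", len(runs))
print("raw 1-bit bytes:", raw_bilevel_bytes)
print("raw 8-bit grayscale bytes:", raw_grayscale_bytes)
```

The point is only that a typical text scan line boils down to a few run counts, while a grayscale version of the same line starts out eight times larger before any compression is applied.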