
Google today said on its official blog that it has developed optical character recognition technology to the point that its search engine can read any scanned document in Adobe’s PDF format, effectively turning scanned images into words that are searchable and indexable. It’s no secret that Google has been looking into OCR; the Mountain View, Calif.-based company’s efforts to make books and newspapers digitally searchable are part of its broader push to expand the parameters of search.

Expanding search to PDF files may be a small step toward digitizing the web, as Google says, but it’s also a step toward throwing more light on the dark web, the mass of data that is many times bigger than the scope of what Google searches currently yield. Data on the dark web is unsearchable because it is password protected, dynamically created, shared on peer-to-peer networks — or because it is in file formats such as PDFs that, until now, weren’t easily searchable.

There has been speculation recently that Google may use its Chrome browser to try to index the dark web, and an extension of search to include private networks may sound sinister to some. But data isn’t always hidden out of privacy concerns, and anyone who needs to share scanned documents across the Internet — and who wants to make those files easily searchable to others — will probably welcome the expanded capability.


  1. I think this is a great move. There are a lot of PDF docs online and it would be great to be able to search this content.

  2. Yes, I noticed this yesterday at edge-op.org, which has scanned PDFs of the evidence in the Microsoft Iowa consumer case.

    Try for example the following search:

    office html site:edge-op.org filetype:pdf

    Note that these scanned PDFs are generally of pretty low quality, so there are plenty of errors in the resulting OCR’d text (which you can see if you use Google’s “View as HTML” option).

    But if you are lucky, your search keyword might have been OCR’d properly …

    It’s pretty cool.

  3. Hi Brigid,

    Yesterday I went to do a Google search and I found a bunch of new Digg-style social voting features on the results page. I know this has been covered before and screenshots have circulated, but I don’t think anybody has put out a video documenting the experimental new features, so I put one together:

    http://www.metacafe.com/watch/1942391/googles_new_social_voting_features/

    Let me know if you or anybody in the community wants more info on how the features work and I would be happy to post another video, answer questions or get some screenshots. I’m a big fan of GigaOm, so I am very interested to see how the community would react to this video.

    Thanks,
    Justin Baker

  4. Google Indexes Scanned Docs – Brings Light To Dark Web | WildBlueSkies – Trends and strategies in Digital Media Friday, October 31, 2008

    [...] Brigid Gaffikin at GigaOm says, this is also a step towards lighting up the dark web – the huge mass of data that is unsearchable [...]

  5. Can You Speak the Language of the 21st Century? — Hidden Business Treasures Sunday, February 15, 2009

    [...] Twitter you might have made your way to this: “Google’s PDF Search Throws Some Light on the Dark Web.” It’s about increasing the effectiveness of PDF search. [...]
