
Google's PDF Search Throws Some Light on the Dark Web

Google today said on its official blog that it has developed optical character recognition technology to the point that its search engine can read any scanned document in Adobe’s PDF format, effectively turning scanned images into words that are searchable and indexable. It’s no secret that Google has been looking into OCR; the Mountain View, Calif.-based company’s efforts to make books and newspapers digitally searchable are also related to its broader efforts to expand the parameters of search.

Expanding search to PDF files may be a small step toward digitizing the web, as Google says, but it’s also a step toward throwing more light on the dark web, the mass of data that is many times bigger than the scope of what Google searches currently yield. Data on the dark web is unsearchable because it is password protected, dynamically created, shared on peer-to-peer networks — or because it is in file formats such as PDFs that, until now, weren’t easily searchable.

There has been speculation recently that Google may use its Chrome browser to try to index the dark web, and an extension of search to include private networks may sound sinister to some. But data isn’t always hidden out of privacy concerns, and anyone who needs to share scanned documents across the Internet — and who wants to make those files easily searchable to others — will probably welcome the expanded capability.

5 Responses to “Google's PDF Search Throws Some Light on the Dark Web”

  1. Justin Baker

    Hi Brigid,

    Yesterday I went to do a Google search and found a bunch of new Digg-style social voting features on the results page. I know this has been covered before and screenshots have circulated, but I don’t think anybody has put out a video documenting the experimental new features, so I put one together:

    Let me know if you or anybody in the community wants more info on how the features work, and I would be happy to post another video, answer questions or get some screenshots. I’m a big fan of GigaOM, so I am very interested to see how the community reacts to this video.

    Justin Baker

  2. Yes, I noticed this yesterday at, which has scanned PDFs of the evidence in the Microsoft Iowa consumer case.

    Try for example the following search:

    office html filetype:pdf

    Note that these scanned PDFs are generally of pretty low quality, so there are plenty of errors in the resulting OCR’d text (which you can see if you use Google’s “View as HTML” option).

    But if you are lucky, your search keyword might have been OCR’d properly …

    It’s pretty cool.