Google today said on its official blog that it has developed optical character recognition technology to the point that its search engine can read any scanned document in Adobe’s PDF format, effectively turning scanned images into words that are searchable and indexable. It’s no secret that Google has been looking into OCR; the Mountain View, Calif.-based company’s work to make books and newspapers digitally searchable is part of its broader effort to expand the parameters of search.
Expanding search to PDF files may be a small step toward digitizing the web, as Google says, but it’s also a step toward throwing more light on the dark web, the mass of data that is many times larger than what Google searches currently turn up. Data on the dark web is unsearchable because it is password protected, dynamically created or shared on peer-to-peer networks, or because it is in file formats, such as scanned PDFs, that until now weren’t easily searchable.
There has been speculation recently that Google may use its Chrome browser to try to index the dark web, and an extension of search to include private networks may sound sinister to some. But data isn’t always hidden out of privacy concerns, and anyone who needs to share scanned documents across the Internet — and who wants to make those files easily searchable to others — will probably welcome the expanded capability.