Content protectionism is heating up. After Rupert Murdoch last year complained about sites that “steal” News Corp (NYSE: NWS) stories, his UK papers this week stopped the British aggregator NewsNow from indexing their sites.

These big walls are being built with one small file – the robots.txt exclusion standard, which lets website owners like News Corp dictate which crawlers may visit their sites and what those crawlers may index.
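
In practice, shutting a crawler out takes only a couple of lines. A robots.txt along these lines – the bot names here are made up for illustration, not copied from any publisher’s actual file – turns away two named spiders while leaving everyone else free to index:

    # Hypothetical robots.txt: bot names are illustrative
    User-agent: SomeAggregatorBot
    Disallow: /

    User-agent: SomeImageBot
    Disallow: /

    # Every other crawler may index the whole site
    User-agent: *
    Disallow:

Compliant crawlers fetch this file before anything else and obey it; nothing in the standard forces a rogue spider to do so.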

Google (NSDQ: GOOG) specifically advises copy-averse publishers to roll out robots.txt. And News Corp is doing just that toward some search engines (though not yet toward Google). Here’s how Murdoch is already using the protocol to block more than just NewsNow – and how other publishers are deploying robots.txt, too…

Times Online, Sun Online and NOTW.co.uk: All now block…
– NewsNow
– the Alexa search engine
– the discussion search engine Omgili
– the Web spider WebVac, used by Stanford University
– WebZip, an application to save entire websites offline
– PicSearch, a Swedish image search site

WSJ.com: Blocks only MSNPTC 1.0, a spider believed to belong to Microsoft’s adCenter; the same spider is also barred by Marketwatch.com.

New York Post: Blocks MetaCarta, a service that crawls news stories to extract geographic information for placement on maps and other location services.

Other habits…

– None of the other main UK newspaper sites (Guardian.co.uk, Telegraph.co.uk, Mail Online, Independent.co.uk and Mirror.co.uk) outright block any search services.

– But some don’t want to show off their mobile versions – Telegraph.co.uk blocks crawling of its cut-down story pages, and FT.com blocks Google’s mobile spider (a sample of that kind of rule follows this list).

– Mirror.co.uk uses robots.txt to hide several racy stories that it seems to have removed for legal reasons.
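
Rules like the Telegraph’s and the FT’s are similarly short. A hypothetical fragment in that spirit – the /mobile/ path is made up, though Googlebot-Mobile is the token Google documents for its mobile crawler – would read:

    # Keep Google's mobile crawler out entirely
    User-agent: Googlebot-Mobile
    Disallow: /

    # Hide cut-down mobile story pages from all crawlers (path is hypothetical)
    User-agent: *
    Disallow: /mobile/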

NYTimes.com: Blocks everyone from indexing the content it publishes from numerous syndication partners (including paidContent.org) – and doubles up that wall against Google by repeating the same “disallow” commands in a Googlebot-specific section. It also makes a specific point of clearing Google’s AdSense crawler to roam the site.
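
In robots.txt terms that pattern looks roughly like the fragment below. The /syndicated/ path is hypothetical, but Mediapartners-Google is the real user-agent name of the AdSense crawler:

    # Keep every crawler out of syndicated partner content (path is hypothetical)
    User-agent: *
    Disallow: /syndicated/

    # Repeat the same rule in a Googlebot-specific section – the doubled wall
    User-agent: Googlebot
    Disallow: /syndicated/

    # Explicitly clear the AdSense crawler to read everything
    User-agent: Mediapartners-Google
    Disallow: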

WashingtonPost.com: Stops many spiders from crawling photos, syndicated stories, audio and video files, text files and files that make up individual parts of story pages.

USAToday.com: Blocks everyone from crawling sports stories from before 1999, old Election 2008 pages and numerous system files.
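
Anyone curious whether a particular spider is barred from a particular page can test a site’s rules with Python’s standard-library robots.txt parser; the user-agent string and URL below are placeholders, not examples drawn from this article:

    # Check whether a crawler may fetch a URL under a site's robots.txt rules
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("https://www.example.com/robots.txt")
    parser.read()  # download and parse the file

    ok = parser.can_fetch("SomeAggregatorBot", "https://www.example.com/news/story.html")
    print("allowed" if ok else "blocked by robots.txt")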

Comments

  1. The problem with robots.txt is that it is only effective if the spider respects it. It is like leaving your doors unlocked but putting a sign in your window saying “please don’t rob me” and hoping that stops the burglars.

    If any of those sites or search engines really wants News Corp’s content, that file is not going to stop them from taking it.
  2. The Mirror’s robots.txt file is brilliant. It might as well publish a guide to all the times it’s been injuncted or threatened with a libel suit!
  3. As AJ pointed out, the robots.txt defense only works for honest search engines and aggregators. Other possible techniques I’ve seen used:

    a) Blocking web hosting providers.

    b) Blocking Amazon cloud services (a hotbed of rogue aggregators and Web 2.0 content thieves).

    c) Blocking Asia wholesale (I know it’s not fair, but for many sites the cost of serving Asian customers simply outweighs the potential benefits, like advertising sales).

    d) Setting up “spider traps” and automatically blocking any IP that falls into them (a rough sketch follows).
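
    A trap of the kind described in (d) can be as simple as a link that is hidden from human readers, disallowed in robots.txt, and watched on the server: any client that requests it anyway gets banned. The toy Python server below is only a sketch – the path and port are made up, and a real site would push the ban into its firewall or CDN rather than keep it in application memory:

        # Toy spider trap: ban any IP that requests a hidden trap URL
        # (path and port are illustrative, not from any real publisher)
        from http.server import BaseHTTPRequestHandler, HTTPServer

        TRAP_PATH = "/hidden-trap-link/"   # linked invisibly in pages, disallowed in robots.txt
        blocked_ips = set()

        class TrapHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                ip = self.client_address[0]
                if ip in blocked_ips:
                    self.send_error(403, "Blocked")
                    return
                if self.path.startswith(TRAP_PATH):
                    blocked_ips.add(ip)    # fell into the trap: ban this IP from now on
                    self.send_error(403, "Blocked")
                    return
                # Normal response for everyone else
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(b"Regular page content\n")

        if __name__ == "__main__":
            HTTPServer(("", 8000), TrapHandler).serve_forever()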
