Update: The robots.txt files have been updated, and crawlers are now being allowed.
Update: Mohit Hira, Director of Marketing at Indiatimes, informs us that the company was in the midst of maintenance and upgrades, which required them to temporarily block all crawlers until today. These crawlers will be reinstated in phases today – as was the plan.
Original Story: It appears that BCCL-owned Times Internet Ltd, which operates the websites for India’s largest English dailies, the Economic Times and the Times of India, is preventing all web crawlers from accessing content on those websites. That means that sites like Google (NSDQ: GOOG) News, and indeed Google, Yahoo (NSDQ: YHOO) and the recently launched In.com aggregator from competitor Web18, will not be able to access and index news content from the Economic Times and Times of India.
Crawlers follow a Robots Exclusion Protocol (or robots.txt protocol) which “is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable.” Here’s the content of the robots.txt files at Indiatimes’ portals:
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
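To see what these two lines mean for a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser` module. The helper name `is_allowed`, the `Googlebot` user-agent string and the `example.com` URL are illustrative assumptions, not taken from the Indiatimes sites; the rules themselves are the ones quoted above.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, verbatim (inline comments included).
RULES = """\
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a cooperating crawler may fetch `url` under these rules.

    Hypothetical helper for illustration: it parses the robots.txt text
    directly instead of downloading it from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; any path on the site is treated the same way.
    url = "http://example.com/news/story.html"
    print(is_allowed(RULES, "Googlebot", url))
    # With no rules at all, the same crawler would be let in.
    print(is_allowed("", "Googlebot", url))
```

Because `User-agent: *` matches every crawler and `Disallow: /` matches every path, the first check fails for any user agent and any page. Note that the protocol only binds cooperating crawlers: it is a request, not an access control.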
See for yourself: the robots.txt files for Times Of India, Economic Times, Indiatimes Cricket and Sports at Indiatimes. I’m wondering whether this is by mistake or by design: has TIL decided to block all crawlers, or is this a response to Times Internet content being indexed by Web18’s meta aggregator In.com? Traffic from search is significant for media portals (particularly with Google’s QDF tweak to search listings), and surely Times Internet will lose traffic by preventing indexing of its pages. We’ll try to get details from Times Internet.