Update: The robots.txt files have been updated, and crawlers are now being allowed.

Update: Mohit Hira, Director of Marketing at Indiatimes, informs us that the company was in the midst of maintenance and upgrades, which required them to temporarily block all crawlers until today. These crawlers will be reinstated in phases today – as was the plan.

Original Story: It appears that BCCL-owned Times Internet Ltd, which operates the websites of India’s largest English dailies, the Economic Times and the Times of India, is preventing all web crawlers from accessing content on those websites. That means that Google (NSDQ: GOOG) News, and indeed search engines such as Google and Yahoo (NSDQ: YHOO), as well as the recently launched In.com aggregator from competitor Web18, will not be able to access and index news content from the Economic Times and the Times of India.

Sure enough, here’s proof: the last Economic Times story indexed by Google News was six days ago – on 21st May 2008. Ditto for the Times of India.

Crawlers follow a Robots Exclusion Protocol (or robots.txt protocol) which “is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable.” Here’s the content of the robots.txt files at Indiatimes’ portals:


User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
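To see what those two lines mean for a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The URL and the `ExampleAggregatorBot` name are hypothetical placeholders, and the second rule set is an assumed contrast case that does not come from any Indiatimes file; it only illustrates how blocking one named crawler differs from blocking all of them.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above, including their inline comments.
BLOCK_ALL = """\
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
"""

# Hypothetical contrast case (an assumption, not from the source):
# a rule set that shuts out one named crawler and leaves the rest alone.
BLOCK_ONE = """\
User-agent: ExampleAggregatorBot
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL; any path on the site is covered by "Disallow: /".
URL = "http://example.com/news/story.html"

for agent in ("Googlebot", "Slurp", "ExampleAggregatorBot"):
    print(agent,
          "| block-all:", can_crawl(BLOCK_ALL, agent, URL),
          "| block-one:", can_crawl(BLOCK_ONE, agent, URL))
```

Under the block-all rules every cooperating crawler is refused, whatever its name. Note that the protocol is advisory: it only affects robots that choose to honour it.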

See for yourself: the robots.txt files for Times Of India, Economic Times, Indiatimes Cricket and Sports at Indiatimes. I’m wondering whether this is by mistake or by design: has TIL decided to block all crawlers, or is this a response to Times Internet content being indexed by Web18’s meta aggregator In.com? Traffic from search is significant for media portals (particularly with Google’s QDF tweak to search listings), and Times Internet will surely lose traffic by preventing indexing of its pages. We’ll try to get details from Times Internet.

  1. They can decide to block a specific crawler, like that of "In.com".

  2. another walled garden, exactly what we need :D

  3. I don't think that Google downloads the robots.txt every day, so it might take a couple of days until it shows any effect.

  4. Hi…wish someone had checked with us before posting this story but what's done is done :-)

    The fact is that we are in the midst of a routine upgrade and maintenance check that required us to temporarily block all crawlers until today. These crawlers will be reinstated in phases today – as was the plan. It would be stupid for TIL to block news aggregators such as Google or anyone else, for that matter. And, contrary to what some Alootechie readers may think, we ain't stupid! Cheers!

  5. Nikhil Pahwa Tuesday, May 27, 2008

    Mohit – I've been trying to contact you for a confirmation regarding this. I called the Indiatimes office, but they weren't able to connect me to corp-comm. I couldn't contact you either. I found a corp-comm number on the net, but that person appears to have left Indiatimes.

  6. Nikhil Pahwa Tuesday, May 27, 2008

    Oh, and this isn't Alootechie.

  7. Apologies for mixing up the site names – truly! You know where to find me…anytime. Cheers again!

  8. Thanks Nitin for this post. I was wondering what was up with TIL and associated sites, since I was unable to access content via Google search. Has this been going on for a week or so? That is the feeling I get. What was even more puzzling is that they have published an article of mine and I am unable to find it through Google Search.

    Kamla Bhatt

