

Google revamped its search indexing methodology earlier this week, news that was quickly eclipsed by chatter about it adding (and then taking away) a background image on its home page. But the Jeff Koons and Dale Chihuly pictures were red herrings, distracting us from some really big technology changes that could influence those trying to deliver the real-time web in the years to come, much as the company’s MapReduce and Google File System inspired the creation of Hadoop.

Google late Wednesday announced that its search engine was now running on “Caffeine,” meaning search results would show more recent information posted on the web, because the technology behind it indexes pages in seconds rather than weeks. But the allure of Caffeine isn’t just the real-time search engine itself; it’s the secret infrastructure sauce that Google uses to make a continuously updated, real-time index of the entire web possible. That technology involves custom-built hardware and software that plenty of startups or business analytics companies would love to get their hands on.

I’m not going to be much help in telling folks anything on the hardware side because Matt Cutts, Google’s search guru, was fairly tight-lipped in an interview with me. He described the new indexing process as a cache and batch process job, whereby files aren’t stored for long before they’re indexed. Such an effort, especially on the scale of petabytes as Cutts described, would involve bringing the memory close to the physical compute, likely through SSDs and faster interconnects on the hardware side.

But even though Google’s custom-built hardware is a source of speculation, it’s the software that makes it possible to exploit the power of commodity parts in Google’s custom gear. And it’s the software that really gives Google the advantage. Cutts admitted last year, when the company first announced its Caffeine efforts, that the software underpinning it all was a next-generation version of the Google File System dubbed GFS2. From a report in The Register written last year detailing the changes in GFS2:

But GFS supports some applications better than others. Designed for batch-oriented applications such as web crawling and indexing, it’s all wrong for applications like Gmail or YouTube, meant to serve data to the world’s population in near real-time.

“High sustained bandwidth is more important than low latency,” read the original GFS research paper. “Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response-time requirements for an individual read and write.”

The GFS2 that underpins Caffeine has multiple masters and slaves so it can scale out and offer redundancy. It also shrinks the chunks that files are broken into from 64MB to 1MB, according to the information cited by The Register. Cutts explained to me that Caffeine offers a better ability to annotate smaller amounts of data and attach those annotations to relevant files. It’s only logical that shrinking the overall chunks of data makes for more granular information retrieval, but it can also make it possible to update only certain parts of a site, which by extension reduces the amount of data that has to be processed and sent around the data center.
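
To make the chunk-size argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not Google’s implementation. It assumes, purely for illustration, that a whole chunk is the smallest unit that gets reprocessed when any byte inside it changes. The function names and the 10KB edit are hypothetical.

```python
MB = 1024 * 1024


def chunks_touched(offset: int, length: int, chunk_size: int) -> range:
    """Return the indices of the fixed-size chunks overlapped by a byte range."""
    if length <= 0:
        return range(0)
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return range(first, last + 1)


def bytes_reprocessed(offset: int, length: int, chunk_size: int) -> int:
    """Bytes that must be handled again if whole chunks are the unit of work (an assumption)."""
    return len(chunks_touched(offset, length, chunk_size)) * chunk_size


# Hypothetical example: a 10KB change located 200MB into a large file.
edit_offset = 200 * MB
edit_length = 10 * 1024

for size_mb in (64, 1):
    cost = bytes_reprocessed(edit_offset, edit_length, size_mb * MB)
    print(f"{size_mb}MB chunks: reprocess {cost // MB}MB")
```

Under that simplifying assumption, the same small edit touches one chunk either way, but the work attached to it drops from 64MB to 1MB, which is the kind of saving the smaller chunks are meant to deliver.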

Google’s solutions for scaling up to deliver information and web services cheaply and quickly everywhere have influenced the way coders have built their own web-scale services, even though Google has kept code like BigTable proprietary. As it seeks to optimize its code for the real-time web, and to reduce latency across all of its products, it may continue to influence folks, even as startups like Twitter, Northscale, Facebook and others seek ways to stay on top of the real-time flow of information and offer their own efforts to the open source community.
