Google revamped its search indexing methodology earlier this week, a change quickly eclipsed by chatter about the company adding (and then removing) a background image on its home page. But the Jeff Koons and Dale Chihuly pictures were red herrings, distracting us from some genuinely big technology changes that could influence those trying to deliver the real-time web in the years to come, much as the company’s MapReduce and Google File System papers inspired the creation of Hadoop.
Google announced late Wednesday that its search engine is now running on “Caffeine,” meaning search results will surface more recent content, because the technology behind it indexes new pages in seconds rather than weeks. But the allure of Caffeine isn’t just the real-time search engine itself; it’s the secret infrastructure sauce Google uses to maintain a continuously updated index of the entire web. That technology involves custom-built hardware and software that plenty of startups and business analytics companies would love to get their hands on.
I won’t be much help on the hardware side, because Matt Cutts, Google’s search guru, was fairly tight-lipped in an interview with me. He described the new indexing process as a cache-and-batch job, in which files aren’t stored for long before they’re indexed. Such an effort, especially at the petabyte scale Cutts described, would likely involve bringing memory close to the physical compute, through SSDs and faster interconnects on the hardware side.
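Google hasn’t published Caffeine’s internals, so as a rough illustration only, here’s what the cache-and-batch idea Cutts described might look like in miniature: crawled documents sit briefly in a small in-memory buffer, then get merged into the index in frequent micro-batches rather than one monolithic weekly run. All names here are hypothetical; this is a toy sketch, not Google’s code.

```python
from collections import defaultdict

class MicroBatchIndexer:
    """Toy inverted index that flushes small batches frequently,
    rather than rebuilding the whole index in one huge batch job."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self.pending = []              # short-lived cache of crawled docs
        self.index = defaultdict(set)  # term -> set of doc ids

    def add_document(self, doc_id, text):
        """Cache a crawled document; flush once the small batch fills up."""
        self.pending.append((doc_id, text))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Merge the cached documents into the inverted index."""
        for doc_id, text in self.pending:
            for term in text.lower().split():
                self.index[term].add(doc_id)
        self.pending = []
```

With a small batch size, a newly crawled page becomes searchable moments after it is fetched, which is the behavior Caffeine promises, even if the real pipeline is vastly more sophisticated.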
But while Google’s custom-built hardware is a source of speculation, it’s the software that makes it possible to exploit the power of the commodity parts in Google’s custom gear, and it’s the software that really gives Google its edge. When the company first announced its Caffeine effort last year, Cutts acknowledged that the software underpinning it all was a next-generation version of the Google File System, dubbed GFS2. From a report in The Register last year detailing the changes in GFS2:
But GFS supports some applications better than others. Designed for batch-oriented applications such as web crawling and indexing, it’s all wrong for applications like Gmail or YouTube, meant to serve data to the world’s population in near real-time.
“High sustained bandwidth is more important than low latency,” read the original GFS research paper. “Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response-time requirements for an individual read and write.”
The GFS2 that underpins Caffeine has multiple masters and slaves, so it can scale out and offer redundancy. It also shrinks the chunk size in which data is stored from 64MB to 1MB, according to the information cited by The Register. Cutts explained to me that Caffeine allows Google to better annotate smaller amounts of data and attach those annotations to the relevant files. It’s only logical that shrinking the chunks makes for more granular information retrieval, but it also makes it possible to update only the parts of a site that changed, which in turn reduces both processing time and the amount of data shuttled around the data center.
Google’s solutions for scaling up to deliver information and web services cheaply and quickly everywhere have influenced the way coders build their own webscale services, even though Google has kept code like BigTable proprietary. As it optimizes its code for the real-time web, and works to reduce latency across all its products, it may continue to influence others, even as startups like Twitter, NorthScale, Facebook and more seek ways to stay on top of the real-time flow of information and offer their own efforts to the open-source community.
This article also appeared on BusinessWeek.com