In pretty much every discussion about Hadoop, the question arises of who’s using it outside of web companies. You know, in financial services, manufacturing and other vertical markets that signify whether a technology has really made it. It’s a fair question, but somewhat misguided (even when it’s me asking it) because most of these implementations are early-phase, with users just testing the waters. Hadoop was born from the web and it was web companies, with their extreme needs, that showed what the framework is capable of. To see what’s really going on with Hadoop, we should keep looking to the web and, increasingly, to Facebook in particular.
A brief history lesson for the unfamiliar: Hadoop is based on Google’s MapReduce parallel-processing engine and the Google File System, and was developed by Yahoo to power its search engine and analyze the data it collects from its various web operations. Yahoo maintains its own open source Hadoop distribution and is a key contributor to the Apache project. Cloudera also provides its own distribution, as well as a collection of tools that will help get Hadoop into the mainstream. But most IT innovation comes from companies on the bleeding edge, those — like Yahoo several years ago — with such high demands on their infrastructures that existing tools just won’t cut it. Today, that’s Facebook, and it released details last week that suggest the social-network phenomenon might be poised to take the reins of Hadoop development.
A Dec. 10 post on Facebook Engineering’s Notes page described the latest “rockstars” in the company’s Hadoop code base: AvatarNode, RaidNode and Appends. All of these tools have been open sourced and/or contributed back to Apache. You can read that post for the details, but the gist is that they address some very real problems with Hadoop, including NameNode high availability, storage-volume reduction and synchronization between HDFS and HBase.
Developing these tools was important for Facebook because it relies on Hadoop to handle incredible amounts of data for production applications. As noted in the post, Facebook stores and analyzes a web analytics data warehouse in Hadoop, uses it to back up the Hadoop-based HBase database that underpins its new Messages feature, and uses it to back up its massive MySQL implementation. Facebook maintains several Hadoop clusters in multiple data centers, with its largest spanning 3,000-plus machines and storing 30 petabytes of data. Post author Dhruba Borthakur calls it “possibly the largest single Hadoop cluster in the world.” Based on the data I’ve seen, he might be right. Among Hadoop World 2010 attendees who responded to a survey by Cloudera, the largest clusters appear to be in the 1,000-node and 1-petabyte range (more on this in a forthcoming report).
Of course, these new features are hardly the biggest contributions Facebook has made to Hadoop. It is also responsible for Hive, the SQL-like query language that is an integral part of the Apache, Yahoo and Cloudera distributions. If Facebook carries through on its HighTide project, which aims to let Hadoop clusters span data centers, Hive might not even be Facebook’s most important contribution. Being able to manage smaller, geographically distributed clusters would make operations a lot easier and let users across organizations query the same datasets without replicating them locally.
Who’ll succeed Facebook as the next great Hadoop champion? First it was Google, then it was (and still is) Yahoo, and Facebook looks like the next in line. After Facebook, maybe it will be Twitter, which is doing some interesting work of its own as that site grows. Or maybe it will be Amazon Web Services or Google (again), both of which are fusing Hadoop with their cloud computing platforms. A long shot is Apple, which is at least searching for a Hadoop engineer to build an ETL infrastructure for its iAd mobile-advertising program. That would require a degree of openness beyond that for which Apple is known, but the company does have a member on the Apache Hadoop Project Management Committee, which is dominated by Yahoo, Facebook and Cloudera.