Hadoop, an open-source software program that helps process incredibly large data sets, has been generating plenty of buzz. The upcoming Hadoop Summit on June 10 marks a midway point in an eventful year for the technology. Cloudera, a high-profile startup that’s building commercial services around Hadoop, just announced $6 million in funding. A few weeks ago, the Yahoo! (s yhoo) Developer Network revealed a new record in data sorting using the software. And more recently, rumors have emerged that even Microsoft (s msft) is making use of Hadoop through its acquisition of Powerset.
So what’s next for Hadoop? Certainly more adoption and experimentation across all types of customers and markets. It’s gaining recognition as a tool that will have broad implications in areas like retail and biotech. And it isn’t used for real-time queries yet, but just give the tech community some time. We’re in the midst of a data-mining renaissance, and Hadoop is playing a leading role.
What’s behind the software that’s making it the data darling of giants like Yahoo! and Microsoft, and even Amazon (s amzn) Web Services, which offers Hadoop as an add-on to EC2? Hadoop is a top-level Apache project that can harness huge clusters of computers to produce fast results for queries and other large processing jobs. Yahoo! uses it for Search Assist, the Yahoo!-branded feature that suggests completions as you type a search query. (Google calls its similar feature Google Suggest.)
Search Assist delivers real-time suggestions while you type, saving time and providing better results. The suggestions come from analyzing years of log data on every search and the terms used. For a company the size of Yahoo!, this amounts to terabytes of log files a day and hundreds of terabytes a year. Before Hadoop came along, creating the Search Assist database took Yahoo! 26 days. Now it takes 20 minutes.
The Hadoop software framework is made up of two key elements: the Hadoop Distributed File System (HDFS), modeled on the Google File System, and the distributed processing framework, which implements a version of MapReduce, the programming model popularized by Google for partitioning compute jobs across hundreds or thousands of nodes. While the two are designed to work in concert, the processing framework does not strictly require HDFS, and other file systems are sometimes used. Hadoop runs on commodity hardware and multiple operating systems.
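The map-shuffle-reduce pattern at the heart of the framework can be illustrated with a few lines of Python. This is a single-process sketch of the model, not Hadoop’s actual API (which is Java-based); the function names here are invented for clarity, and the canonical word-count job stands in for a real cluster workload:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key -- the step Hadoop performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return [reducer(key, values) for key, values in groups]

# Word count, the canonical MapReduce example.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return word, sum(counts)

lines = ["florida vacation", "florida hotels", "florida vacation"]
result = dict(reduce_phase(shuffle(map_phase(lines, mapper)), reducer))
# result: {'florida': 3, 'vacation': 2, 'hotels': 1}
```

On a real cluster, the map and reduce functions look much the same, but the framework scatters the input records across machines and handles the shuffle over the network.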
With Yahoo!’s Search Assist database, the words “Florida vacation” might be a frequent combination, for instance. How does Hadoop process the data to suggest the word following “Florida”? It imports log files containing user searches, partitions those files across individual servers, and finds word commonality, mapping the task out to many machines. But because the tallying occurs on different machines, the results still need consolidation. Individual map nodes send results like “Florida vacation, 6 times” to a reduce node, culminating in a giant list of word groupings and their frequencies. Another MapReduce operation sorts the list by frequency and then phrase, removing less frequent queries and speeding the lookup of the popular word combinations suggested by Search Assist.
This is just one of hundreds of Hadoop applications. In general, for extremely large data sets requiring manipulation and analysis, Hadoop — and one of several add-on programming languages — can easily do the job.
Hadoop was initially developed by Doug Cutting, an open-source expert and now a Yahoo! employee. And while the software itself is likely to flourish in the open-source community, there are still plenty of opportunities to put the capabilities into the hands of more companies and users.
Special thanks to Milind Bhandarkar and Ajay Anand at Yahoo! for their help on researching this piece.