Digging Deeper Into Data With Hadoop


Hadoop, an open-source software framework that helps process incredibly large data sets, has been generating plenty of buzz. The upcoming Hadoop Summit on June 10 marks a midway point in an eventful year for the technology. Cloudera, a high-profile startup that’s building commercial services around Hadoop, just announced $6 million in funding. A few weeks ago, the Yahoo! (s yhoo) Developer Network revealed a new record in data sorting using the software. And more recently, rumors have emerged that even Microsoft (s msft) is making use of Hadoop through its acquisition of Powerset.

So what’s next for Hadoop? Certainly more adoption and experimentation across all types of customers and markets. It’s gaining recognition as a tool that will have broad implications in areas like retail and biotech. And it isn’t used for real-time queries yet, but just give the tech community some time. We’re in the midst of a data-mining renaissance, and Hadoop is playing a leading role.

What’s behind the software that’s making it the data darling of giants like Yahoo! and Microsoft, and even Amazon (s amzn) Web Services, which offers Hadoop as an add-on to EC2? Hadoop is a top-level Apache project, and it is able to harness huge clusters of computers to produce fast results for queries and more. Yahoo! uses it for Search Assist, the Yahoo!-branded suggestion feature shown as you type a search query. (Google calls its similar feature Google Suggest.)

Search Assist delivers real-time suggestions while you type, saving time and providing better results. The suggestions come from analyzing years of log data on every search and the terms used. For a company the size of Yahoo!, this amounts to terabytes of log files a day and hundreds of terabytes a year. Before Hadoop came along, creating the Search Assist database took Yahoo! 26 days. Now it takes 20 minutes.

The Hadoop software framework is made up of two key elements: the Hadoop Distributed File System (HDFS), modeled on the Google File System, and the distributed processing framework, which implements a version of MapReduce, the programming model popularized by Google for partitioning compute jobs across hundreds or thousands of nodes. Though the two are designed to work in concert, the processing framework does not strictly require HDFS, and other file systems are sometimes used. Hadoop runs on commodity hardware and multiple operating systems.
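To make the MapReduce model concrete, here is a minimal single-machine sketch of the idea in Python — a toy word count, not Hadoop’s actual Java API. The `map_phase`, `shuffle`, and `reduce_phase` names are illustrative inventions; in Hadoop, each phase would run in parallel across many nodes.

```python
from collections import defaultdict

# Toy illustration of the MapReduce model (not Hadoop's real API):
# map emits (key, value) pairs, the framework groups them by key,
# and reduce folds each group into a result.

def map_phase(documents):
    """Map: emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["Florida vacation deals", "Florida vacation rentals"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["florida"] == 2, counts["vacation"] == 2
```

In a real cluster, the map tasks run on the machines holding the data, and the shuffle moves intermediate pairs over the network to the reduce tasks.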

With Yahoo!’s Search Assist database, the words “Florida vacation” might be a frequent combination, for instance. How does Hadoop process the data to suggest the word following “Florida”? It imports log files containing user searches, partitions those files across individual servers, and finds word commonality, mapping the task out to many machines. But because the tallying occurs on different machines, the results still need consolidation. Individual Map nodes send results like “Florida vacation, 6 times” to a Reduce node, culminating in a giant list of word groupings and their frequencies. Another MapReduce operation reorders the list by frequency and then by phrase, removing less frequent queries and speeding the lookup of the popular word combinations that Search Assist suggests.
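The two-pass flow described above — tally the phrases, then reorder by frequency so prefix lookups are fast — can be approximated in a few lines of Python. This is a hypothetical miniature with made-up log lines and helper names; Yahoo!’s real pipeline runs these steps as distributed Hadoop jobs over terabytes of logs.

```python
from collections import Counter

# Hypothetical miniature of the Search Assist pipeline: pass one tallies
# phrase frequencies (the map/reduce step), pass two reorders by frequency
# and then phrase so popular suggestions sort first.
query_log = [
    "florida vacation", "florida vacation", "florida keys",
    "florida vacation", "florida weather", "florida keys",
]

phrase_counts = Counter(query_log)               # pass 1: phrase -> frequency
ranked = sorted(phrase_counts.items(),           # pass 2: by frequency (desc),
                key=lambda kv: (-kv[1], kv[0]))  # then by phrase

def suggest(prefix, top_n=2):
    """Return the top_n most frequent phrases starting with prefix."""
    return [phrase for phrase, _ in ranked if phrase.startswith(prefix)][:top_n]

# suggest("florida") -> ["florida vacation", "florida keys"]
```

At Yahoo!’s scale, the ranked list is precomputed offline by Hadoop, which is why rebuilding it dropped from 26 days to 20 minutes while the lookup itself stays real-time.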

This is just one of hundreds of Hadoop applications. In general, for extremely large data sets requiring manipulation and analysis, Hadoop — and one of several add-on programming languages — can easily do the job.

Hadoop was initially developed by Doug Cutting, an open-source expert and now a Yahoo! employee. And while the software itself is likely to flourish in the open-source community, there are still plenty of opportunities to put the capabilities into the hands of more companies and users.

Special thanks to Milind Bhandarkar (bio) and Ajay Anand at Yahoo! for their help on researching this piece.

Gary Orenstein is the author of “IP Storage Networking: Straight to the Core”, host of TheCloudComputingShow.com, and was a co-founder of Nishan Systems (now part of Brocade).


Steve Wooledge


Great article. We are certainly seeing interest for MapReduce implementations like Hadoop and Aster Data Systems beyond Web companies. The implications of lower cost ways to conduct compute-rich analysis on “Big Data” are tremendous.

It’s worth noting that the mainstream organizations that are looking to harness the power of MapReduce for data-intensive applications are also asking to leverage their existing IT ecosystem and skill sets (given budget constraints). Aster Data is focusing on this by integrating MapReduce into our massively parallel SQL database (which provides the enterprise-class features IT organizations expect), as well as enabling developers to write analytic functions and applications in any language they choose and have them execute in-database. This speeds performance, but it also opens the door to a whole new class of developers to solve complex data problems.

Aster just announced support for Microsoft .NET within our In-Database MapReduce framework (the “other 1/2” of custom application developers), which is a big step forward in bringing MapReduce to mainstream businesses.

Hadoop Rocks

Steve, nice segue leading to your product promotion.
