
Hadoop, as a pivotal piece of the data mining renaissance, offers the ability to tackle large data sets in ways that weren’t previously feasible due to time and dollar constraints. But Hadoop can’t do everything quite yet, especially when it comes to real-time workflows. Fortunately, a couple of innovative efforts within the Hadoop ecosystem, such as Hypertable and HBase, are filling the gaps, while at the same time providing a glimpse as to where Hadoop’s full capabilities might be headed.

Currently Hadoop is used by companies to analyze big data, typically with records and counts in the billions. These staggering numbers often come from Internet applications at scale and the resulting log data. But going through the log data happens in batch jobs outside the standard workflow of the web site, not upon every search. As an example, a search site might track each search query and the terms entered. If it then wants to examine the most popular requests to provide an “assist or suggest” capability, it can use Hadoop to crunch through a year or two of prior searches, resulting in a list of popular combinations and their frequencies. This multistep process takes time, however, and cannot execute upon every user search.
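
To make the batch step concrete, below is a minimal sketch of such a job written against Hadoop’s Java MapReduce API. The log layout (tab-separated fields with the query string in the second column), the class names and the input/output paths are assumptions made for illustration, not a description of any particular site’s pipeline.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QueryFrequency {

  // Emits (query, 1) for each log line; assumes the query string is the
  // second tab-separated field (a hypothetical log layout).
  public static class QueryMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text query = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length > 1) {
        query.set(fields[1].trim().toLowerCase());
        context.write(query, ONE);
      }
    }
  }

  // Sums the occurrence counts for each distinct query.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "query-frequency");
    job.setJarByClass(QueryFrequency.class);
    job.setMapperClass(QueryMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // search logs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // frequency counts
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The output is simply a list of (query, count) pairs sitting on HDFS; turning that list into something a web page can consult on every keystroke is where the storage layer discussed next comes in.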

But using distributed, scalable, structured storage systems such as Hypertable and HBase, a site could retain the results of the Hadoop process and provide instantaneous lookups against the large list. That means that as you type, a lookup based on each successive letter can suggest the most frequent completions of those searches. Since a year or two’s worth of searches reveals clear patterns, the suggestions and auto-fill results are usually excellent. “Where Hadoop and MapReduce excel in crunching huge data sets, there is also a need to store a large volume of small data pieces,” explains Ryan Rawson, an HBase contributor and senior software developer at StumbleUpon. “HBase fills this gap with a random read, random write, real-time, small data access system that leverages HDFS for distributed storage.”
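
On the lookup side, a rough sketch using the HBase Java client might look like the following. The table name (“suggestions”), keying rows by the query string itself, and the setRowPrefixFilter/setLimit calls (available in more recent HBase client releases) are assumptions for the example; a production schema would likely fold ranking into the row key rather than relying on plain lexicographic order.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SuggestLookup {

  // Returns up to `limit` stored queries whose row key starts with `prefix`.
  // Assumes a hypothetical table named "suggestions" keyed by the query
  // string and populated from the batch job's (query, count) output.
  public static List<String> suggest(Connection conn, String prefix, int limit)
      throws IOException {
    List<String> matches = new ArrayList<>();
    try (Table table = conn.getTable(TableName.valueOf("suggestions"))) {
      Scan scan = new Scan();
      scan.setRowPrefixFilter(Bytes.toBytes(prefix)); // rows sort lexicographically
      scan.setLimit(limit);
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          matches.add(Bytes.toString(r.getRow()));
        }
      }
    }
    return matches;
  }

  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
      System.out.println(suggest(conn, "had", 10)); // e.g. completions for "had..."
    }
  }
}
```

Each keystroke simply re-runs the scan with a longer prefix, which is exactly the kind of small, random, real-time read Rawson describes.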

Hypertable and HBase accomplish this task as column-oriented stores, a very different approach from that taken by the relational databases often used to keep track of web application data. Both Hypertable and HBase are modeled after Google’s Bigtable, which powers much of Google’s infrastructure. This is a bit different from other well-known key-value stores, like Amazon’s Dynamo and Project Voldemort, which is supported, at least in part, by LinkedIn.
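
For readers unfamiliar with the Bigtable model, the fragment below shows how a single cell is addressed in HBase: by row key, column family and qualifier, with a timestamp kept for each stored version. The table and column names are the same hypothetical ones used in the lookup sketch above.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SuggestionWrite {
  public static void main(String[] args) throws IOException {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("suggestions"))) {
      // A cell is addressed by (row key, column family:qualifier, timestamp),
      // unlike a flat key-value store where a key maps to one opaque value.
      Put put = new Put(Bytes.toBytes("hadoop tutorial"));        // row key: the query itself
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"),   // family "f", qualifier "count"
          Bytes.toBytes(42L));                                    // value: precomputed frequency
      table.put(put);                                             // timestamp defaults to server time
    }
  }
}
```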

“While these solutions will never be a 100 percent replacement for MySQL, they can be important when trying to scale large tables,” Rawson explains further. “When people are thinking of storing over 100 billion distinct rows in a table, and a total data set size of 1.5 terabytes, choices become limited. You can buy larger hardware at diminishing returns, or you can buy expensive, clustered database solutions that quickly break the budget.”

Though similar in concept, Hypertable and HBase use different programming languages, with Hypertable written in C++ and HBase written in Java. This can be a contentious issue for some, as noted in a recent VentureBeat piece about the difference between Google’s implementation of MapReduce, which is written in C++, and Hadoop, the open-source version of MapReduce that’s written in Java. Even though there might be some perceived differences in the performance or efficiency of C++ compared to Java, Java’s developer accessibility and development pace make it a strong choice for certain projects. Indeed, Hadoop’s Java-friendly approach may be one reason it’s developed such a quick and engaged following.

And if the recent growth in usage of Twitter, FriendFeed/Facebook, and mobile applications is any indication, there will be increasing pressure to accelerate the analysis and pattern-matching now carried out by Hadoop to deliver richer web user experiences. It will take time, as even the Hadoop experts are still honing their skills on predictive models, including figuring out how long each process takes and how many hardware and software resources will be needed. In the meantime, column-oriented stores like HBase and Hypertable provide a practical mechanism to get one step closer to real time using Hadoop.


  1. WebDesignExpert.Me Tuesday, September 22, 2009

    I think Hadoop may work well in addition to “NoSQL” databases like Cassandra, Google Bigtable, etc.

  2. Sailesh Krishnamurthy Tuesday, September 22, 2009

    Gary

    You’re absolutely correct that it’s very hard to accomplish anything
    like a true real-time workflow with Hadoop or indeed with any MPP
    system.

    One approach that you point out is to separate the actual analysis of
    data (say using a Hadoop process) from its use (the instantaneous
    lookup).

    While this is certainly a step in the right direction it solves only
    half of the problem and does not attack the fundamental issue of
    accomplishing the analytics itself in a high-throughput low-latency
    fashion. The reason for this is quite simple: MPP systems like Hadoop
    are designed for very high scalability and accomplish this via
    brute-force parallelism. As a result while it’s possible to get very
    high throughput from such systems, the associated overheads (mainly
    communication) are crippling unless amortized over very large batches
    of data.

    At Truviso (http://www.truviso.com) we are pioneering a continuous
    analytics approach that attacks both halves of this problem by
    decoupling the analysis and the use of the data, much as you point
    out with your suggestive search example.

    Sailesh Krishnamurthy
    Truviso

  3. Is Hadoop Champion Cloudera the Next Red Hat? Friday, October 2, 2009

    [...] Desktop. It’s a graphical interface for managing Hadoop, the open-source framework that is catalyzing the data mining renaissance. Cloudera’s Hadoop now works on almost all major cloud platforms: Amazon Web Services, [...]

  4. NoSQL meetup, report | Zemanta Ltd. Wednesday, November 4, 2009

    [...] Getting Closer to Real Time With Hadoop (gigaom.com) [...]

  5. When it Comes to Web Scale Go Cheap, Go Custom or Go Home – GigaOM Sunday, March 14, 2010

    [...] For example, an audience member questioned the panel about any good columnar database stores beyond Hadoop, and Kevin Weil from Twitter explained that there were some closed source options out there, but [...]
