Getting Closer to Real Time With Hadoop


Hadoop, as a pivotal piece of the data mining renaissance, offers the ability to tackle large data sets in ways that weren’t previously feasible due to time and dollar constraints. But Hadoop can’t do everything quite yet, especially when it comes to real-time workflows. Fortunately, a couple of innovative efforts within the Hadoop ecosystem, such as Hypertable and HBase, are filling the gaps, while at the same time providing a glimpse of where Hadoop’s full capabilities might be headed.

Currently Hadoop is used by companies to analyze big data, typically with records and counts in the billions. These staggering numbers often come from Internet applications at scale and the resulting log data. But going through the log data happens in batch jobs outside the standard work flow of the web site, not upon every search. As an example, a search site might track each search query and the terms entered. If it then wants to examine the most popular requests to provide an “assist or suggest” capability, it can use Hadoop to crunch through a year or two of prior searches, resulting in a list of popular combinations and frequency. This multistep process takes time, however, and cannot execute upon every user search.
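The batch job described above boils down to a map-and-reduce count over the query log. A minimal sketch of that logic in plain Python (the log lines and function names are illustrative; a real Hadoop job would distribute these phases across a cluster):

```python
from collections import Counter

def map_phase(log_lines):
    """Emit (query, 1) pairs, the way a Hadoop mapper would."""
    for line in log_lines:
        query = line.strip().lower()
        if query:
            yield query, 1

def reduce_phase(pairs):
    """Sum the counts per query, the way a Hadoop reducer would."""
    counts = Counter()
    for query, n in pairs:
        counts[query] += n
    return counts

# Made-up log data standing in for a year or two of search traffic.
log = ["hadoop tutorial", "hbase", "Hadoop Tutorial", "hypertable"]
top = reduce_phase(map_phase(log)).most_common(2)
```

The point of the article stands out even in the toy version: this is a full pass over the entire log, which is why it runs as an offline batch job rather than on every user keystroke.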

But using distributed, scalable, structured storage models such as Hypertable and HBase, a site could retain the results of the Hadoop process and provide instantaneous lookups against the large list: as you type, a lookup based on each successive letter would suggest the most frequent completions of those searches. Since a year or two worth of searches will return clear patterns, the suggestions and auto-fill results are usually excellent. “Where Hadoop and MapReduce excel in crunching huge data sets, there is also a need to store a large volume of small data pieces,” explains Ryan Rawson, an HBase contributor and senior software developer at StumbleUpon. “HBase fills this gap with a random read, random write, real-time, small data access system that leverages HDFS for distributed storage.”
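The serving-side half of that pipeline is a prefix lookup over the precomputed table. A sketch of the idea, assuming made-up queries and frequencies; a store like HBase would express this as a scan over sorted row keys, which a sorted in-memory list stands in for here:

```python
import bisect

# Hypothetical output of the offline Hadoop job: query -> frequency.
precomputed = {
    "hadoop": 120,
    "hadoop hbase": 45,
    "hadoop tutorial": 80,
    "hypertable": 30,
}
keys = sorted(precomputed)  # sorted keys, as in a Bigtable-style store

def suggest(prefix, limit=3):
    """Collect keys sharing `prefix`, then rank them by frequency."""
    start = bisect.bisect_left(keys, prefix)
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order means no later key can match
        matches.append(key)
    return sorted(matches, key=lambda k: -precomputed[k])[:limit]
```

Because the keys are sorted, the scan touches only the matching range, which is what makes per-keystroke lookups cheap even when the full table is enormous.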

Hypertable and HBase accomplish this task as column-oriented stores, a very different approach from that taken by the relational databases often used to keep track of web application data. Both Hypertable and HBase are modeled after Google’s implementation of Bigtable, which powers much of Google’s infrastructure. This is a bit different from other well-known key-value stores, like Dynamo from Amazon and Project Voldemort, which is supported, at least in part, by LinkedIn.
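The Bigtable-style data model both systems share is, at its core, a sorted map from row key to column family to qualifier to value, rather than a fixed-schema relational row. A minimal sketch (all names and values are illustrative):

```python
# row key -> {column family: {qualifier: value}}
table = {}

def put(row, family, qualifier, value):
    """Write one cell; families/qualifiers are created on demand."""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    """Read one cell, or None if any level of the key is absent."""
    return table.get(row, {}).get(family, {}).get(qualifier)

# Sparse rows are natural: each row holds only the cells it needs.
put("com.example/page1", "anchor", "homepage", "Example")
put("com.example/page1", "contents", "html", "<html>...</html>")
put("com.example/page2", "anchor", "homepage", "Example 2")
```

Unlike a relational table, rows need not share columns at all, which is what lets these stores hold billions of small, heterogeneous records.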

“While these solutions will never be a 100 percent replacement for MySQL, they can be important when trying to scale large tables,” Rawson explains further. “When people are thinking of storing over 100 billion distinct rows in a table, and a total data set size of 1.5 terabytes, choices become limited. You can buy larger hardware at diminishing returns, or you can buy expensive, clustered database solutions that quickly break the budget.”

Though similar in concept, Hypertable and HBase use different programming languages, with Hypertable written in C++ and HBase written in Java. This can be a contentious issue for some, as noted in a recent VentureBeat piece about the difference between Google’s implementation of MapReduce, which is written in C++, and Hadoop, the open-source version of MapReduce that’s written in Java. Even though there might be some perceived differences in the performance or efficiency of C++ compared to Java, Java’s developer accessibility and development pace make it a strong choice for certain projects. Indeed, Hadoop’s Java-friendly approach may be one reason it’s developed such a quick and engaged following.

And if the recent growth in usage of Twitter, FriendFeed/Facebook, and mobile applications is any indication, there will be increasing pressure to accelerate the analysis and pattern-matching now carried out by Hadoop to deliver richer web user experiences. It will take time, as even the Hadoop experts are honing their skills on predictive models, including figuring out how long each process takes and how many hardware/software resources will be needed. In the meantime, column-oriented stores like HBase and Hypertable provide a practical mechanism to get one step closer to real time using Hadoop.


Sailesh Krishnamurthy


You’re absolutely correct that it’s very hard to accomplish anything
like a true real-time workflow with Hadoop, or indeed with any MPP system.

One approach that you point out is to separate the actual analysis of
data (say, using a Hadoop process) from its use (the instantaneous lookups).

While this is certainly a step in the right direction, it solves only
half of the problem and does not attack the fundamental issue of
accomplishing the analytics itself in a high-throughput, low-latency
fashion. The reason for this is quite simple: MPP systems like Hadoop
are designed for very high scalability and accomplish this via
brute-force parallelism. As a result, while it’s possible to get very
high throughput from such systems, the associated overheads (mainly
communication) are crippling unless amortized over very large batches
of data.
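The amortization argument can be made concrete with back-of-the-envelope arithmetic. A sketch, assuming entirely made-up costs for the fixed per-job overhead and the pure per-record work:

```python
OVERHEAD_S = 2.0        # assumed fixed cost per job (startup, shuffle, communication)
PER_RECORD_S = 0.00001  # assumed pure processing cost per record

def seconds_per_record(batch_size):
    """Effective per-record cost when the overhead is paid once per batch."""
    return PER_RECORD_S + OVERHEAD_S / batch_size

single = seconds_per_record(1)            # overhead dominates: unusable for real time
batched = seconds_per_record(10_000_000)  # overhead amortizes away to almost nothing
```

The fixed overhead swamps a single-record job but becomes negligible over millions of records, which is exactly why these systems favor huge batches at the expense of latency.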

At Truviso, we are pioneering a continuous
analytics approach that attacks both halves of this problem by decoupling
the analysis and the use of the data, much as you point out with your
search-suggest example.

