Big data in real time is no fantasy


Big data — as in managing and analyzing large volumes of information — has come a long way in the past couple of years. Among the greatest innovations might be the advent of real-time analytics, which allows the processing of information in real time to enable instantaneous decision-making. Even Hadoop, the set of parallel-processing tools that has become the face of big data, but which has historically been limited to batch processing, is coming along for the ride.

Analytics are nothing new, but Hadoop has made organizations of all types realize they can analyze all their data, and do so using commodity servers with local storage. They can extract valuable business insights from sources like social media comments, web pages and server log files.

Because of its parallel nature and ability to scale across thousands of nodes, Hadoop makes short work of even terabytes of information that might have taken days to process using traditional methods. But not short-enough work for some situations.
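The batch model Hadoop popularized boils down to two phases: a map step that emits key-value pairs from raw records, and a reduce step that aggregates them after a shuffle. A minimal sketch in plain Python (the log format and URL field are hypothetical, purely for illustration):

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, 1) pairs -- e.g. one count per URL seen in a server log."""
    for record in records:
        url = record.split()[0]  # assume the first field is the URL
        yield (url, 1)

def reduce_phase(pairs):
    """Sum values per key, as a Hadoop reducer would after the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

logs = ["/home 200", "/about 200", "/home 404"]
counts = reduce_phase(map_phase(logs))
# counts == {"/home": 2, "/about": 1}
```

In a real cluster, the map and reduce phases each run in parallel across thousands of nodes, which is where the speedup over traditional methods comes from — but the job still only produces answers once the whole batch has been read, which is the latency problem discussed below.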

Yahoo CTO Raymie Stata explained the current state of affairs in a recent article at The Register:

With the paths that go through Hadoop [at Yahoo!], the latency is about fifteen minutes. … [I]t will never be true real-time. It will never be what we call “next click,” where I click and by the time the page loads, the semantic implication of my decision is reflected in the page.

However, thanks to various Hadoop optimizations, complementary technologies and advanced algorithms, real-time analytics are becoming a real possibility. The goal for everyone seeking real-time analytics is to have their services act immediately — and intelligently — on information as it streams into the system.
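Acting on information as it streams in means updating state per event rather than re-running a batch job. A minimal sketch of that pattern — a sliding-window counter that could pick, say, the currently hottest ad — assuming hypothetical event keys and a simple in-memory store:

```python
import time
from collections import deque

class SlidingWindowCounter:
    """Count events per key over the last `window` seconds, updating
    state as each event arrives -- no store-and-process-later step."""

    def __init__(self, window=60.0):
        self.window = window
        self.events = deque()   # (timestamp, key), oldest first
        self.counts = {}

    def add(self, key, now=None):
        now = time.time() if now is None else now
        self.events.append((now, key))
        self.counts[key] = self.counts.get(key, 0) + 1
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self):
        """Return the most frequent key in the current window."""
        return max(self.counts, key=self.counts.get) if self.counts else None
```

Because the answer is maintained incrementally, a lookup like `top()` is available "next click" — the system never waits for a batch to finish.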

Pick a platform

Yahoo itself is working on a couple of real-time analytics projects, including S4, which we’ve profiled here, and MapReduce Online. Appistry and Accenture teamed up late last year to create a product called Cloud MapReduce. DataStax’s Brisk Hadoop distribution analyzes and stores data within the same Cassandra NoSQL database on the same system, so applications can access and serve Hadoop-processed data much faster than using separate storage systems.

This week, a startup called HStreaming launched its eponymous product, which actually is based on Hadoop. Whereas Yahoo is focused on web behavior, HStreaming lays out the following examples in its press release:

Typical examples include location information, sensor data, or log files when the traditional model of store-and-process-later is not fast enough for such data volumes. Companies need to react promptly to sensor readings or analyze web logs as they are generated because that type of information becomes quickly obsolete.

Others are using real-time analysis to make targeted advertising both instantaneous and super-efficient. I spoke yesterday with Eric Wheeler, founder and CEO of 33Across, a marketing platform that lets companies target potential customers based on those companies’ social graphs. Essentially, he explained, “We constantly re-score the brand graph to understand who are the best targets for that ad right now. We use social connections of that brand to know whom we should next target.”

In order to do this, 33Across maintains a “massive Hadoop implementation” complemented by machine-learning and predictive-analytics algorithms that it has developed in-house. Presumably, the data batch-processed by, and stored in, the Hadoop cluster adds context to streaming data as it hits the 33Across system. The more data it has about a brand’s social graph, the better decision it can make on the fly.

Jeff Jonas, chief scientist of IBM’s Entity Analytics division and all-around big-data genius, analogizes this effect to putting together a puzzle. The more pieces you have in place, the easier it is to figure out where the next piece goes. Within the context of IBM’s big data portfolio, for example, Hadoop helps companies learn their past, which helps real-time products such as InfoSphere Streams or Jonas’s Entity Analytics software analyze streaming data more accurately.
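That batch-enriches-streaming pattern can be sketched in a few lines: a nightly batch job produces context (here, a hypothetical affinity-score table — the names and numbers are invented for illustration), and each streaming event is scored against it on arrival:

```python
# Batch-derived context: scores a nightly Hadoop job might have produced.
# (Hypothetical users and affinity values, purely illustrative.)
batch_affinity = {"alice": 0.9, "bob": 0.2, "carol": 0.7}

def score_event(user, event_weight, context=batch_affinity):
    """Combine a streaming signal with batch context: the more history
    we have on a user, the better the on-the-fly decision."""
    prior = context.get(user, 0.5)  # fall back to a neutral prior
    return prior * event_weight

def should_target(user, event_weight, threshold=0.5):
    """Decide, per event, whether this user gets the ad right now."""
    return score_event(user, event_weight) >= threshold

# A click (weight 1.0) from a high-affinity user gets targeted...
assert should_target("alice", 1.0)
# ...while the same click from a low-affinity user does not.
assert not should_target("bob", 1.0)
```

This is the puzzle effect in miniature: the richer the batch-built table, the less often the streaming path falls back to the neutral prior.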

Real-time advertising was also the impetus behind Amazon’s recent partnership with Triggit. Amazon wants to use its data to make money by helping other web sites better target incoming visitors as they browse from site to site. Thanks to Triggit’s predictive algorithms and cookie-analysis system, “Amazon [can] show the right ads to the right users across nine ad exchanges and more than four million websites.”

If this all sounds like high computer science, it is. But the most interesting thing about it might be that it was hardly even possible a few years ago. According to Wheeler, the tools and best practices — and in some cases, the data — weren’t readily available until recently, so the evolution from batch processing to real-time processing has happened quickly.

But we’re only “in the first inning of a doubleheader,” he said, so real-time processing will only get better as data volumes increase and models get more finely tuned.

Image courtesy of Flickr user RL Fantasy Design Studio.

14 Responses to “Big data in real time is no fantasy”

  1. Great article, and thanks for covering the space. I have to respectfully disagree with the premise that real-time analytics is something that is just now becoming real. Nearly every major financial firm relies on proprietary real-time trading (aka low latency) that is driven by real-time streaming analytics of ticker data. My company PatternBuilders has the PatternBuilders Analytics Framework, a streaming analytics framework that we have applied successfully across multiple domains, including ticker data, retail and social media. Microsoft has StreamInsight specifically for this class of problems, as do vendors who are focused on Complex Event Processing.
    Hadoop/MapReduce is a great technology, but it is fundamentally a batch-oriented system, and batch systems are not and IMO will never be “real time” or even close to it. The desire to get away from the batch-oriented nature of Hadoop-style processing is what led Google to move away from MapReduce when it revamped its search architecture (code-named Caffeine).



    • Derrick Harris


      I can’t argue with the fact that banks have been doing real-time analytics for years, but I’ll suggest that the technologies and systems they had in place were hardly widely accessible. As I acknowledge in the post, Hadoop is a batch-oriented technology, which is why many efforts to do real-time don’t use it (at least for the real-time processing part). It’s interesting, though, that Hadoop is the basis of multiple real-time efforts, including within Yahoo. Hadoop or otherwise, the real story is democratization first of big data tools in general, and now, more recently, of real-time capabilities.

      • Derrick,

        I couldn’t agree more that the big story in big data is that it is moving from being the exclusive province of huge, technologically sophisticated companies with piles of cash (Goldman, Facebook, Wal-Mart, etc.) to the rest of the world.

        You are also absolutely right that a lot of smart people are trying to make Hadoop into a real-time/streaming analytics system. It would be great to see a follow-on article on why they chose that approach – I have a huge amount of respect for the Hadoop technology and the community that supports it, but unless I am missing something this is a classic case of the old saying – if your favorite tool is a hammer, then every problem looks like a nail.

        Thanks again for the great article.



  2. Jake Flomenberg

    At Splunk we agree that the term real-time is often abused as it relates to actionable insight. We see that there is a spectrum of processing ranging from instantaneous to batch and the value that a given level of responsiveness affords is determined largely by the use case.

    If you don’t take programmatic action on the data then milliseconds just don’t matter. If this action powers which ad or what screen a user next sees then that’s tremendously valuable. The same goes when you need to respond to a request or risk impacting the service experience as perceived by the customer. However, if the data is simply to update a dynamic display, then it can wait a second or two. A few second lag on a “real-time” leader board probably doesn’t make much difference.

    Another dimension of the problem that we’re focusing on at Splunk is ease of use – how can we enable more real-time Big Data use cases without always having to write code? We try to make it easy to create alerts, power dashboards, and even explore your data in real-time. For some use cases you will definitely need a data guru but for others, executives and analysts should be able to interact and explore real-time data on their own.

  3. I’d have to agree with other comments about what is and isn’t “real-time”. Managing Big Data in real time is all about ingesting high-velocity input streams, giving that data well-organized state, and adding value to support real-time analytics (e.g., in the form of materialized views for leader boards and real-time aggregations). While Hadoop is great for munging through petabytes of history, critical insights can be gained by analyzing data while it’s in its most volatile state. Some have predicted that the new-age data tier will consist of interoperable database engines for handling the full range of Big Data requirements.

    What are examples of fire hose data sources? Tweet streams, sensor-generated data, micro-transactional systems such as those found in online gaming and digital ad exchanges, to name a few.

    Solutions such as VoltDB (disclosure: I work at the company) have emerged specifically to handle the real-time requirements of Big Data. They offer ultra-high throughput and millisecond latency, near-linear scale, relational semantics and built-in interoperability with deep analytic infrastructures such as Hadoop.

      • Hi Vladimir. Your customers and ours seem to have different definitions of “Big Data”.
        Historically, most data came from business transactions. Today, an increasing proportion of data comes from automated and sensor-driven applications. Although it’s true that this data may eventually find its way into a permanent (Exabytes+) datastore, customers often place great value in analyzing it in its initial high velocity state.
        If I’m not mistaken, the point of this article is that customers increasingly do wish to perform real-time (or near-time, or whatever other term applies) analytics on fast-moving data. Others have pointed out that warehouse-style datastores do not necessarily offer the data currency needed for these analytic operations. That’s where other purpose-built solutions like VoltDB can help.
        We (VoltDB) believe the best way to handle many forms of high velocity data is by storing it in main memory using a shared nothing scale-out architecture, where it can be enriched and analyzed. As that data ages, we believe it should be passed off to a long term disk based datastore (e.g., HDFS, OLAP systems, etc.), possibly after further enrichment. Customers should use the best tools for their needs at different stages of the data lifecycle, whether you refer to that lifecycle as Big Data or something else.

    • Derrick Harris


      I’m not certain anyone mentioned 15 minutes as being real time. I quoted Raymie from Yahoo saying 15 minutes is about as fast as they’re doing Hadoop, but it’s not fast enough for certain uses.

  4. This is very interesting. However, I would say that the biggest jump has happened in the popularization of freely available full-text search engine tools among low-budget consumers. For almost nothing, we used open-source tools that allowed us to process huge amounts of data. Like us, many others are using these almost-instantaneous search engines.

  5. Derrick, finally someone writes about big data in terms the big data guys can relate to. We use HBase, but I can say that when pulling down hundreds of thousands of brand conversations based on social affinity, Hadoop didn’t work for us. I agree with Bill on Cloudscale and definitely think that could be a move for some other things we do. Would value your feedback on social intelligence and curation using crowdsourcing to find what content is trending. We’re using live data streams and massive data sets to get to a consolidated view on a per-brand basis. I’m @chasemcmichael – looking forward to connecting.

    • Derrick Harris

      Bill and Jud,

      I hope I didn’t give the impression that the offerings I cited were the only ones available — they’re just some of the ones I’ve personally covered. The space is growing quite fast, actually. Soon, some project or product should become a household name like Hadoop has become.

  6. Great article, Derrick. At Cloudscale we’ve developed the world’s fastest realtime analytics platform for big data, and we’re now deploying it to provide massive competitive advantage for some of the world’s leading global corporations (Fortune 50). Cloudscale is 100x-1000x faster than Hadoop or S4, enabling deep analytics on millions of transactions/events per second. The patent-pending architecture is in-memory, massively parallel, bulk synchronous, and super-fast! It enables business users and IT teams to run continuous realtime analytics on live data streams and massive data sets, and to go from raw data to decision to action in seconds. For those interested in what’s under the hood, the implementation of the Cloudscale realtime engine is C++ and MPI, with smart compression for super-fast node-to-node communications. We offer cloud, cloud cluster or in-house cluster deployment for standard, high, or extreme performance.