9 Comments

Summary:

Attention webscale aficionados, Twitter plans to open source its Hadoop-like real-time data processing tool known as Storm. The social service nabbed the code through its acquisition last month of BackType, and says it’s a better tool for processing streams of data.

iStock_000000072805XSmall

Attention webscale aficionados, Twitter says it is planning to open source Storm, its Hadoop-like real-time data processing tool. In a blog post Thursday, the microblogging network said it plans to release the Storm code on Sept. 19 at the Strange Loop event in St. Louis, Mo.

The question is — does the world need another real-time data processing tool? After all there are many tools like HStreaming (using Hadoop), the open source S4 and StreamBase, but the overall analytics market (if you can call it a market) is already fragmented. The Storm code comes from Twitter’s acquisition of BackType last month and seems to be an effort to get folks comfortable parsing data on Twitter.

The post does an excellent job laying out use cases for Storm and hints at more to come. While the code can deal with distributed nodes and huge amounts of data a la Hadoop or Map Reduce, Storm handles jobs that are “infinite.” It’s not for a data processing job with an end point, it’s good for streams of data and continual processing. From the post by Nathan Marz:

Here’s a recap of the three broad use cases for Storm:

  • Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
  • Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
  • Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.

But wait! There’s more! At the end of the post we are assured that there’s more to Storm than the blog post has even defined, which we can learn more about next month at the Strange Loop event. From the post:

I’ve only scratched the surface on Storm. The “stream” concept at the core of Storm can be taken so much further than what I’ve shown here — I didn’t talk about things like multi-streams, implicit streams, or direct groupings. I showed two of Storm’s main abstractions, spouts and bolts, but I didn’t talk about Storm’s third, and possibly most powerful abstraction, the “state spout”. I didn’t show how you do distributed RPC over Storm, and I didn’t discuss Storm’s awesome automated deploy that lets you create a Storm cluster on EC2 with just the click of a button.

So for those anxious to test out a new method of crunching terabytes of real-time data on the fly, get thee to GitHub! And wait.

  1. This seems like an interesting tool.

    Share
  2. Anand Prakash Thursday, August 4, 2011

    There is no similarity between Storm and Hadoop. Hadoop is a batch processing system and this is a real time data processing system. These two are completely different use-cases. I know that real-time data would be so much better – but given that Storm needed to be backed by a data store which does aggregation on each of the bolts – will make it likely very slow for processing trillions of words.
    Storm would be very good for continuous real time stats on say billions of words.

    Share
    1. I am confused among “will make it likely very slow for processing trillions of words.” and “Storm would be very good for continuous real time stats on say billions of words”. Will Storm be good for real time processing or not?

      Share
      1. He said it wouldn’t work on TRILLIONS, and would work on BILLIONS. This is the answer to your question – it will work in real time if your data sets are on the order of billions, but won’t be fast enough to handle trillions.

        Share
    2. Anh-Tuan GAI Sunday, August 7, 2011

      there is no similarity but maybe the first one rely on the other. take a look at http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems

      Share
  3. ” Hadoop-like real-time data processing tool”

    Hadoop is specifically designed for batch processing and not real time processing.

    Share
    1. That’s a bad job on my part with modifiers and punctuation. There should be a comma between Hadoop-like and real-time. As I explain in the post Hadoop is batch and Storm is streamed data.

      Share
  4. Why would you question if another tool is good for the environment? It’s not a direct replica of any of the others, and this technology space is still incredibly immature. The near-real-time stream processing space, in particular, is quite narrow in terms of options.

    Share

Comments have been disabled for this post