Twitter to open source Hadoop-like tool

Attention webscale aficionados, Twitter says it is planning to open source Storm, its Hadoop-like real-time data processing tool. In a blog post Thursday, the microblogging network said it plans to release the Storm code on Sept. 19 at the Strange Loop event in St. Louis, Mo.

The question is — does the world need another real-time data processing tool? After all, there are already tools such as HStreaming (built on Hadoop), the open source S4 and StreamBase, and the overall analytics market (if you can call it a market) is fragmented. The Storm code comes from Twitter’s acquisition of BackType last month and seems to be an effort to get folks comfortable parsing data on Twitter.

The post does an excellent job laying out use cases for Storm and hints at more to come. While the code can deal with distributed nodes and huge amounts of data à la Hadoop or MapReduce, Storm handles jobs that are “infinite.” It’s not for a data processing job with an end point; it’s good for streams of data and continual processing. From the post by Nathan Marz:

Here’s a recap of the three broad use cases for Storm:

  • Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
  • Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
  • Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
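All three use cases rest on the same dataflow model the post describes: spouts emit a stream of tuples and bolts transform them, with state updated continuously as data arrives. Since Storm's code isn't public until Sept. 19, here is only a rough conceptual sketch in plain Python — the class names (`SentenceSpout`, `SplitBolt`, `CountBolt`) are illustrative inventions, not Storm's actual API:

```python
# Conceptual sketch of the spout/bolt streaming model, in plain Python.
# These classes are illustrative only -- Storm itself is JVM-based and
# its real abstractions (spouts, bolts, topologies) are not shown here.
from collections import Counter


class SentenceSpout:
    """A spout is the source of the stream: it emits tuples."""
    def __init__(self, sentences):
        self.sentences = sentences

    def stream(self):
        for sentence in self.sentences:
            yield sentence


class SplitBolt:
    """A bolt transforms tuples: here, one sentence becomes many words."""
    def process(self, sentence):
        for word in sentence.lower().split():
            yield word


class CountBolt:
    """A stateful bolt: counts are updated continuously, with no end point."""
    def __init__(self):
        self.counts = Counter()

    def process(self, word):
        self.counts[word] += 1


# Wire the pieces into a linear "topology" and push the stream through it.
spout = SentenceSpout(["the quick brown fox", "the lazy dog"])
split, count = SplitBolt(), CountBolt()
for sentence in spout.stream():
    for word in split.process(sentence):
        count.process(word)

print(count.counts["the"])  # running count of "the" so far
```

The point of the sketch is the shape of the computation, not the plumbing: the loop never has to finish, and a reader of `count.counts` sees results in real time — which is exactly what distinguishes this from a batch MapReduce job.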

But wait! There’s more! At the end of the post we are assured that there’s more to Storm than the blog post has even defined, which we can learn more about next month at the Strange Loop event. From the post:

I’ve only scratched the surface on Storm. The “stream” concept at the core of Storm can be taken so much further than what I’ve shown here — I didn’t talk about things like multi-streams, implicit streams, or direct groupings. I showed two of Storm’s main abstractions, spouts and bolts, but I didn’t talk about Storm’s third, and possibly most powerful abstraction, the “state spout”. I didn’t show how you do distributed RPC over Storm, and I didn’t discuss Storm’s awesome automated deploy that lets you create a Storm cluster on EC2 with just the click of a button.

So for those anxious to test out a new method of crunching terabytes of real-time data on the fly, get thee to GitHub! And wait.

9 Responses to “Twitter to open source Hadoop-like tool”

  1. Joseph Heck

    Why would you question if another tool is good for the environment? It’s not a direct replica of any of the others, and this technology space is still incredibly immature. The near-real-time stream processing space, in particular, is quite narrow in terms of options.

  2. Anand Prakash

There is no similarity between Storm and Hadoop. Hadoop is a batch processing system and this is a real-time data processing system. The two serve completely different use cases. I know that real-time data would be so much better — but given that Storm needs to be backed by a data store that does aggregation on each of the bolts, it will likely be very slow for processing trillions of words.
Storm would be very good for continuous real-time stats on, say, billions of words.