
Summary:

Yahoo has open-sourced its S4 project for developing real-time MapReduce applications. As we’ve seen with Google’s new Caffeine infrastructure for its Instant Search features, there is a growing trend of unchaining large-scale data analysis from its batch-processing roots.


Updated: Yahoo has open-sourced its S4 project, a platform for developing real-time MapReduce applications. As we’ve seen with Google’s new Caffeine infrastructure for its Instant Search features, as well as other “NoHadoop” tools, there’s a growing trend of unchaining large-scale data analysis – via MapReduce, in particular – from its batch-processing roots.

Inside Yahoo Labs, S4 is being used for “[a]pplications such as personalization, user feedback, malicious traffic detection, and real-time search.” The project website gives a high-level description of how S4 works:

In S4, we abstract the input data as streams of key-value pairs that arrive asynchronously and are dispatched intelligently to processing nodes that produce data sets of output key-value pairs. In search, for example, the output data sets are made available to the serving system before a user executes her next search query. We use this rapid feedback to adapt the search models based on user intent.
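To make that description concrete, here is a minimal Python sketch of the general idea: key-value events arrive one at a time, are routed by key to a processing node, and each node emits an updated output key-value pair. This is not S4's actual API. The `ProcessingNode` class, the `dispatch` and `run` functions, the hash-based routing, and the sample click events are all assumptions made purely for illustration.

```python
from collections import defaultdict
from zlib import crc32


class ProcessingNode:
    """Hypothetical node that keeps a running total per key."""

    def __init__(self, name):
        self.name = name
        self.counts = defaultdict(int)

    def process(self, key, value):
        # Update the state for this key and return an output key-value pair.
        self.counts[key] += value
        return key, self.counts[key]


def dispatch(key, nodes):
    # Route by a stable hash so the same key always reaches the same node.
    # (Assumed routing scheme; the source only says "dispatched intelligently".)
    return nodes[crc32(key.encode("utf-8")) % len(nodes)]


def run(stream, nodes):
    # Handle events one at a time as they arrive, rather than in a batch.
    for key, value in stream:
        node = dispatch(key, nodes)
        yield node.name, node.process(key, value)


# Made-up sample events: one click per search query.
nodes = [ProcessingNode("node-0"), ProcessingNode("node-1")]
clicks = [("query:hadoop", 1), ("query:s4", 1), ("query:hadoop", 1)]
outputs = list(run(clicks, nodes))
```

Because each output is produced as soon as its event is processed, a serving system could read the running totals between events instead of waiting for a batch job to finish, which is roughly the "rapid feedback" the description refers to.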

The S4 wiki provides more detailed information on the project, and the code is available on GitHub.

S4 should become a hot commodity among MapReduce developers, particularly those using Hadoop. Just as it has done with certain tools developed for Yahoo’s Hadoop distribution, Cloudera seems likely to incorporate S4 into its own Hadoop distribution, which has established itself as a solid choice among enterprise users. Perhaps Karmasphere, which sells a platform for developing Hadoop applications, will take up the S4 cause as well. Either way, S4 represents a free- to low-cost alternative to presently available proprietary real-time processing options, such as multiple IBM InfoSphere products and SAP’s new in-memory HANA appliance.

The recent analytics landgrab illustrates just how hungry customers are to derive insights from their personal data deluges. Churning through streaming data is probably still a ways out for many organizations, but having the tools to actually do it should help catalyze a few efforts. Workloads like those suggested by Yahoo for S4 bring enough value to make it at least worth a try.


Image courtesy of Flickr user bdu.


  1. It’s real time stream processing, not map-reduce.

  2. The Yahoo! labs website is a bit outdated (we’re working on updating it). The official website is http://s4.io

  3. This is interesting stuff, but you should also check out Esper (http://esper.codehaus.org/), an open-source complex event processing system which solves a number of similar problems that people sometimes mistakenly attempt to use MapReduce to address.

  4. Thanks Anish for the link…

  5. [...] movement in the realtime NoSQL world. As GigaOm reports in Yahoo Open-Sources Real-Time MapReduce, Yahoo! is the first to release a large scale implementation of a more realtime oriented NoSQL [...]

  6. [...] terms of scope, Cloud MapReduce appears to be in the same vein as the S4 project that Yahoo open-sourced last week. Once described as “real-time MapReduce,” the project web site now describes S4 as a [...]

  7. [...] Hadoop/MapReduce-like tools (Microsoft (Dryad); Appistry (Cloud MapReduce and CloudIQ Storage Hadoop Edition); Yahoo (S4)) [...]

  8. #yahoo #hadoop #mapreduce #realtime #s4 http://t.co/BlUn5cO3

