Cloud MapReduce Targets Big Data in Real Time

Cloud application-platform provider Appistry has teamed with Accenture (s acn) to develop an on-premise implementation of Accenture’s existing Amazon (s amzn) EC2-focused Cloud MapReduce product. There are two particularly noteworthy aspects of this product:

  1. Cloud MapReduce is focused on real-time analysis of streaming data
  2. Appistry customers now have access to an entirely distributed Hadoop alternative. Earlier this year, the company released its Hadoop Distributed File System (HDFS) alternative called CloudIQ Storage Hadoop Edition.

As I explained in a post on that product, Appistry’s primary goal in developing these products is to improve performance and reliability by eliminating single points of failure. In HDFS, that’s the NameNode; in the Hadoop MapReduce engine, that’s the JobTracker. Running atop Appistry’s CloudIQ platform, Cloud MapReduce takes advantage of a peer-to-peer architecture in which these concerns are largely ameliorated.

Appistry’s Sam Charrington said those were the same issues that led Accenture to develop Cloud MapReduce in the first place, as its customers wanted higher reliability and performance for mission-critical jobs. As it turns out, however, certain users in intelligence and defense weren’t too keen on the cloud-based model. Jointly developed and distributed by both companies, on-premise Cloud MapReduce keeps the same focus on bleeding-edge customers in intelligence, defense and financial services.

Then there are the real-time capabilities. Cloud MapReduce utilizes a streaming API that frees it from the batch-processing boundaries typically associated with MapReduce. As Charrington explained, Hadoop’s popularity has shaped many connotations of MapReduce, but “the algorithm can be applied much more broadly.” Cloud MapReduce also leverages existing CloudIQ capabilities, such as Fabric Accessible Memory, a form of in-memory caching to speed data processing. “It’s not a competitor to Hadoop,” he added, “so much as an alternative to other approaches for processing data streams [such as IBM (s ibm) InfoSphere Streams].” In fact, Appistry retains partnerships within the Hadoop ecosystem so that customers have a choice of options depending on their applications.
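To make the batch-versus-streaming distinction concrete, here is a minimal Python sketch of the general idea: instead of waiting for a complete data set, the map step runs on each record as it arrives and the reduce step folds the result into running totals. This is not Cloud MapReduce's actual API, which the post does not describe; the function names `map_record` and `streaming_word_count` and the sample records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_record(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one incoming record."""
    for word in line.lower().split():
        yield word, 1


def streaming_word_count(records: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Fold each record into running totals as it arrives.

    A snapshot of the totals is yielded after every record, so results
    are available before the stream ends (unlike a batch job).
    """
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        for key, value in map_record(record):
            totals[key] += value  # incremental reduce: update the running sum
        yield dict(totals)


if __name__ == "__main__":
    # Hypothetical stream; in practice this could be any iterator or socket feed.
    stream = ["fraud alert", "search query", "fraud alert again"]
    for snapshot in streaming_word_count(stream):
        print(snapshot)
```

Because `records` can be any iterable, including an unbounded generator, the same loop works whether the input is a finite file or a live feed. A real distributed system would also have to partition keys across nodes and handle failures, which this single-process sketch ignores.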

In terms of scope, Cloud MapReduce appears to be in the same vein as the S4 project that Yahoo open-sourced last week. Once described as “real-time MapReduce,” the project website now describes S4 as a “distributed stream computing platform” that “fills the gap between complex proprietary systems and batch-oriented open source computing platforms.” According to a research paper (PDF), S4 was inspired by MapReduce but more closely resembles the Actors model. Like Cloud MapReduce, S4 is wholly decentralized to improve reliability and performance.

Both Cloud MapReduce and S4 should catch on (S4 likely sooner because it’s an open source project, not a paid product), but it might take time. In the case of Cloud MapReduce, many organizations with Big Data problems are still experimenting with Hadoop for batch-processing, and might not be ready to take on writing parallel-processing applications for real-time data. Even Charrington acknowledges that Appistry’s two products might be unnecessary for Hadoop experimentation or R&D projects, but are designed for mission-critical production applications that require real-time analysis. And there aren’t too many of those around right now.

Aside from relatively low-hanging fruit like fraud detection and instant search, it will be fascinating to see the applications for these types of technologies once organizations are able to wrap their minds around the full scope of their data situations. You can bet social-media analysis will be an early priority, but that’s just the tip of the iceberg. Just ask IBM, which is beating the real-time drum with its Smarter Planet initiative.

Image courtesy of Flickr user jpctalbot.

One Response to “Cloud MapReduce Targets Big Data in Real Time”

  1. Great article Derrick, and very timely. Over on Quora, there’s a closely related discussion going on about realtime stream processing with Cloudscale, Yahoo S4 and Cloudera Flume.

    Cloudscale has just posted some new “Cloud Speed Records” for realtime stream processing (single stream, 2 million rows per second, concurrent analytics on one million Facebook_User_IDs, 8 AWS cluster nodes etc.). It would be great if the Appistry-Accenture guys could post some ballpark numbers for comparison, and any of the current S4 users too, if they have any results yet. As they say, “You can’t improve what you can’t measure.”

    There’s also another Quora discussion that talks about how S4 came to be called a Realtime variant of MapReduce, despite having no technical connection to MapReduce or Hadoop.

    As Anish Nair (Yahoo S4 developer) says in the comments there, “We definitely didn’t start out with building real-time MapReduce. When folks started comparing it with MapReduce and pointing out similarities, we spent some time thinking about it. We concluded that while there are vague similarities, it’s not the right way to think about it. The Actors paradigm is a much closer fit.”

    And I replied:

    “Of course, there’s now a whole worldwide army of people out there (re)tweeting away to their friends like crazy that MapReduce is no longer ultra-batch, but now Realtime. But hey, that’s their problem, not yours. If they actually read the introductory S4 docs your team provided, it’s very clear that it’s not Realtime MapReduce.”

    Do the Appistry-Accenture guys believe their solution IS a Realtime MapReduce? There are a variety of serious academic research projects at Berkeley (HOP), Brown (C-MR) and elsewhere that are working to overcome the challenges posed by the ultra-batch architecture of MapReduce/Hadoop in making it usable for realtime.

    If the Appistry-Accenture product really does solve the kinds of issues that HOP, C-MR etc. have been grappling with, making that research work obsolete, that would of course be very interesting. Any clarification would be helpful.