Cloud MapReduce Targets Big Data in Real Time


Cloud application-platform provider Appistry has teamed with Accenture to develop an on-premise implementation of Accenture’s existing Amazon EC2-focused Cloud MapReduce product. There are two particularly noteworthy aspects of this product:

  1. Cloud MapReduce is focused on real-time analysis of streaming data.
  2. Appistry customers now have access to an entirely distributed Hadoop alternative. Earlier this year, the company released its Hadoop Distributed File System (HDFS) alternative called CloudIQ Storage Hadoop Edition.

As I explained in a post on that product, Appistry’s primary goal in developing these products is to improve performance and reliability by eliminating single points of failure. In HDFS, that’s the NameNode; in the Hadoop MapReduce engine, that’s the JobTracker. Running atop Appistry’s CloudIQ platform, Cloud MapReduce takes advantage of a peer-to-peer architecture in which these concerns are largely ameliorated.

Appistry’s Sam Charrington said those were the same issues that led Accenture to develop Cloud MapReduce in the first place, as its customers wanted higher reliability and performance for mission-critical jobs. As it turns out, however, certain users in intelligence and defense weren’t too keen on the cloud-based model. Jointly developed and distributed by both companies, on-premise Cloud MapReduce keeps the same focus on bleeding-edge customers in intelligence, defense and financial services.

Then there are the real-time capabilities. Cloud MapReduce uses a streaming API that frees it from the batch-processing boundaries typically associated with MapReduce. As Charrington explained, Hadoop’s popularity has shaped many connotations of MapReduce, but “the algorithm can be applied much more broadly.” Cloud MapReduce also leverages existing CloudIQ capabilities, such as Fabric Accessible Memory, a form of in-memory caching to speed data processing. “It’s not a competitor to Hadoop,” he added, “so much as an alternative to other approaches for processing data streams [such as IBM InfoSphere Streams].” In fact, Appistry retains partnerships within the Hadoop ecosystem so that customers have a choice of options depending on their applications.
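To make the batch-versus-streaming distinction concrete, here is a minimal, hypothetical sketch (not Appistry’s actual API) of MapReduce-style word counting applied to an unbounded stream: the reduce state is updated incrementally as each event arrives, rather than after a complete batch has been collected.

```python
from collections import defaultdict

def map_event(event):
    """Map step: emit (key, 1) pairs for each word in one stream event."""
    for word in event.split():
        yield (word, 1)

class StreamingReducer:
    """Reduce step maintained incrementally: running counts are refreshed
    per event, instead of after a full batch is read -- the core idea
    behind a streaming take on MapReduce."""
    def __init__(self):
        self.counts = defaultdict(int)

    def consume(self, event):
        for key, value in map_event(event):
            self.counts[key] += value
        return dict(self.counts)  # current view of the aggregate

# A batch job reports only once all input is consumed; here every
# event immediately updates the result.
reducer = StreamingReducer()
print(reducer.consume("big data"))
print(reducer.consume("big data real time"))
```

The point of the sketch is simply that the map and reduce functions themselves are agnostic to batch boundaries; only the execution model differs.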

In terms of scope, Cloud MapReduce appears to be in the same vein as the S4 project that Yahoo open-sourced last week. Once described as “real-time MapReduce,” the project website now describes S4 as a “distributed stream computing platform” that “fills the gap between complex proprietary systems and batch-oriented open source computing platforms.” According to a research paper (PDF), S4 was inspired by MapReduce but more closely resembles the Actors model. Like Cloud MapReduce, S4 is wholly decentralized to improve reliability and performance.
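For readers unfamiliar with the Actors model the S4 paper references, here is a rough, hypothetical illustration (not S4’s actual API): each “processing element” privately owns its state and reacts to keyed events one at a time, and events are routed by key with no central coordinator.

```python
class ProcessingElement:
    """Toy actor-style processing element: it holds its own state and
    reacts to incoming events, with no shared memory or master node."""
    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self, event):
        self.count += 1

class Stream:
    """Routes each keyed event to the PE responsible for that key,
    creating PEs on demand -- loosely how a decentralized stream
    platform partitions work across elements."""
    def __init__(self):
        self.pes = {}

    def emit(self, key, event):
        pe = self.pes.setdefault(key, ProcessingElement(key))
        pe.on_event(event)

stream = Stream()
for word in "to be or not to be".split():
    stream.emit(word, word)
print(stream.pes["to"].count)  # each key's PE keeps its own running state
```

Because no single element coordinates the others, losing one processing element affects only its keys, which is the reliability argument both S4 and Cloud MapReduce make against master-node designs.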

Both Cloud MapReduce and S4 should catch on (S4 likely sooner because it’s an open source project, not a paid product), but it might take time. In the case of Cloud MapReduce, many organizations with Big Data problems are still experimenting with Hadoop for batch processing, and might not be ready to take on writing parallel-processing applications for real-time data. Even Charrington acknowledges that Appistry’s two products might be unnecessary for Hadoop experimentation or R&D projects; they are designed for mission-critical production applications that require real-time analysis. And there aren’t too many of those around right now.

Aside from relatively low-hanging fruit like fraud detection and instant search, it will be fascinating to see the applications for these types of technologies once organizations are able to wrap their minds around the full scope of their data situations. You can bet social-media analysis will be an early priority, but that’s just the tip of the iceberg. Just ask IBM, which is beating the real-time drum with its Smarter Planet initiative.

Image courtesy of Flickr user jpctalbot.

