
Beyond Hadoop: Next-Generation Big Data Architectures


After 25 years of dominance, relational databases and SQL have in recent years come under fire from the growing “NoSQL movement.” A key element of this movement is Hadoop, the open-source clone of Google’s internal MapReduce system. Whether it’s interpreted as “No SQL” or “Not Only SQL,” the message has been clear: If you have big data challenges, then your programming tool of choice should be Hadoop.

The only problem with this story is that the people who really do have cutting edge performance and scalability requirements today have already moved on from the Hadoop model. A few have moved back to SQL, but the much more significant trend is that, having come to realize the capabilities and limitations of MapReduce and Hadoop, a whole raft of radically new post-Hadoop architectures are now being developed that are, in most cases, orders of magnitude faster at scale than Hadoop. We are now at the start of a new “NoHadoop” era, as companies increasingly realize that big data requires “Not Only Hadoop.”

Simple batch processing tools like MapReduce and Hadoop are just not powerful enough in any one of the dimensions of the big data space that really matters. Sure, Hadoop is great for simple batch processing tasks that are “embarrassingly parallel”, but most of the difficult big data tasks confronting companies today are much more complex than that. They can involve complex joins, ACID requirements, real-time requirements, supercomputing algorithms, graph computing, interactive analysis, or the need for continuous incremental updates. In each case, Hadoop is unable to provide anything close to the levels of performance required. Fortunately, however, in each case there now exist next-generation big data architectures that can provide that required scale and performance. Over the next couple of years, these architectures will break out into the mainstream.
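To make the batch model concrete, here is a single-process sketch of the MapReduce pattern the article describes — illustrative only, not Hadoop’s actual Java API. Each document is mapped independently (the “embarrassingly parallel” part), and a reduce step aggregates after an implicit shuffle:

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit (word, 1) pairs, independently per document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # Reducer: group by key and sum -- the implicit shuffle/group step
    # is what real MapReduce systems distribute across machines.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big graphs", "big data"]
print(reduce_phase(map_phase(docs)))  # {'big': 3, 'data': 2, 'graphs': 1}
```

Everything beyond this shape — joins, iteration, low latency, incremental updates — is where the architectures below come in.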

Here is a brief overview of the current NoHadoop or post-Hadoop space. In each case, the next-gen architecture beats MapReduce/Hadoop by anything from 10x to 10,000x in terms of performance at scale.

SQL. After 25 years, it may seem odd to call SQL next-gen, but it is! There’s currently a tremendous amount of innovation going on around SQL from companies like VoltDB, Clustrix and others. If you need to handle complex joins, or have ACID requirements, SQL is still the way to go. Applications: Complex business queries, online transaction processing.
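The two strengths named above — joins and ACID transactions — are exactly what MapReduce handles awkwardly. A minimal sketch using Python’s built-in sqlite3 (only a stand-in to illustrate the ideas, not one of the products named above):

```python
import sqlite3

# In-memory toy database with two related tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE accounts(id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER);
    CREATE TABLE orders(id INTEGER PRIMARY KEY, account_id INTEGER, amount INTEGER);
    INSERT INTO accounts VALUES (1, 'alice', 100), (2, 'bob', 50);
    INSERT INTO orders VALUES (1, 1, 30), (2, 1, 20), (3, 2, 10);
""")

# A join plus aggregation is one declarative statement:
rows = con.execute("""
    SELECT a.owner, SUM(o.amount)
    FROM accounts a JOIN orders o ON o.account_id = a.id
    GROUP BY a.owner ORDER BY a.owner
""").fetchall()
print(rows)  # [('alice', 50), ('bob', 10)]

# ACID: both updates commit together or not at all
# (the connection context manager wraps them in one transaction).
with con:
    con.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    con.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
```

Expressing the same join as a hand-written MapReduce job means tagging records by table, shuffling on the join key, and re-joining in the reducer — and there is no transactional guarantee at all.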

Cloudscale. [McColl is the CEO of Cloudscale. See his bio below.] For realtime analytics on big data, it’s essential to break free from the constraints of batch processing. For example, if you’re looking to continuously analyze a stream of events at a rate of one million events per second per server, and deliver results with a maximum latency of five seconds between data in and analytics out, then you need a real-time data flow architecture. The Cloudscale architecture provides this kind of realtime big data analytics, with latency that is up to 10,000X faster than batch processing systems such as Hadoop. Applications: Algorithmic trading, fraud detection, mobile advertising, location services, marketing intelligence.
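The article doesn’t describe Cloudscale’s internals, so as a generic illustration only: the defining difference from batch is that results update as each event arrives, rather than after a full re-scan. A toy sliding-window counter, assuming events arrive as (timestamp, key) pairs:

```python
from collections import deque, Counter

class WindowCounter:
    """Toy streaming aggregate: event counts per key over the last
    `window` seconds, maintained incrementally per arriving event."""
    def __init__(self, window=5.0):
        self.window = window
        self.events = deque()          # (timestamp, key), oldest first
        self.counts = Counter()

    def add(self, ts, key):
        self.events.append((ts, key))
        self.counts[key] += 1
        # Evict events that fell out of the window: the analytics stay
        # current without ever re-scanning history, unlike a batch job.
        while self.events and self.events[0][0] < ts - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1

w = WindowCounter(window=5.0)
for ts, key in [(0.0, "click"), (1.0, "click"), (6.0, "buy")]:
    w.add(ts, key)
print(w.counts["click"], w.counts["buy"])  # 1 1
```

A real data-flow engine partitions such operators across servers and pipelines events between them; the point here is only the incremental, per-event update model.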

MPI and BSP. Many supercomputing applications require complex algorithms on big data, in which processors communicate directly at very high speed in order to deliver performance at scale. Parallel programming tools such as MPI and BSP are necessary for this kind of high performance supercomputing. Applications: Modelling and simulation, fluid dynamics.
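BSP structures a computation as supersteps: every processor computes locally, exchanges messages, then waits at a barrier before the next round. A toy sequential simulation of that model (real BSP/MPI runs across machines; this just shows the superstep discipline):

```python
def bsp_run(states, compute, supersteps):
    """Toy BSP engine: each superstep, every processor computes on its
    state plus last superstep's inbox; messages are delivered at a barrier."""
    inboxes = [[] for _ in states]
    for _ in range(supersteps):
        outboxes = [[] for _ in states]
        for p, state in enumerate(states):
            states[p], sends = compute(p, state, inboxes[p])
            for dest, msg in sends:
                outboxes[dest].append(msg)
        inboxes = outboxes   # barrier: all messages delivered together

# Example: 4 processors in a ring; each keeps the max value it has seen
# and forwards it right. After enough supersteps all hold the global max.
def keep_max(p, state, inbox):
    new = max([state] + inbox)
    return new, [((p + 1) % 4, new)]

states = [3, 1, 4, 1]
bsp_run(states, keep_max, supersteps=4)
print(states)  # [4, 4, 4, 4]
```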

Pregel. Need to analyse a complex social graph? Need to analyse the web? It’s not just big data, it’s big graphs! We’re rapidly moving to a world where the ability to analyse very-large-scale dynamic graphs (billions of nodes, trillions of edges) is becoming critical for some important applications. Google’s Pregel architecture uses a BSP model to enable highly efficient graph computing at enormous scale. Applications: Web algorithms, social graph algorithms, location graphs, learning and discovery, network optimisation, internet of things.
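Pregel layers a “think like a vertex” model on top of BSP: each vertex runs a small compute function per superstep, sends messages along its edges, and votes to halt when it has nothing to do. A toy single-source shortest-paths in that style (a hypothetical mini-framework, not Google’s actual API):

```python
import math

def pregel_sssp(graph, source):
    """Toy Pregel-style single-source shortest paths.
    graph: {vertex: [(neighbour, edge_weight), ...]}"""
    dist = {v: math.inf for v in graph}
    inbox = {v: [] for v in graph}
    inbox[source] = [0]                       # kick off the source vertex
    while any(inbox.values()):                # each pass = one BSP superstep
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():     # every active vertex computes
            if messages and min(messages) < dist[v]:
                dist[v] = min(messages)       # improved: update and notify
                for u, w in graph[v]:
                    outbox[u].append(dist[v] + w)
        inbox = outbox                        # barrier: deliver all messages
    return dist                               # no messages left: all halted

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(pregel_sssp(g, "a"))  # {'a': 0, 'b': 1, 'c': 3}
```

The win over MapReduce is that graph state stays in memory on the workers across supersteps, instead of being serialized to disk between every iteration of a job chain.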

Dremel. Need to interact with web-scale data sets? Google’s Dremel architecture is designed to support interactive, ad hoc queries over trillion-row tables in seconds! It executes queries natively without translating them into MapReduce jobs. Dremel has been in production since 2006 and has thousands of users within Google. Applications: Data exploration, customer support, data center monitoring.
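Part of what makes this interactive speed possible is columnar storage: a query touching two fields reads only those two columns, not whole records. A greatly simplified illustration of the layout difference (Dremel’s actual nested-column encoding is far more sophisticated):

```python
# Row store: each record kept together; any query scans whole records.
rows = [
    {"user": "alice", "country": "US", "latency_ms": 120},
    {"user": "bob",   "country": "DE", "latency_ms": 80},
    {"user": "carol", "country": "US", "latency_ms": 95},
]

# Column store: one array per field.
cols = {
    "user":       ["alice", "bob", "carol"],
    "country":    ["US", "DE", "US"],
    "latency_ms": [120, 80, 95],
}

# "Average latency for US users" touches only 2 of the 3 columns --
# at web scale, the difference between scanning terabytes and gigabytes.
us = [i for i, c in enumerate(cols["country"]) if c == "US"]
avg = sum(cols["latency_ms"][i] for i in us) / len(us)
print(avg)  # 107.5
```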

Percolator (Caffeine). If you need to incrementally update the analytics on a massive data set continuously, as Google now has to do on its index of the web, then an architecture like Percolator (Caffeine) beats Hadoop easily; Google Instant just wouldn’t be possible without it. “Because the index can be updated incrementally, the median document moves through Caffeine over 100 times faster than it moved through the company’s old MapReduce setup.” Applications: Real time search.
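The contrast with batch is easy to sketch in the abstract (this is a generic illustration, not Percolator’s actual observer/transaction API): a batch system rebuilds the whole result from scratch, while an incremental system updates only the parts each new input touches. A toy incrementally-maintained inverted index:

```python
from collections import defaultdict

class IncrementalIndex:
    """Toy incrementally-maintained inverted index: each new or changed
    document updates only its own postings, instead of rebuilding the
    entire index in a batch job."""
    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of doc ids
        self.docs = {}                     # doc id -> its current words

    def upsert(self, doc_id, text):
        new = set(text.split())
        for word in self.docs.get(doc_id, set()) - new:
            self.postings[word].discard(doc_id)   # drop stale postings
        for word in new:
            self.postings[word].add(doc_id)       # add fresh postings
        self.docs[doc_id] = new

idx = IncrementalIndex()
idx.upsert("d1", "big data")
idx.upsert("d2", "big graphs")
idx.upsert("d1", "real time data")   # only d1's postings change
print(sorted(idx.postings["big"]))   # ['d2']
```

Re-crawled pages become visible in search as soon as their own postings are rewritten — no waiting for the next full batch rebuild, which is the effect the Caffeine quote above describes.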

The fact that Hadoop is freely available to everyone means it will remain an important entry point to the world of big data for many people. However, as the performance demands for big data apps continue to increase, we will find that these new, more powerful forms of big data architecture are required in many cases.

Bill McColl is the founder and CEO of Cloudscale Inc. and a former professor of Computer Science, Head of the Parallel Computing Research Center, and Chairman of the Computer Science Faculty at Oxford University.


18 Responses to “Beyond Hadoop: Next-Generation Big Data Architectures”

  1. There are a plethora of solutions for “Big” or “Scalable” products. To my knowledge, there are only a few fundamental technologies, or technology classes.

    It would be good for products to start building on standard building-blocks where they can. Then the roadmap would be clearer for CIOs and architects, and products could start building on lower layers rather than recreating them.

    The following are the fundamental technologies at the lower layer of these products, and how they are evolving to meet demand.

    > Data store technologies. This is where Big Data comes into play, e.g. in Hadoop and BigTable – where data is often referenced in very big blocks. SQL vendors are fighting back with clustering, e.g. MySQL Cluster.

    > Batch analytics processing, feeding of databases, Hadoop (as in Hive) or data warehouses.

    > “High DPTS” is my term for applications that have large multipliers for Data x Processing x Tx/second x Scalable. [You could argue that scalability is a consequence of high DPT … but it doesn’t do any harm to say ‘scalable’ and emphasise the problems.] This category used to be called XTP … not really sure that term ever caught on.

    > Underlying High DPTS are Grid products – GigaSpaces, Microsoft Velocity etc. Their fundamental features are Scalability, Partitioning and Data Affinity, Map Reduce and Hot Standby. These products are the natural basis for applications (Web commerce, SOA) and real-time data analytics.

    > For the subset of these applications that need write transactions (as in, ACID), the Grid must have a transactional capability that manages ACID across the grid and down to the data stores (e.g., CloudTran).

    > Real-time analytics will need some query language to take account of the movement of data through its grid.

  2. You get what you pay for; as simple as that. If NoSQL serves you well and there’s no need for relational and/or transaction modeling, then go for it.
    Practical experience shows this is usually not the case, and indeed we see many folks go back to RDBMS. For those who never left… or for those who are already there – MySQL backend applications – we’re there to support them. A SQL cloud DB that is elastically scalable and highly available. Don’t take my word for it; check it out on our beta

  3. Hey guys. Thanks for the comments.

    I was surprised that some of you thought it controversial that we were moving into a Post-Hadoop era. I guess it depends which world you live in. I write from the perspective of the Silicon Valley startup world. In that world the game is moving on, since MapReduce (a 7 year old programming model) is now available from any one of the big established vendors: Amazon Elastic MapReduce for cloud MR, IBM BigInsights for spreadsheet-fronted Hadoop and for DB2 with Hadoop tables, Microsoft’s Dryad being prepared for commercial launch, Oracle integration with Hadoop. It doesn’t get more mainstream than Amazon, IBM, Microsoft and Oracle! As I say in the article, the new wave of startup innovation in big data architectures is going to be around the many areas where MR/Hadoop is nowhere near enough: in-memory, transactions, realtime, graph, exascale fault-tolerant message passing and RDMA, interactive, incremental.

    As I also said in the article, Hadoop is, and will remain, a small but important part of the overall big data ecosystem. Its very basic coarse-grain style provides easy fault tolerance for the simple types of parallel apps that some organizations have today. And it comes in a free version too, which is great! However, as soon as performance (throughput and/or latency) begins to matter, as usual you’re in a “One Size Doesn’t Fit All” situation and you need to look for the right kind of architecture.


  4. Stephen T

    Really hilarious to see the comments on this article from the Hadoop guys about MPI and BSP. McColl invented the BSP approach to parallel programming along with Leslie Valiant of Harvard back in the 1990s. He also led the international team that defined the standard library for BSP programming, as a simpler and faster alternative to MPI, and his team built all the programming tools. His 1996 paper “Questions and Answers about BSP” is the standard introduction to BSP software. The Wikipedia page is based on that. There’s even a link there to an ancient (1998) web page by McColl on BSP which has tons of papers, and a link to a Cover Story in New Scientist on McColl, Valiant and BSP.

    The fact that Google has recently rediscovered BSP and has used it in an exciting way to build Pregel, which now accounts for more than 20% of all big data computing at Google, is further validation of how important this approach is today.

    • Andrew Purtell

      Speaking of MPI as a “next generation” technology beyond Hadoop is inverting history. But implied in your comment (I think) is that this is some kind of competition. I don’t get it. MPI and MapReduce are very different tools that solve different and at least partially exclusive sets of problems. I go back to the “horses for courses” comment made by the first poster.

    • Patrick Angeles

      And yet you haven’t addressed Steve L’s assertion that MPI doesn’t handle failure well. And none of the Hadoop guys actually put down BSP, but somehow it got non-sequitur’d into a plug for the author.

      Speaking of non-sequiturs, how about this one:

      [the] “No SQL” … the message has been clear: If you have big data challenges, then your programming tool of choice should be Hadoop.

      Nobody working closely with Hadoop will:
      1. touch the “NoSQL” movement with a 10 foot pole.
      2. use it as a silver bullet for all “big data challenges”.

      • The reason why it’s the future is that it needs integration at the app/code level. Most existing apps will not be ported…

        SQL s*cks as well. As much as I loved linear algebra, that’s not how the world is organized.

        Another reason there is time is that we need to teach the next generation of developers a different way; that’s what takes the real time, not the technology…

  5. 1. Google are using something nobody else can see; it’s hard to say they’ve moved on from Hadoop, merely evolved their own MR engine.

    2. Nobody in Hadoop-land is going to say you should use Hadoop and friends if you want transactions, ACID, etc. What we do say is “you don’t need to index all the stuff you want to search through later”, and “if you keep some stuff in a distributed filesystem, you make storing PB affordable”

    3. What Hadoop does have is testing at double digit petabyte storage capacity, thousands of servers, each with 6+ HDDs.

    4. MPI. MPI doesn’t handle failure well, which is why most HPC facilities don’t like MPI jobs that take more than 48h to complete; too much risk of an outage. I think MPI is great for some problems, but it’s not a silver bullet either.

    What Hadoop does bring to the table is community and scale. Nobody in the group thinks it’s perfect, but we know what the MapReduce problems are (latency due to the saving of intermediate results to HDD, and the wait for all maps to complete before the reduces begin), and those of the filesystem (the namenode is an SPOF; better checksumming and security are needed, and the latter is trickling out). It’s also designed for a static set of machines; when hosted on on-demand infrastructure you need to integrate the infrastructure operations into your workflow. We know these problems, and people in different companies and some universities are working on them. It’s going to be hard to compete with the community, even if you have better solutions.

    People used to dismiss Linux compared to “real” unix, remember?

  6. Andrew Purtell

    Percolator is built on top of BigTable. In the Hadoop ecosystem, we have HBase as an open source implementation of BigTable, and it seems feasible to build an open equivalent to Percolator on top of the HBase coprocessor framework. Hadoop is not just MapReduce. This article is written as if Hadoop has stood still since 2006.

  7. Good article, although its verbosity could have been map reduced ;-) to simply “it’s a matter of horses for courses” or “Hadoop is not the be-all end-all” (but then again, who said that Hadoop was the be-all end-all? I don’t recall anyone in the Hadoop community stating so).