Welcome to Berkeley: Where Hadoop isn’t nearly fast enough

Tucked within the computer science department at the University of California, Berkeley, there’s an institution called AMPLab that’s making a name for itself by, among other things, essentially rebuilding the Hadoop platform, only faster.

Results for linear regression test

AMPLab’s most well-known product in the big data space, called Spark, is an in-memory parallel processing framework that’s comparable to Hadoop MapReduce except, its creators claim, it is up to 100 times faster. Because it runs in-memory, Spark might be comparable with something like Druid or SAP’s HANA system, too. Spark is the processing engine that powers ClearStory’s next-generation analytics and visualization service.

Like Hive as a data warehouse for Hadoop? Then you’ll love Shark, which is short for “Hive on Spark.”

Even as Hadoop gets more flexible thanks to new features such as YARN, which would technically allow it to run an alternative framework like Spark, AMPLab has its own cluster-management project called Mesos. Twitter is a big fan of Mesos, which is also an Apache Incubator project.

AMPLab’s latest target is the Hadoop Distributed File System, or HDFS. HDFS has long been criticized for availability and speed, so AMPLab created an alternative called Tachyon (hat tip to High Scalability for calling my attention to it). According to the Tachyon homepage, “it offers up to 300 times higher throughput than HDFS, by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed.”
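To make the caching idea concrete, here is a toy sketch in Python of a read-through cache sitting in front of a slower store. It is not Tachyon's API or implementation, and it leaves out lineage entirely; the names `WorkingSetCache`, `slow_read` and `FAKE_STORE` are made up for illustration. It only shows the general pattern of the first reader paying for the slow read while later jobs hit memory.

```python
from typing import Callable, Dict


class WorkingSetCache:
    """Toy read-through cache: keeps file contents in memory after the first read."""

    def __init__(self, read_from_store: Callable[[str], bytes]) -> None:
        # Hypothetical slow backing store (stands in for a disk-based file system).
        self._read_from_store = read_from_store
        self._memory: Dict[str, bytes] = {}
        # Counts how many reads fell through to the slow store.
        self.store_reads = 0

    def read(self, path: str) -> bytes:
        if path not in self._memory:
            # Cache miss: fetch once from the slow store and keep it in memory.
            self.store_reads += 1
            self._memory[path] = self._read_from_store(path)
        return self._memory[path]


# Illustrative stand-in for files held in a disk-backed file system.
FAKE_STORE = {"/data/points.csv": b"1,2\n3,4\n"}


def slow_read(path: str) -> bytes:
    return FAKE_STORE[path]


cache = WorkingSetCache(slow_read)
job_a = cache.read("/data/points.csv")  # first job: goes to the backing store
job_b = cache.read("/data/points.csv")  # second job: served from memory
```

After both reads, `cache.store_reads` is 1, because only the first job touched the backing store. A real system also has to deal with eviction, node failures and writes, none of which this sketch attempts.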

AMPLab isn’t the first to question the cult of HDFS, though. There are numerous commercial options available, and Quantcast built its own open source file system that it claims is faster and more efficient when running at massive scale.

But it’s probably unfair to call AMPLab’s projects competitors to Hadoop. They’re certainly alternatives, but they’re also complementary, as Twitter’s heavy use of Hadoop and Mesos demonstrates. And Spark, Shark, Mesos and Tachyon are all compatible with their peer projects from the Apache Hadoop project.

Really, AMPLab is doing what any research institution does by pushing the limits of the current commercially available software. If it happens to disrupt the status quo, then so be it. For users, though, it’s just providing some new options to play around with as they try to find the right tool for their particular jobs. Its partners and sponsors, including Google, Facebook, Microsoft and Amazon Web Services, certainly have an interest in finding the best possible technology, or creating it if necessary.

The MLBase architecture.

Other related AMPLab projects include PIQL, a SQL-like query language that sits atop a key-value store; MLBase, a system for doing machine learning on distributed systems; Akaros, an operating system for manycore and large SMP systems; and Sparrow, a cluster-scheduling system designed for low-latency computing.

6 Responses to “Welcome to Berkeley: Where Hadoop isn’t nearly fast enough”

  1. sanjay

    You are missing some details here. Tachyon is a layered file system: the memory-based file system is layered on top of another file system, the obvious choice being HDFS. Tachyon writes asynchronously to the lower layer, such as HDFS. These asynchronous writes allow Tachyon to respond quickly to write operations, and they also mean that Tachyon does not have to rerun the computation when a node fails and the contents of its RAM disappear. So Tachyon’s utility is maximized on top of another scalable and reliable file system such as HDFS; hence I see Tachyon as complementary to HDFS.

  2. Great work from my alma mater, but “Big Data” means “bigger than available RAM.” If you can run your entire workload in memory, it’s probably not what most people would consider “big data.”

    • Rob, I have never seen that definition. It typically means so much data that you cannot use “traditional” data tools and processing, at least not cost-effectively. Can you run big data “in memory”? Sure. Just probably not on one server. And Tachyon is not that.