Databricks, the startup focused on commercializing the popular Apache Spark data-processing framework, has used Spark to crush a benchmark record previously set using Hadoop MapReduce. The company says it’s a misconception that Spark is only significantly faster than MapReduce for datasets that can fit in a cluster’s memory, and the fact that this test ran entirely on disk (solid-state drives, to be exact) helps prove it.
Using 206 machines and nearly 6,600 cores on the Amazon Web Services cloud, Databricks completed the Daytona GraySort test, which involves sorting 100 terabytes of data, in just 23 minutes. The previous record was set by Yahoo, which used a 2,100-node Hadoop cluster with more than 50,000 cores to complete the test (albeit on 102.5 terabytes) in 72 minutes. The Databricks benchmark did use solid-state drives, which are the default storage media in the current generation of AWS instances, rather than hard disk drives.
To address early concerns about Spark’s ability to reliably handle large-scale datasets, the Databricks team then ran an unofficial test, completing the same benchmark across a petabyte of data, on 190 machines, in just under 4 hours. “We could have kept going,” Databricks Director of Customer Engagement Arsalan Tavakoli said, though he noted that few companies need to scale beyond that. He added that anyone who still wants proof that Spark can scale further, and on production workloads, should look at Alibaba’s Spark cluster, which spans hundreds of petabytes.
Ali Ghodsi, Databricks’ head of engineering, said the type of shuffle operation this test involves “turns out to be the most expensive, most advanced operations you do in this kind of big data system.” And although benchmarks are often criticized as having limited real-world applicability, he said shuffling is a common operation in production workloads, for example when running joins in Spark SQL or certain machine learning computations.
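For readers unfamiliar with the operation: a shuffle repartitions records by key so that all records sharing a key land on the same machine, which is what makes a distributed join (or a distributed sort) possible. The following is a toy, single-process sketch of the idea, not Databricks’ or Spark’s actual implementation; the function and variable names are invented for illustration.

```python
from collections import defaultdict

def shuffle(records, num_partitions, key_fn=lambda r: r[0]):
    """Group key-value records into partitions by hashing their key.

    In a real cluster each partition would be sent over the network
    to a different machine; here the "machines" are just dict entries.
    """
    partitions = defaultdict(list)
    for record in records:
        partitions[hash(key_fn(record)) % num_partitions].append(record)
    return partitions

# A shuffle is what makes a distributed join work: rows from both
# tables that share a key end up in the same partition, so each
# partition can be joined independently of the others.
users = [("alice", "US"), ("bob", "DE")]
orders = [("alice", 42), ("bob", 7), ("alice", 99)]

user_parts = shuffle(users, 4)
order_parts = shuffle(orders, 4)

joined = []
for p in range(4):
    countries = dict(user_parts[p])
    for key, amount in order_parts[p]:
        if key in countries:
            joined.append((key, countries[key], amount))
# joined now holds (user, country, order amount) triples
```

Because every record with a given key must move to the same place, a shuffle over 100 terabytes means moving roughly the whole dataset across the network, which is why Ghodsi calls it the most expensive step in this kind of system.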
Databricks shared more details about the benchmark, its validity and its methodology in a blog post on Friday.
Proving Spark’s ability to handle large datasets on disk and on cloud resources is critical, as the company expects the majority of its revenue to come from the Databricks Cloud service it announced in June. That service, which includes tools for running Spark processing jobs as well as analyzing the results, is hosted on Amazon Web Services. Ultimately, Ghodsi said, Databricks Cloud won’t just run individual Spark jobs but will plug into users’ applications via API to handle their data-processing needs.
At present, “well over a thousand” users have signed up for the cloud service and the company is in the process of on-boarding them all, Tavakoli said. He added that Databricks doesn’t make any money from its Spark certification programs, and only makes a relatively small amount from support deals in place with partners such as Cloudera and DataStax.
Both Tavakoli and Ghodsi were quick to point out that although the Spark community does think Spark provides a better set of tools for various types of data processing (batch jobs, SQL queries and stream processing among them), it’s still very compatible with Hadoop overall. The GraySort benchmark tests used the Hadoop Distributed File System (HDFS) as the storage layer, and Databricks Cloud supports data stored in either Amazon S3 or HDFS (running on AWS instances). And if you’re running Spark on-premises, Ghodsi said, downloading it as part of a commercial Hadoop distribution is still the best way to do it.
You can learn more about what Spark is and how it came to be in this Structure Show podcast interview with its co-creator and Databricks CTO, Matei Zaharia, from June.
Update: This post was updated at 9:15 a.m. on Monday to clarify that Databricks ran its benchmark on solid-state drives — the default storage medium on new Amazon Web Services instances — and not hard-disk drives.