Summary:

Facebook has open sourced Presto, a SQL engine it says is on average 10 times faster than Hive for running queries across large data sets stored in Hadoop and elsewhere.

Facebook has open sourced Presto, the interactive SQL-on-Hadoop engine the company first discussed in June. Presto is Facebook’s take on Cloudera’s Impala or Google’s Dremel, and it already has some big-name fans in Dropbox and Airbnb.

Technologically, Presto and other query engines of its ilk can be viewed as faster versions of Hive, the data warehouse framework for Hadoop that Facebook created several years ago. Facebook and many other Hadoop users still rely heavily on Hive for batch-processing jobs such as regular reporting, but there has been a demand for something letting users perform ad hoc, exploratory queries on Hadoop data similar to how they might do them using a massively parallel relational database.

Presto is 10 times faster than Hive for most queries, according to Facebook software engineer Martin Traverso in a blog post detailing today’s news.

Source: Facebook

Technologically, Hive and Presto are very different, chiefly because the former relies on MapReduce to carry out its processing and the latter does not. This is by and large the difference that makes Presto suitable for low-latency queries, while the MapReduce-based Hive can take a long time (especially over Facebook’s many petabytes of data) because it must scan everything in the cluster and requires lots of disk writes. Presto also works with a variety of data sources other than the Hadoop Distributed File System, and it uses ANSI SQL rather than Hive’s SQL-like language.

Presto is currently running in numerous Facebook data centers and the company has scaled a single cluster up to 1,000 nodes. More than 1,000 employees run queries on Presto, and they do more than 30,000 of them per day over a petabyte of data. Traverso’s post gives a lot more details about how Presto works and how Facebook plans to improve its speed and functionality in the near term.

A Presto screenshot

However, I think the most interesting part about Presto might be less technological and more about its effects on the Hadoop industry, which is projected to be worth tens of billions of dollars in the next few years. The mere fact that Facebook chose to create a website for the project says something about how seriously the company takes it. And although Facebook has technically open sourced quite a few Hadoop improvements over the years, this is the first since Hive where I’ve noticed such fast (if any) uptake from external companies.

It will be interesting to watch how, if at all, Presto affects adoption of Cloudera’s Impala, Hortonworks’ Stinger project, Pivotal’s HAWQ or any of the other myriad SQL-on-Hadoop engines currently fighting for mindshare. The fact that Presto is open source and ready to use certainly has to be a big draw for some users, and could help it establish a solid user base while other technologies are still coming to be.

Facebook isn’t looking to compete with other projects and doesn’t have a horse in the race from a business perspective (it will likely go along using and improving Presto at its own pace regardless of what happens), but serious uptake could inspire the Hadoop vendors to change their strategies when it comes to the SQL engines they support. Much of the early innovation in Hadoop came from power users (including Yahoo and Facebook) rather than software companies, and it’s possible we haven’t seen the end of that trend.

  1. Also see the Product HSearch from Bizosys Technologies.

  2. This is awesome for those of us developing software! douglaufferthemasterofelearning.com

  3. Hi Derrick,

    you forgot to mention Shark.
    It has been open source for a while and has a benchmark against Hive and Impala: https://amplab.cs.berkeley.edu/benchmark/

    Regards,
    Assaf

  4. I thought hadoop basically was mapreduce. How does presto manage the data if it doesn’t use mapreduce?

    1. Arguably, Hadoop is actually mainly HDFS. Facebook and some others worked around it for SQL stuff, and now YARN makes it possible to do all sorts of processing of data in HDFS.

      1. Facebook doesn’t use YARN ;-)

  5. An interesting insight. Thanks Derrick. Rhetorically, I wonder if NO-SQL will play a larger role in big-data’s future?

  6. what do they need 1000 employees running queries for ???

    1. Maybe they’re looking stuff up and then selling ads … wait, couldn’t a computer do that? ;-)

  7. Very nice, would like to see more.

  8. i see Romania is in good company :)

  9. Your failure to scope the hadoop cluster involved is ridiculous. v2 has YARN which allows non-map-reduce so this is nothing new or fancy from fb. Hive 11 was just released with significant performance upgrades.

    As you say, your real interest is in the business side; since you don’t understand the ecosystem or care to detail it, I can assume your business interest is in stocks?

