Two key members of the Facebook team that created the Hadoop query language Hive are launching their own big data startup called Qubole on Thursday. Co-founders Ashish Thusoo and Joydeep Sen Sarma are taking that experience to provide a managed version of Hive that’s hosted on the Amazon Web Services cloud computing infrastructure.
The founders know whereof they speak when it comes to Hadoop and Hive, a framework and associated query language that sits on top of Hadoop and turns it into something resembling a traditional SQL-based data warehouse. Both worked at Facebook from 2007 through 2011 and held senior positions on the data infrastructure team, where they played major roles in creating Hive and scaling Facebook’s Hadoop cluster to 25 petabytes of compressed data (it’s now up past 30 petabytes). Thusoo also spent a year as the project lead for Hive with the Apache Software Foundation.
Among the inspirations for Qubole was the work that Facebook’s data infrastructure team did to let users across Facebook access the data they needed for their jobs, Thusoo told me. “We want for end users to get access to their data without having to go through intermediaries,” he said.
Still in its limited early access phase, Qubole is targeting fairly sophisticated data analysts and data scientists who know how to write SQL queries and create data pipelines.
To achieve its goal, Qubole provides an abstraction layer between users and the underlying infrastructure so they don’t have to know about Hadoop system management in order to analyze the datasets they have stored in Amazon S3. The service spins up users’ clusters only when a job is started, then automatically scales or contracts them based on the workload, and spins the servers down once the job is done. “We’ve been able to do that by working in the guts of Hadoop,” Thusoo said.
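To make the spin-up, scale and spin-down behavior concrete, here is a minimal sketch of the kind of sizing decision such a service has to make. It is purely illustrative and not Qubole's implementation: the function name `target_nodes`, the `tasks_per_node` capacity and the `max_nodes` cap are all assumptions made for the example.

```python
import math


def target_nodes(pending_tasks: int, tasks_per_node: int = 8, max_nodes: int = 20) -> int:
    """Return how many worker nodes a cluster should have right now.

    Hypothetical policy: no pending work means no cluster at all,
    otherwise run just enough nodes to cover the queue, up to a cap.
    """
    if pending_tasks <= 0:
        # Nothing to run: spin the cluster down entirely.
        return 0
    # Round up so a partially filled node is still provisioned.
    needed = math.ceil(pending_tasks / tasks_per_node)
    # Never grow past the configured ceiling.
    return min(needed, max_nodes)
```

Calling this on every scheduling tick would grow the cluster as the queue fills, shrink it as tasks drain, and release every server once the job finishes.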
Qubole is also optimized to run on cloud-based resources that typically don’t offer performance on a par with their physical counterparts. Thusoo said the product incorporates a specially designed cache system that lets queries run five times faster than traditional Hadoop jobs in the cloud, and users have the option to change the types of instances their jobs are running on if the situation requires. For example, while the default instance type is Amazon EC2’s High-Memory Extra Large, a memory-intensive job might perform better on High-Memory Quadruple Extra Large instances.
Another feature, and something Thusoo said they did a lot at Facebook, is the ability to run a sample query on a small section of data before sending it to the Hadoop cluster. This helps spot problems that could result in wasted time and wasted money if queries with bugs are allowed to run to completion.
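The idea of testing a query on a small slice of data first can be sketched in a few lines. This is an assumption-laden illustration, not how Qubole or Hive does it: the helper `run_with_sample_check` is hypothetical, and a "query" here is simply a Python function applied to a list of rows.

```python
from typing import Any, Callable, Sequence


def run_with_sample_check(
    query: Callable[[Sequence[dict]], Any],
    rows: Sequence[dict],
    sample_size: int = 100,
) -> Any:
    """Run `query` on a small sample first, then on the full dataset.

    If the query fails on the sample, stop immediately so the
    expensive full run never starts.
    """
    sample = rows[:sample_size]
    try:
        query(sample)
    except Exception as exc:
        # A cheap failure here avoids paying for a full run that would also fail.
        raise ValueError(f"query failed on sample: {exc}") from exc
    # The sample run succeeded, so commit to the full dataset.
    return query(rows)


# Example usage with made-up data.
rows = [{"clicks": i} for i in range(1000)]
total = run_with_sample_check(lambda rs: sum(r["clicks"] for r in rs), rows)
```

A query that references a misspelled column, for instance, fails on the first hundred rows rather than after a full pass over the data.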
Although Qubole’s current focus on the Hive interface is unique, it’s far from the only option around for running Hadoop in the cloud. Startups such as Infochimps and Mortar Data are also trying to eliminate the cost and complexity of managing Hadoop clusters with their developer-friendly cloud services. So is Microsoft. Users who prefer system-level control can use Amazon’s Elastic MapReduce service or just deploy a Hadoop distribution on cloud servers.
The bigger picture, I think, might be the growing number of former Facebook employees putting the skills and lessons learned at the social networking giant to use at other companies. Thusoo and Sen Sarma join Cloudera’s Jeff Hammerbacher in the Hadoop world, while a team of former Facebook engineers founded collaboration startup Cove and are now part of team Dropbox. Although execution ultimately wins the day, there are still relatively few people (outside Google and Yahoo) with experience managing data or infrastructure at Facebook’s scale.