
Summary:

Two key members of the Facebook team that created the Hadoop query language Hive are launching their own big data startup called Qubole on Thursday. Qubole is a managed version of Hive that’s hosted on the Amazon Web Services cloud computing infrastructure.


Two key members of the Facebook team that created the Hadoop query language Hive are launching their own big data startup called Qubole on Thursday. Co-founders Ashish Thusoo and Joydeep Sen Sarma are taking that experience to provide a managed version of Hive that’s hosted on the Amazon Web Services cloud computing infrastructure.

Joydeep Sen Sarma

The founders know whereof they speak when it comes to Hadoop and Hive, a framework and associated query language that sits on top of Hadoop and turns it into something resembling a traditional SQL-based data warehouse. Both worked at Facebook from 2007 through 2011 and held senior positions on the data infrastructure team, where they played major roles in creating Hive and scaling Facebook’s Hadoop cluster to 25 petabytes of compressed data (it’s now up past 30 petabytes). Thusoo also spent a year as the project lead for Hive with the Apache Software Foundation.

Ashish Thusoo

Among the inspirations for Qubole was the work that Facebook’s data infrastructure team did to let users across Facebook access the data they needed for their jobs, Thusoo told me. “We want for end users to get access to their data without having to go through intermediaries,” he said.

Still in its limited early access phase, Qubole is targeting fairly sophisticated data analysts and data scientists who know how to write SQL queries and create data pipelines.

To achieve its goal, Qubole provides an abstraction layer between users and the underlying infrastructure so they don’t have to know about Hadoop system management in order to analyze the datasets they have stored in Amazon S3. The service spins up a user’s cluster only when a job is started, automatically expands or contracts it based on the workload, and spins the servers down once the job is done. “We’ve been able to do that by working in the guts of Hadoop,” Thusoo said.
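
For readers who want to picture that lifecycle, here is a minimal Python sketch of the spin-up, resize and spin-down pattern described above. It is not Qubole's code or API: the `ToyCluster` class, the `cluster_for_job` helper, the node limits and the tasks-per-node figure are all hypothetical, chosen only to make the idea concrete.

```python
import math
from contextlib import contextmanager


class ToyCluster:
    """Hypothetical stand-in for a cloud Hadoop cluster (not Qubole's API)."""

    def __init__(self, min_nodes=1, max_nodes=10, tasks_per_node=4):
        self.min_nodes = min_nodes
        self.max_nodes = max_nodes
        self.tasks_per_node = tasks_per_node  # assumed capacity of one node
        self.nodes = 0  # zero nodes: nothing is running yet

    def resize_for(self, pending_tasks):
        """Grow or shrink to fit the workload, within the configured limits."""
        wanted = math.ceil(pending_tasks / self.tasks_per_node)
        self.nodes = max(self.min_nodes, min(self.max_nodes, wanted))
        return self.nodes


@contextmanager
def cluster_for_job(**kwargs):
    cluster = ToyCluster(**kwargs)
    cluster.nodes = cluster.min_nodes  # spin up only when a job starts
    try:
        yield cluster
    finally:
        cluster.nodes = 0  # spin down when the job ends, even if it failed


def run_job(workload):
    """`workload` is a list of pending-task counts seen during the job."""
    sizes = []
    with cluster_for_job() as cluster:
        for pending in workload:
            sizes.append(cluster.resize_for(pending))
    return sizes, cluster.nodes


if __name__ == "__main__":
    print(run_job([2, 30, 100, 8]))
```

The context manager is the design point: the cluster exists only for the duration of the job, so the user never manages servers directly.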

Qubole is also optimized to run on cloud-based resources that typically don’t offer performance on a par with their physical counterparts. Thusoo said the product incorporates a specially designed cache system that lets queries run five times faster than traditional Hadoop jobs in the cloud, and users have the option to change the types of instances their jobs are running on if the situation requires. For example, while the default instance type is Amazon EC2’s High-Memory Extra Large, a memory-intensive job might perform better on High-Memory Quadruple Extra Large instances.

Screenshot of a completed job in Qubole

Another feature, and something Thusoo said they did a lot at Facebook, is the ability to run a sample query on a small section of data before sending it to the Hadoop cluster. This helps spot problems that could result in wasted time and wasted money if queries with bugs are allowed to run to completion.
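
The sample-first idea is easy to illustrate outside of Hive. The Python sketch below is an assumption-laden stand-in, not how Qubole implements the feature: `run_with_sample_first`, `count_clicks` and the row format are hypothetical, and a plain function plays the role of the query.

```python
def run_with_sample_first(query, rows, sample_size=100):
    """Try `query` on a small slice of `rows` before running it on all of them.

    `query` is any function that takes a list of rows; it stands in for a
    Hive query, and `rows` (a list) stands in for the input table.
    """
    sample = rows[:sample_size]
    try:
        query(sample)  # cheap dry run that surfaces bugs early
    except Exception as exc:
        # Stop here so the buggy query never reaches the full dataset.
        return {"ran_full": False, "error": repr(exc)}
    return {"ran_full": True, "result": query(rows)}


def count_clicks(rows):
    """Hypothetical query: count rows whose 'event' field is 'click'."""
    return sum(1 for row in rows if row["event"] == "click")


if __name__ == "__main__":
    good = [{"event": "click"}, {"event": "view"}] * 5000
    bad = [{"evnt": "click"}] * 5000  # misspelled column name
    print(run_with_sample_first(count_clicks, good))
    print(run_with_sample_first(count_clicks, bad))
```

In the second call the misspelled field fails on the 100-row sample, so the full 5,000-row pass is never started.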

Although Qubole’s current focus on the Hive interface is unique, it’s far from the only option around for running Hadoop in the cloud. Startups such as Infochimps and Mortar Data are also trying to eliminate the cost and complexity of managing Hadoop clusters with their developer-friendly cloud services. So is Microsoft. Users who prefer system-level control can use Amazon’s Elastic MapReduce service or just deploy a Hadoop distribution on cloud servers.

The bigger picture, I think, might be the growing number of former Facebook employees putting the skills and lessons learned at the social networking giant to use at other companies. Thusoo and Sen Sarma join Cloudera’s Jeff Hammerbacher in the Hadoop world, while a team of former Facebook engineers founded collaboration startup Cove and are now part of team Dropbox. Although execution ultimately wins the day, there are still relatively few people (outside Google and Yahoo) with experience managing data or infrastructure at Facebook’s scale.

  1. Indians are rocking in the valley.

  2. Farihamaniar
    Mobile structure well change me.

  3. Qubole stands for “Why Say”…nice and catchy name

  4. rocking? u mean creating disaster in the IT world?…that’s right you all better stay in the valley and rock yourself in a cradle to minimize the chaos you guys had been creating in software industry…

    1. Well said.

    2. breakthecrack Tuesday, June 12, 2012

      That’s right “crack”. Innovation is always disruptive, for your kind information, regardless of who does it. Try as you may, you can’t stay at status quo. Change or wither.

    3. Looks like you lost your job to one of the Chaos creators. Sorry about that.

  5. Am I reading the screenshot right? Does a simple query taking 8secs in the database take 252secs via the cloud?

  6. @AT – the 250 seconds includes the time required to spin up the cluster (one time cost). You can see it at the beginning of the query log (‘Provisioning cluster’). One of the things we do to make things easy is to automagically create/scale and destroy clusters.

    Not all queries need a separate cluster per customer (this one is a good example). we will cover those use cases in future (so that this bringup latency can be completely avoided in some cases).

  7. Following on from @AT’s comment, am I reading right that it took 8 seconds to count 10000 records? And there was a single mapper and single reducer? I’m not being critical, just curious…

  8. @PB – Hadoop is dog slow for small interactive jobs. The architecture was designed for large jobs. We are doing a bunch of interesting stuff in this area – but our work is cut out. Hopefully we will make some dramatic improvements in this area as well.
