Summary:

Hadoop is popular and so is cloud computing, so it comes as no surprise that a battle would break out to establish the best place for running Hadoop. Lately, Google has been scoring some victories on the user side.

Google’s Compute Engine cloud doesn’t yet have a Hadoop offering of its own, but the platform is making a name for itself as a viable, if not ideal, place to run big data workloads. The latest validation came on Thursday when Qubole, the Hadoop-as-a-service startup from Hive creators Ashish Thusoo and Joydeep Sen Sarma, announced that users can now choose to run its service on Compute Engine, which the company claims provides better performance than Amazon Web Services.

Specifically, a company spokesperson told me via email, Qubole has seen 2-3x faster startup times for virtual servers using Compute Engine over Amazon EC2 and more reliable performance from Google Cloud Storage than from Amazon S3. We’ll also assume that AWS is the “CloudX” against which Qubole engineer Praveen Seluka benchmarked Compute Engine, some results of which he shared on the Google Cloud Platform blog. Qubole did launch as an AWS-based service, though, and it seems likely many, if not most, users will still choose to run jobs there if only because they already have data stored in S3.

This isn’t the first time a company previously offering a Hadoop service on AWS has made noise about how fast it runs on Google’s cloud. Last year, Hadoop vendor MapR, whose distribution is available in the dropdown menu of choices on Amazon Elastic MapReduce, touted how it was able to break the record for the MinuteSort data-processing benchmark when running on Compute Engine. MapR is currently the only partner listed as an option for Hadoop on the Compute Engine website.

Lately, other big data companies (DataStax and DataTorrent, specifically) have also been touting the performance of Compute Engine, although they haven’t been using AWS as a point of comparison (not that I asked).

But none of this is empirical evidence that Compute Engine is, indeed, a better place to run big data workloads than is AWS or, for that matter, any other cloud. The usual caveat that “your mileage may vary” probably applies here, too. For a rundown of what Hadoop services are actually available via cloud providers right now, check out this recent post and its comments. (By the way, the Qubole user interface looks pretty nice, if you’re into that sort of thing.)

Qubole GCE Screenshot

And if performance isn’t everything, it’s probably worth considering that AWS has an ever-growing suite of big data services and seemingly never sleeps on adding new ones. In November, the news was its Kinesis streaming-data service and a slew of beefy new instance types. On Friday, it was the availability of Elastic MapReduce instances running Impala, the open source SQL-on-Hadoop query engine that Cloudera developed in order to make Hadoop queries more interactive than was possible using Hive.

  1. Great to see these technologies converge — now providing mobile access to this only strengthens its potential impact on the progressive organization’s ability to achieve the mobile enterprise status so many desire.

    Peter Fretty

  2. When you say fast processing, big data is about massive amounts of data. Isn’t it silly to transfer all that data from one cloud to another? If organisations are serious about their analytics, they can easily put in their own Hadoop or other big data system closer to the data. Qubole seems to be just hopping on the bubble in the cloud and analytics space.
