Update: as a follow up to the preliminary benchmarks obtained and cited below, we are working on a new set of benchmarks that are more accurate and reflective of real-world use cases. In these tentative new benchmarks, the performance difference is less significant, and in some cases AWS may hold a lead. More to come and here: https://github.com/Scalr/perf-benchmarks/ .
At Scalr, we’ve been happy customers of Amazon’s infrastructure service, EC2, since 2007. In fact, we’ve built our tools for EC2 because we saw an opportunity to leverage its flexibility to help AWS customers easily design and manage resilient services. But as competitors spring up, we always test them to see how they compare, especially in regards to io performance.
On a warm June day in San Francisco, the Scalr team attended Google I/O 2012. Google was rumored to be launching a EC2 competitor, which we were interested in for our multi-cloud management software. It launched. And boy did it sound good. You see, EC2 and GCE offer pretty much the same core service, but Amazon has been plagued by poor network and disk performance, so Google’s promise to offer both higher and more consistent performance struck a real chord.
Not ones to be fooled by marketing-driven, hyped-up software, we applied for early access and were let in so we could start testing it ourselves. Once we got in, we felt like kids in a candy store. Google Compute Engine is not just fast. It’s Google fast. In fact, it’s a class of fast that enables new service architectures entirely. Here are the results from our tests, along with explanations of how GCE and EC2 differ, as well as comments and use cases.
A note about our data: The benchmarks run to collect the data presented here were taken twice a day, over four days, then averaged. When a high variance was observed, we took note of it and present it here as intervals for which 80 percent of observed data points fall into.
First off, GCE’s API is beautifully simple, explicit and easy to work with. Just take a look at it. Their firewalls are called “firewalls,” vlans are “networks,” and kernels are “kernels” (AKIs, anyone?). Anyone familiar with Unix will feel right at home.
Second, VMs are deployed and started with impressive speed (and we’ve extensively used 10 clouds). It routinely takes less than 30 seconds to login as root after making the insert call to launch a VM. As a reference point, this is the amount of time it takes AWS to get to the running state, after which you still need to wait for the OS to boot, for a total of 120 seconds on a good day, and 300 on a bad one (data points taken from us-east-1).
We don’t know what sort of sorcery Google does here, but they clearly demonstrate engineering prowess. That’s 4-10x faster.
Those of you familiar with Amazon’s EBS volumes know that you can attach and detach volumes to any instance, anytime. On GCE, you can’t (at least not yet). This precludes you from switching drives to minimize downtime: attaching a volume on a running server allows you to skip the boot and configure stages of bringing a new node up, which is useful when promoting an existing mysql slave to master and you just need to swap out storage devices.
While GCE’s “disks” (as they call them) have that one disadvantage, they offer some unique advantages over Amazon volumes. For example, disks can be mounted read-only on multiple instances, which makes for more convenient fileserving than object stores, especially for software such as WordPress (see disclosure) or Drupal that expect a local filesystem. Disks are really fast too, and don’t seem to have the variable performance that plagued EBS before the introduction of Provisioned IOPS. See for yourself in the following benchmarks.
|Writes on ephemeral disk||157 MB/s||38-45 MB/s|
|Reads on ephemeral disk||93.3 MB/s||100-110 MB/s|
|Writes on persistent disks||84.5 MB/s||35-45 MB/s|
|Reads on persistent disks||98.9 MB/s||80-100 MB/s|
As you can see, GCE and EC2 are equivalent on reads, but GCE is 2-4x faster on writes.
A short note about multi-cloud. I’m talking here about services that span multiple clouds, such as replicating a database from us-east-1 to us-west-1, for disaster recovery or latency-lowering purposes, not the multi-cloud management capabilities widely used in the enterprise. I believe that first kind of multi-cloud is a myth driven by the industry’s less tech-savvy folks. I’ve seen too many people attempt it unsuccessfully to recommend it: what usually happens is the slave database falls behind on the master, with an ever-increasing inconsistency window, because the load on the master exceeds the meager bandwidth available between master and slave. Our friends at Continuent are doing great work with Tungsten to accelerate that, but still.
Google’s network is so fast, however, that this kind of multi-cloud might just be possible. To illustrate the difference in speeds, we ran a bandwidth benchmark in which we copied a single, 500 Mb file between two regions. It took 242 seconds on AWS at an average speed of 15 Mbit/s, and 15 seconds on GCE with an average speed of 300Mbit/s. GCE came out 20x faster.
After being so very much impressed, we made a latency benchmark between the same regions. We got an average of 20ms for GCE and 86ms for AWS. GCE came out 4x faster.
This might allow new architectures, and high-load replicated databases might just become possible. Put a slave in different regions of the US (and if/when GCE goes international, why not different regions of the world?) to dramatically speed up SaaS applications for read performance.
Of course, unless Amazon and Google work together to enable Direct Connect, bandwidth from GCE to EC2 will still be slow. I also hear that Amazon is working on creating a private backbone between regions to enable the same use cases, which would be an expected smart move from them.
We’re not quite sure why AWS doesn’t support this, but images on GCE are multi-region (“multi-zone” in their terms), that is to say when you snapshot an instance into an image, you can immediately launch new instances from that image in any region. This makes disaster recovery that much easier and makes their scheduled region maintenance (which occurs a couple of times a year) less of a problem. On that note, I’d also like to add that it forces people to plan their infrastructure to be multi-region, similar to what AWS did for instance failure by making local disk storage ephemeral.
So should you switch?
AWS offers an extremely comprehensive cloud service, with everything from DNS to database. Google does not. This makes building applications on AWS easier, since you have bigger building blocks. So if you don’t mind locking yourself into a vendor, you’ll be more productive on AWS.
But that said, with Google Compute Engine, AWS has a formidable new competitor in the public cloud space, and we’ll likely be moving some of Scalr’s production workloads from our hybrid aws-rackspace-softlayer setup to it when it leaves beta. There’s a strong technical case for migrating heavy workloads to GCE, and I’ll be grabbing popcorn to eagerly watch as the battle unfolds between the giants.
Sebastian Stadil is the founder of Scalr, a simple, powerful cloud management suite, and SVCCG, the world’s largest cloud computing user group. When not working on cloud, Sebastian enjoys making sushi and playing rugby.
Note: Data scientists from LinkedIn, Continuuity, Quantcast and NASA will talk about their hardware and software stacks at our “guru panel” at Structure:Data next week, March 20-21, in New York City.
Disclosure: Automattic, maker of WordPress.com, is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, GigaOM. Om Malik, founder of GigaOM, is also a venture partner at True.
Update: This story was updated at 7:34 a.m. PDT on May 15, 2013 to note that Scalr is working on a new set of benchmarks and will publish those results soon.