
Summary:

When Google launched its EC2 rival, Google Compute Engine, last June, it set some high expectations. Sebastian Stadil’s team at Scalr put the cloud infrastructure service through its paces — and was pleasantly surprised at what it found.

Google Compute Engine vs. Amazon EC2

Update: as a follow-up to the preliminary benchmarks obtained and cited below, we are working on a new set of benchmarks that are more accurate and more reflective of real-world use cases. In these tentative new benchmarks, the performance difference is less significant, and in some cases AWS may hold the lead. More to come here: https://github.com/Scalr/perf-benchmarks/.

At Scalr, we’ve been happy customers of Amazon’s infrastructure service, EC2, since 2007. In fact, we’ve built our tools for EC2 because we saw an opportunity to leverage its flexibility to help AWS customers easily design and manage resilient services. But as competitors spring up, we always test them to see how they compare, especially with regard to I/O performance.

On a warm June day in San Francisco, the Scalr team attended Google I/O 2012. Google was rumored to be launching an EC2 competitor, which we were interested in for our multi-cloud management software. It launched. And boy, did it sound good. You see, EC2 and GCE offer pretty much the same core service, but Amazon has been plagued by poor network and disk performance, so Google’s promise to offer both higher and more consistent performance struck a real chord.

Not ones to be fooled by marketing-driven, hyped-up software, we applied for early access and were let in so we could start testing it ourselves. Once we got in, we felt like kids in a candy store. Google Compute Engine is not just fast. It’s Google fast. In fact, it’s a class of fast that enables new service architectures entirely. Here are the results from our tests, along with explanations of how GCE and EC2 differ, as well as comments and use cases.

A note about our data: the benchmarks used to collect the data presented here were run twice a day, over four days, then averaged. When we observed high variance, we took note of it and present it here as intervals into which 80 percent of the observed data points fall.

API

First off, GCE’s API is beautifully simple, explicit and easy to work with. Just take a look at it. Their firewalls are called “firewalls,” VLANs are “networks,” and kernels are “kernels” (AKIs, anyone?). Anyone familiar with Unix will feel right at home.
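
As a small taste of that simplicity, here is a minimal sketch (not our tooling) that lists instances with the discovery-based google-api-python-client; the project and zone names are hypothetical, and credential setup is omitted:

```python
from googleapiclient import discovery  # pip install google-api-python-client

# Hypothetical project and zone; assumes application default credentials.
compute = discovery.build("compute", "v1")
resp = compute.instances().list(project="my-project", zone="us-central1-a").execute()
for vm in resp.get("items", []):
    print(vm["name"], vm["status"])
```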

Fast boot

Second, VMs are deployed and started with impressive speed (and we’ve extensively used 10 clouds). It routinely takes less than 30 seconds to log in as root after making the insert call to launch a VM. As a reference point, this is the amount of time it takes AWS just to reach the running state, after which you still need to wait for the OS to boot, for a total of 120 seconds on a good day and 300 on a bad one (data points taken from us-east-1).
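
For clarity, the boot times we compare run from the launch call to a usable SSH connection, not just to the provider reporting “running.” Below is a minimal sketch of one way to approximate that measurement (not the exact script we used):

```python
import socket, time

def seconds_until_ssh(host, port=22, timeout=600):
    """Poll the VM's SSH port until it answers; return the elapsed seconds."""
    start = time.time()
    while time.time() - start < timeout:
        try:
            with socket.create_connection((host, port), timeout=2):
                return time.time() - start
        except OSError:
            time.sleep(1)
    raise TimeoutError("%s not reachable within %d seconds" % (host, timeout))
```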

GCE vs. EC2: Boot times chart

Boot times are measured in seconds.

We don’t know what sort of sorcery Google does here, but they clearly demonstrate engineering prowess. That’s 4-10x faster.

Volumes

Those of you familiar with Amazon’s EBS volumes know that you can attach and detach volumes to any instance, anytime. On GCE, you can’t (at least not yet). This precludes you from switching drives to minimize downtime: attaching a volume to a running server lets you skip the boot and configure stages of bringing a new node up, which is useful when promoting an existing MySQL slave to master and you just need to swap out storage devices.
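
To make that storage-swap workflow concrete, here is a minimal sketch using boto against EC2; the volume, instance and device names are hypothetical, and a production script would also poll until the volume shows as detached and available before re-attaching it:

```python
import boto.ec2  # assumes AWS credentials are configured for boto

conn = boto.ec2.connect_to_region("us-east-1")

# Hypothetical IDs: pull the data volume off the old master...
conn.detach_volume("vol-0123abcd", force=True)
# ...and hand it to the already-booted slave being promoted, skipping a full
# boot-and-configure cycle for a brand-new node.
conn.attach_volume("vol-0123abcd", "i-0456efgh", "/dev/sdf")
```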

While GCE’s “disks” (as they call them) have that one disadvantage, they offer some unique advantages over Amazon volumes. For example, disks can be mounted read-only on multiple instances, which makes for more convenient fileserving than object stores, especially for software such as WordPress (see disclosure) or Drupal that expects a local filesystem. Disks are really fast, too, and don’t seem to have the variable performance that plagued EBS before the introduction of Provisioned IOPS. See for yourself in the following benchmarks.

                             GCE          EC2
Writes on ephemeral disk     157 MB/s     38-45 MB/s
Reads on ephemeral disk      93.3 MB/s    100-110 MB/s
Writes on persistent disks   84.5 MB/s    35-45 MB/s
Reads on persistent disks    98.9 MB/s    80-100 MB/s

As you can see, GCE and EC2 are equivalent on reads, but GCE is 2-4x faster on writes.

GCE vs. EC2: Read/write speeds

Read/write speeds are measured in MB/s. Higher numbers mean faster throughput.
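
For readers who want to reproduce throughput numbers of this kind, below is a minimal sequential-write sketch in Python. It is not the benchmark tool we used, and a serious disk benchmark should also control for caching, block size and queue depth (dedicated tools such as fio or iozone do this properly):

```python
import os, time

def write_throughput_mb_s(path, size_mb=1024, block_kb=64):
    """Sequentially write size_mb of random data to path, fsync, return MB/s."""
    block = os.urandom(block_kb * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range((size_mb * 1024) // block_kb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually hit the disk
    return size_mb / (time.time() - start)

# Example: benchmark whatever filesystem /mnt/pd is mounted on (hypothetical path).
# print(write_throughput_mb_s("/mnt/pd/testfile"))
```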

Network

A short note about multi-cloud: I’m talking here about services that span multiple clouds, such as replicating a database from us-east-1 to us-west-1 for disaster recovery or latency-lowering purposes, not the multi-cloud management capabilities widely used in the enterprise. I believe that first kind of multi-cloud is a myth driven by the industry’s less tech-savvy folks. I’ve seen too many people attempt it unsuccessfully to recommend it: what usually happens is that the slave database falls behind the master, with an ever-increasing inconsistency window, because the load on the master exceeds the meager bandwidth available between master and slave. Our friends at Continuent are doing great work with Tungsten to accelerate that, but still.

Google’s network is so fast, however, that this kind of multi-cloud might just be possible. To illustrate the difference in speeds, we ran a bandwidth benchmark in which we copied a single 500 MB file between two regions. It took 242 seconds on AWS at an average speed of 15 Mbit/s, and 15 seconds on GCE at an average speed of 300 Mbit/s. GCE came out 20x faster.
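
For anyone who wants to sanity-check those figures, the arithmetic is straightforward (assuming the 500 MB file size):

```python
# Back-of-the-envelope check on the reported transfer speeds.
FILE_MBIT = 500 * 8              # a 500 MB file is roughly 4,000 megabits

aws_seconds, gce_seconds = 242, 15
print(FILE_MBIT / aws_seconds)   # ~16.5 Mbit/s on AWS (reported average ~15 Mbit/s)
print(FILE_MBIT / gce_seconds)   # ~267 Mbit/s on GCE (reported average ~300 Mbit/s)
```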

GCE vs. EC2: Bandwidth chart

Higher bandwidth is better and means faster up and downlinks.

Having been thoroughly impressed, we then ran a latency benchmark between the same regions. We got an average of 20 ms for GCE and 86 ms for AWS. GCE came out 4x faster.
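
Latency here means round-trip time between VMs. As a rough illustration of how one might measure it from inside an instance (not our exact methodology; the peer hostname is hypothetical), timing TCP connection setup gives a reasonable RTT estimate even when ICMP is filtered:

```python
import socket, statistics, time

def tcp_rtt_ms(host, port=22, samples=20):
    """Estimate round-trip latency by timing TCP connection establishment."""
    rtts = []
    for _ in range(samples):
        start = time.time()
        with socket.create_connection((host, port), timeout=5):
            rtts.append((time.time() - start) * 1000.0)
        time.sleep(0.2)
    return statistics.median(rtts)

# print(tcp_rtt_ms("instance-in-other-region.example.com"))  # hypothetical peer
```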

GCE vs. EC2: Latency benchmark chart

Lower latency is better and means shorter wait times.

This might enable new architectures, and high-load replicated databases might just become possible. Put a slave in different regions of the US (and if/when GCE goes international, why not different regions of the world?) to dramatically speed up read performance for SaaS applications.
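
As a toy illustration of that read-replica architecture (hypothetical hostnames, not Scalr code), reads can be routed to the closest regional slave while every write still goes to the single master:

```python
# Hypothetical hostnames; a real deployment would add connection pooling and health checks.
MASTER = "db-master.us-east.example.com"
REPLICAS = {
    "us-east":    "db-replica.us-east.example.com",
    "us-central": "db-replica.us-central.example.com",
    "us-west":    "db-replica.us-west.example.com",
}

def db_host(operation, client_region):
    """Send reads to the nearest replica and all writes to the master."""
    if operation == "read":
        return REPLICAS.get(client_region, MASTER)
    return MASTER
```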

Of course, unless Amazon and Google work together to enable Direct Connect, bandwidth from GCE to EC2 will still be slow. I also hear that Amazon is working on creating a private backbone between regions to enable the same use cases, which would be a smart, if expected, move from them.

Multi-region images

We’re not quite sure why AWS doesn’t support this, but images on GCE are multi-region (“multi-zone” in their terms), that is to say when you snapshot an instance into an image, you can immediately launch new instances from that image in any region. This makes disaster recovery that much easier and makes their scheduled region maintenance (which occurs a couple of times a year) less of a problem. On that note, I’d also like to add that it forces people to plan their infrastructure to be multi-region, similar to what AWS did for instance failure by making local disk storage ephemeral.
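
For contrast, here is a rough sketch of the extra step AWS requires: an explicit cross-region image copy (mentioned in the comments below), shown with boto. The AMI ID is hypothetical, and this assumes a boto release that includes copy_image:

```python
import boto.ec2  # assumes AWS credentials are configured for boto

# Hypothetical AMI ID: on AWS the image must be explicitly copied into each
# target region before it can be launched there; on GCE an image is usable
# everywhere as soon as it exists.
dest = boto.ec2.connect_to_region("us-west-1")
copy = dest.copy_image(source_region="us-east-1",
                       source_image_id="ami-0123abcd",
                       name="myapp-base")
print(copy.image_id)  # the new AMI ID registered in us-west-1
```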

So should you switch?

AWS offers an extremely comprehensive cloud service, with everything from DNS to database. Google does not. This makes building applications on AWS easier, since you have bigger building blocks. So if you don’t mind locking yourself into a vendor, you’ll be more productive on AWS.

But that said, with Google Compute Engine, AWS has a formidable new competitor in the public cloud space, and we’ll likely be moving some of Scalr’s production workloads from our hybrid AWS-Rackspace-SoftLayer setup to it when it leaves beta. There’s a strong technical case for migrating heavy workloads to GCE, and I’ll be grabbing popcorn to eagerly watch as the battle unfolds between the giants.

Sebastian Stadil is the founder of Scalr, a simple, powerful cloud management suite, and SVCCG, the world’s largest cloud computing user group. When not working on cloud, Sebastian enjoys making sushi and playing rugby.

Note: Data scientists from LinkedIn, Continuuity, Quantcast and NASA will talk about their hardware and software stacks at our “guru panel” at Structure:Data next week, March 20-21, in New York City.

Disclosure: Automattic, maker of WordPress.com, is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, GigaOM. Om Malik, founder of GigaOM, is also a venture partner at True.

Update: This story was updated at 7:34 a.m. PDT on May 15, 2013 to note that Scalr is working on a new set of benchmarks and will publish those results soon.


  1. “I’m talking here about services that span multiple clouds, such as replicating a database from us-east-1 to us-west-1, for disaster recovery or latency-lowering purposes, not the multi-cloud management capabilities widely used in the enterprise. I believe that first kind of multi-cloud is a myth driven by the industry’s less tech-savvy folks. I’ve seen too many people attempt it unsuccessfully to recommend it: what usually happens is the slave database falls behind on the master, with an ever-increasing inconsistency window, ”

    This statement is dependent on the type of database replication in question. Many modern eventually consistent data stores, such as Cassandra, make this a real possibility, even for production workloads.

    1. Would love to hear more about that, Ben. How does Cassandra work around this?

      1. Also, multi-cloud DB replication works just fine as long as you are not “web-scale” — which most companies & enterprises are not!

        In my personal, direct experience, up to 5 MB/s of traffic and up to 20K queries/second, multi-DC replication is just fine without any problem.

        AD

      2. @Ajit, I agree with you on the “web-scale” comment. I’ll add that in most cases, low-load databases don’t need replication over a wide area network, but if they do, they can probably get it to work, even if it might be a bit brittle.

      3. Since Cassandra is a masterless data store (each node knows which nodes are responsible for which data), the node that receives the initial request can proxy it to the other datacenter. Because any node can accept writes on behalf of other nodes, and writes are merged based on timestamps, many nodes can receive write requests in parallel, which allows for better throughput. Unlike MySQL, which basically uses a single thread per database and a single receiver to process requests in order, Cassandra allows replication to be processed by many nodes in parallel.

        http://www.youtube.com/watch?v=u7nHyzFHqMA is a good video about Cassandra partitioning and replication.

    2. Depends on whom you are leaning on for multi-cloud management capabilities. RightScale was born in 2006, pre-EC2, and has deployed somewhere in the ballpark of 6 million instances. Their Professional Services team concentrates on HA/DR and multi-cloud management, and has successfully deployed multi-cloud setups, including databases, countless times.
      It’s feasible and doable; you just need the right resources to deploy a resilient architecture.

  2. David Mytton Friday, March 15, 2013

    Great writeup, although I’d like to see much more detailed benchmarks at greater sample rates across a longer time period – that’d show hourly and daily variability, e.g. lunchtime Friday vs. early Tuesday morning.

    I don’t think Google do a good job at promoting GCE and their other AWS-competitor products. Amazon continues to have the biggest market share because they just don’t stop improving things. I’d love for Google to do this “properly” and really attack AWS on every front. They have the infrastructure and engineering capabilities to really get some innovative releases and pricing out, but it still feels like they’re not that concerned about it.

    1. “I don’t think Google do a good job at promoting GCE”

      I’ll second that. This is not really my space, so I’m not always aware of developments, but I was completely unaware that Google had an offering here. (Years ago, I wondered why they didn’t.)

      1. Sebastian Stadil Gareth Friday, March 15, 2013

        GCE is still in beta, and I wouldn’t make a big bang if I wasn’t going to let anyone in.

        Larry Page’s “more wood behind fewer arrows” strategy is really showing here, with Reader and more being shut down, but the best and brightest folks in cloud being hired to work on it. Folks such as Jay Judkowitz and William Vambenepe, Martin Buhr and Navneet Joneja, Allan Naim and Jessie Jiang…

        @Gareth: the Google Cloud Platform team realized that Google App Engine was a very all-or-nothing approach to cloud. As soon as you had the littlest requirement that didn’t fit it, you had to move to EC2. GCE is their response.

  3. marcossilvapereira Friday, March 15, 2013

    Very informative writeup, but I missed a section about price. How do AWS and GCE compare on price? AWS has reserved instances, for instance. Does GCE have something similar?

    1. Author Sebastian here: GCE does not have a reserved-instance equivalent. On-demand pricing is similar, though.

    2. You can compare the prices for GCE (compute, storage, networking) with AWS and other clouds here: http://www.planforcloud.com/

  4. “500 Mb file” -> “500 MB file”

    1. I stand corrected. Thanks Dmitri.

  5. Gustavo Niemeyer Friday, March 15, 2013

    “Those of you familiar with Amazon’s EBS volumes know that you can attach and detach volumes to any instance, anytime. On GCE, you can’t (at least not yet).”

    You actually can now:

    https://developers.google.com/compute/docs/disks#attachdiskrunninginstance

    1. Thanks for the catch. I guess their development velocity is higher than my writing capacity!

  6. I used to have gigabit on the east coast and the west coast too, on black fiber. And transfer rates were more like 600 Mbit; it all depends on the amount of load and the ISPs’ datacenters. So those speeds should vary. It would be nice if you published the source of your tests so others can use those benchmarks to verify, instead of just showing a Google chart.

    1. I’ve been using Google’s storage servers for about 6 months now. It is lightning-fast, but we will only be able to deploy it fully when they start providing proper invoices.
      Our accounts dept. will just not accept their emailed “billing summary.”

      1. I concur with Jen here. They still have work on their plate!

      2. Google has been providing absolutely useless billing receipts/invoices for years and appears to have no desire to change. Try to identify a Google Site Search from a receipt for example. If I staple together 3-4 pages and highlight sets of purchase times, search engine IDs, and service subscription titles, it becomes an accountable invoice/receipt. I complain about this every time I get a feedback email/link/form. It has changed only slightly over the last 5 years.

    2. Hello. Could you please tell me more about the black fiber? Not familiar with the technical term. What kind of equipment terminated the circuit on your end? What kind of equipment and optics on your provider’s end?

      1. Black fiber = dark fiber, I’m guessing given context. Dark fiber is a dedicated network connection between two geographically separated points that doesn’t connect to the internet / internet backbone, so general internet congestion or network quality doesn’t have any impact on content sent through it.

        Honestly, I’m rather surprised Amazon hasn’t already got dark fiber between their regions.

        1. I was under the impression that dark fiber is fiber that was laid and then never used. There was a large build out of fiber at one time and a lot of that went dark.

      2. I think there might be build out issues such as permits that get in the way. Far from an expert here, so I can only speculate.

  7. I think the difference in throughput is because you’re comparing Google’s private backbone to Amazon sending traffic over the internet to get between regions. I remember that James Hamilton, the guy who built out the Amazon network, explained that the “network is getting in the way” and that this is by design. Good news that Google gave us a better option.

    1. Sebastian Stadil GuestY Friday, March 15, 2013

      You are exactly right. This is a comparison of private backbone vs. public internet more than a comparison of AWS vs. GCE. But from a practitioner’s point of view, it doesn’t matter.

      1. Except that a private backbone is going to be much more prone to failure/human error than the internet.

      2. I’d be stunned if Google doesn’t have a full mesh of dark fiber set up between their datacenters. If one route goes down, the traffic would go via a different datacenter/route. That’s geographical networking 101.

      3. @Twirrim: I would certainly hope that they have thought of that!

  8. AWS supports multi region images now.

    Be interesting if you threw in price here.

    The bandwidth sounds awesome, but is that because hardly anyone is using GCE yet, or because it will always be that fast? EC2 does not guarantee what your bandwidth will be; you are limited by sharing with your “neighbors.” I would imagine that as GCE gains steam this would even out.

    1. The difference with AWS image copy (a fantastic and long-awaited addition) is that you have to request it. On GCE, you don’t have to.

      This is significant, because you no longer need to worry about sync, nor are you stuck if a region goes down and image copy no longer works.

      1. Automatic sync probably means a higher price. So it’d be better to make it an option, so people really can go for the cheapest possible price at the expense of availability.

      2. In my experience, image storage costs are trivial. If you have a 2 GB image synced across 5 regions, you are looking at about $1.50/mo. The reduction in errors and the convenience are more than worth it.

  9. Wait a second. You are testing us-east-1 (Virginia) to us-west-1 (San Francisco), so 86 ms is about the latency one would anticipate.

    Where are the two Google regions that you tested? I’m pretty sure it’s not Virginia-San Francisco, unless Google has somehow acquired some alien wormhole technology from Area 51 to cut that time down to 20 ms.

    Your data transfer results aren’t really fair, since latency plays a key role in data transfers. 86 ms vs. 20 ms makes a big difference.

    Perhaps you should try testing us-west-1 (San Francisco) against us-west-2 (Oregon), which I think might be near 20 ms, but I don’t know offhand. It would at least give a more accurate comparison.

    1. Yes, could you please clarify between which regions you are measuring to get 20 ms latency? I assume this is round-trip delay. San Francisco and Virginia are 2,807 miles apart, which is a one-way delay of about 15 ms at the speed of light. That means the round-trip delay is at least 30 ms. So what are you comparing?

      1. Sebastian Stadil sxw Monday, March 18, 2013

        We made a best effort to choose regions that were comparable, but we were not provided with the exact locations. When GCE leaves beta, I look forward to publishing updated numbers.

      1. Thanks for the addition, Jayan. We’ve observed similar latency for west coast AWS.

        For those who haven’t clicked his link, they are:

        TCPing

        Median : 31.2 milliseconds
        Mean : 31.7 milliseconds
        95th percentile : 31.2 milliseconds

        ICMP Ping

        Median : 20 milliseconds
        Mean : 20.6 milliseconds
        95th percentile : 25 milliseconds

  10. “We got an average of 20ms for GCE and 86ms for AWS. GCE came out 4x faster.”
    Well no duh! AWS has us-east and us-west. GCE has us-east and us-central.

    What instance sizes and EBS settings did you use for your tests?
    Did you do your tests in us-east, a well-known full-to-capacity region, or us-west-2, which performs much better?

    And I wonder if following this article AWS will remove this http://aws.typepad.com/aws/2013/03/the-aws-report-sebastian-stadil-of-scalr.html ? ;)

    1. Sebastian Stadil Mxx Monday, March 18, 2013

      The AWS team has reached out to us for more information on our benchmarks; we’ve offered to have the conversation in the comments here, but they aren’t too keen on the idea it seems.

      So far that report is still online. Bets are off though. ;-)

      1. We’ll be providing an update shortly, and the AWS team is being very helpful.
