
By the numbers: How Google Compute Engine stacks up to Amazon EC2

Update: As a follow-up to the preliminary benchmarks obtained and cited below, we are working on a new set of benchmarks that are more accurate and reflective of real-world use cases. In these tentative new benchmarks, the performance difference is less significant, and in some cases AWS may hold a lead. More to come here: https://github.com/Scalr/perf-benchmarks/.

At Scalr, we’ve been happy customers of Amazon’s infrastructure service, EC2, since 2007. In fact, we’ve built our tools for EC2 because we saw an opportunity to leverage its flexibility to help AWS customers easily design and manage resilient services. But as competitors spring up, we always test them to see how they compare, especially with regard to I/O performance.

On a warm June day in San Francisco, the Scalr team attended Google I/O 2012. Google was rumored to be launching an EC2 competitor, which we were interested in for our multi-cloud management software. It launched. And boy did it sound good. You see, EC2 and GCE offer pretty much the same core service, but Amazon has been plagued by poor network and disk performance, so Google’s promise to offer both higher and more consistent performance struck a real chord.

Not ones to be fooled by marketing-driven, hyped-up software, we applied for early access and were let in so we could start testing it ourselves. Once we got in, we felt like kids in a candy store. Google Compute Engine is not just fast. It’s Google fast. In fact, it’s a class of fast that enables entirely new service architectures. Here are the results from our tests, along with explanations of how GCE and EC2 differ, as well as comments and use cases.

A note about our data: The benchmarks used to collect the data presented here were run twice a day, over four days, then averaged. Where we observed high variance, we took note of it and present it here as intervals into which 80 percent of observed data points fall.

API

First off, GCE’s API is beautifully simple, explicit and easy to work with. Just take a look at it. Their firewalls are called “firewalls,” VLANs are “networks,” and kernels are “kernels” (AKIs, anyone?). Anyone familiar with Unix will feel right at home.
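
For illustration, here is roughly what adding a firewall rule looked like with the beta-era gcutil command-line tool (reconstructed from memory, so the exact command and flags may have differed slightly):

    # Open port 80 to the world on the default network.
    gcutil addfirewall allow-http --description="Allow incoming HTTP" --allowed="tcp:80"

Compare that to hunting through security groups and AKIs on the AWS side; the vocabulary alone makes the learning curve gentler.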

Fast boot

Second, VMs are deployed and started with impressive speed (and we’ve used ten clouds extensively). It routinely takes less than 30 seconds to log in as root after making the insert call to launch a VM. As a reference point, that is about how long it takes AWS just to reach the running state, after which you still need to wait for the OS to boot, for a total of 120 seconds on a good day and 300 on a bad one (data points taken from us-east-1).

GCE vs. EC2: Boot times chart
Boot times are measured in seconds.

We don’t know what sort of sorcery Google does here, but they clearly demonstrate engineering prowess. That’s 4-10x faster.
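
If you want to reproduce the measurement yourself, this is the essence of what we timed: the interval from the launch API call to a successful root login over SSH. The sketch below is simplified; launch_instance is a hypothetical helper standing in for whichever CLI or API call your cloud uses to start a VM and print its IP.

    #!/usr/bin/env bash
    # Time from "launch" API call to a usable root login.
    set -e
    start=$(date +%s)
    ip=$(launch_instance)   # hypothetical helper: calls the insert/run API, prints the public IP
    # Poll SSH until a root login actually succeeds -- that's our "booted" condition.
    until ssh -o ConnectTimeout=2 -o StrictHostKeyChecking=no "root@$ip" true 2>/dev/null; do
      sleep 1
    done
    echo "boot took $(( $(date +%s) - start ))s"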

Volumes

Those of you familiar with Amazon’s EBS volumes know that you can attach and detach volumes to and from any instance, at any time. On GCE, you can’t (at least not yet). This precludes you from swapping drives to minimize downtime: attaching a volume to a running server lets you skip the boot and configure stages of bringing up a new node, which is useful when, say, promoting an existing MySQL slave to master and you just need to swap out storage devices.

While GCE’s “disks” (as they call them) have that one disadvantage, they offer some unique advantages over Amazon volumes. For example, disks can be mounted read-only on multiple instances, which makes for more convenient file serving than object stores, especially for software such as WordPress (see disclosure) or Drupal that expects a local filesystem. Disks are really fast, too, and don’t seem to have the variable performance that plagued EBS before the introduction of Provisioned IOPS. See for yourself in the following benchmarks.

                              GCE          EC2
Writes on ephemeral disk      157 MB/s     38-45 MB/s
Reads on ephemeral disk       93.3 MB/s    100-110 MB/s
Writes on persistent disks    84.5 MB/s    35-45 MB/s
Reads on persistent disks     98.9 MB/s    80-100 MB/s

As you can see, GCE and EC2 are equivalent on reads, but GCE is 2-4x faster on writes.

GCE vs. EC2: Read/write speeds
Read/write speeds are measured in MB/s. Higher numbers mean faster throughput.
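
These numbers come from simple sequential dd runs (block sizes of 1024 and 4096 bytes; see the methodology discussion in the comments below). A minimal sketch of that kind of test follows; the target path is a placeholder, it needs to run as root, and note that dd measures sequential throughput only, not IOPS.

    # Sequential write: ~1 GB of zeroes to a file on the disk under test.
    # conv=fsync is added here so the write actually reaches the disk; the
    # original runs may not have used it.
    dd if=/dev/zero of=/mnt/disk-under-test/testfile bs=4096 count=262144 conv=fsync
    # Sequential read: drop the page cache first so we measure the disk, not RAM.
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/disk-under-test/testfile of=/dev/null bs=4096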

Network

A short note about multi-cloud. I’m talking here about services that span multiple clouds, such as replicating a database from us-east-1 to us-west-1 for disaster recovery or latency-lowering purposes, not the multi-cloud management capabilities widely used in the enterprise. I believe that first kind of multi-cloud is a myth driven by the industry’s less tech-savvy folks. I’ve seen too many people attempt it unsuccessfully to recommend it: what usually happens is that the slave database falls behind the master, with an ever-increasing inconsistency window, because the load on the master exceeds the meager bandwidth available between master and slave. Our friends at Continuent are doing great work with Tungsten to accelerate that, but still.

Google’s network is so fast, however, that this kind of multi-cloud might just be possible. To illustrate the difference in speeds, we ran a bandwidth benchmark in which we copied a single 500 MB file between two regions. It took 242 seconds on AWS at an average speed of 15 Mbit/s, and 15 seconds on GCE at an average speed of 300 Mbit/s. GCE came out 20x faster.

GCE vs. EC2: Bandwidth chart
Higher bandwidth is better and means faster up and downlinks.
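
As described further in the comments, the lowest-overhead variant of this transfer was dd piped straight into netcat. A rough sketch, with hostnames and port as placeholders (flags vary slightly between netcat implementations):

    # On the receiving instance (destination region):
    nc -l -p 9000 > /dev/null
    # On the sending instance (source region): push 500 MB and time it.
    time dd if=/dev/zero bs=1M count=500 | nc receiver.example.com 9000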

Suitably impressed, we then ran a latency benchmark between the same regions. We got an average of 20ms for GCE and 86ms for AWS, making GCE’s latency roughly 4x lower.

GCE vs. EC2: Latency benchmark chart
Lower latency is better and means shorter wait times.
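
The latency figures are plain round-trip times; something as simple as the following, run between instances in the two regions, reproduces them (the hostname is a placeholder):

    # 20 ICMP pings; the summary line reports min/avg/max round-trip time in ms.
    ping -c 20 instance-in-other-region.example.com | tail -1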

This might allow new architectures; high-load replicated databases might just become possible. Put slaves in different regions of the US (and if/when GCE goes international, why not different regions of the world?) to dramatically speed up read performance for SaaS applications.
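
As a hedged illustration of that pattern, pointing a read slave in another region at the master is the standard CHANGE MASTER dance; the hostnames, credentials, and binlog coordinates below are placeholders.

    # On the slave instance in the remote region (classic MySQL replication):
    mysql -e "CHANGE MASTER TO
                MASTER_HOST='master.us-east.example.com',
                MASTER_USER='repl',
                MASTER_PASSWORD='secret',
                MASTER_LOG_FILE='mysql-bin.000001',
                MASTER_LOG_POS=4;
              START SLAVE;"
    # App servers in that region then read from the local slave and send writes
    # back to the master; the fat, low-latency pipe is what keeps the slave caught up.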

Of course, unless Amazon and Google work together to enable Direct Connect, bandwidth from GCE to EC2 will still be slow. I also hear that Amazon is working on a private backbone between regions to enable the same use cases, which would be a smart and expected move on their part.

Multi-region images

We’re not quite sure why AWS doesn’t support this, but images on GCE are multi-region (“multi-zone” in their terms): when you snapshot an instance into an image, you can immediately launch new instances from that image in any region. This makes disaster recovery that much easier and makes their scheduled region maintenance (which occurs a couple of times a year) less of a problem. On that note, I’d also like to add that it forces people to plan their infrastructure to be multi-region, similar to what AWS did for instance failure by making local disk storage ephemeral.

So should you switch?

AWS offers an extremely comprehensive cloud service, with everything from DNS to database. Google does not. This makes building applications on AWS easier, since you have bigger building blocks. So if you don’t mind locking yourself into a vendor, you’ll be more productive on AWS.

But that said, with Google Compute Engine, AWS has a formidable new competitor in the public cloud space, and we’ll likely be moving some of Scalr’s production workloads from our hybrid AWS-Rackspace-SoftLayer setup to it when it leaves beta. There’s a strong technical case for migrating heavy workloads to GCE, and I’ll be grabbing popcorn to watch eagerly as the battle unfolds between the giants.

Sebastian Stadil is the founder of Scalr, a simple, powerful cloud management suite, and SVCCG, the world’s largest cloud computing user group. When not working on cloud, Sebastian enjoys making sushi and playing rugby.

Note: Data scientists from LinkedIn, Continuuity, Quantcast and NASA will talk about their hardware and software stacks at our “guru panel” at Structure:Data next week, March 20-21, in New York City.

Disclosure: Automattic, maker of WordPress.com, is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, GigaOM. Om Malik, founder of GigaOM, is also a venture partner at True.

Update: This story was updated at 7:34 a.m. PDT on May 15, 2013 to note that Scalr is working on a new set of benchmarks and will publish those results soon.

73 Responses to “By the numbers: How Google Compute Engine stacks up to Amazon EC2”

  1. sebastianstadil

    Quick update–

    Following this post, we’ve been working with many parties on benchmarks that are more representative of real-world use cases, with more accurate measurements.

    In the preliminary results we’ve obtained, it seems the performance difference is smaller than we initially measured: AWS turns out to be faster in many cases.

    Stay tuned for some updated results!

  2. Re. “Fast Boot”

    We don’t know what sort of sorcery Google does here, but they clearly demonstrate engineering prowess. That’s 4-10x faster.

    The speed of light is still the same in the Googleverse, so no sorcery – just some (currently) secret sauce. I can think of two possibilities:
    GCE could be taking an already-booted VM, reconfiguring its network, and allocating it to you.
    Alternatively, they could be configuring the VM with incredible CPU speed/cores initially, which gets throttled back invisibly as boot nears completion.
    Either way – nice!

    Any Googlers want to comment?

  3. Doc Dawning

    Gee, I didn’t even realize Google had something like this up and running.. AND I even care about these sorts of things. I’m a long time lover of Google, but it feels like they’re losing something lately? Maybe just a valley, with a fresh peak on the way.. I use EC2 a fair bit. I think it’s overly complicated for my use cases and kind of expensive, though once I decided to commit to a Reserved Instance, I stopped caring about price.

  4. Apparently, Google’s services are not aimed at very advanced system engineers/administrators (whatever you want to call them). Many of our customers use AWS; none so far uses Google Compute. Let’s see if that changes in the next few months…

  5. We at GenieDB have been working in GCE together with one of our customers, and at least the network information presented above needs review.

    GCE shows us us-central (with 3 zones) and eu-west (2 zones) datacenters. A few networking related observations:

    1) Two Instances within us-central show a ping time of 20ms. That seems high for the same region.
    2) us-central to eu-west shows a ping time of 116ms. This looks better than usual. Trans-Atlantic is ~90ms, so depending on the real location of us-central, this is better than the expected 135ms (assuming Kansas!) or higher (further west).
    3) Most interestingly… can’t tell the real location of any of these datacenters easily, because traceroutes to both these Instances seem to take us through Santa Clara! Is there some kind of Geo-DNS (or Latency Based Routing) in play?? That would be awesome!

    As for Distributed Databases and Multi-cloud deployments, GenieDB is solving that exact problem. This is important for businesses because it allows for better response time for their users and availability in the face of datacenter failures. For our first release, we have an ‘Eventually Consistent’ model. We tag each row/operation in the database with a temporal timestamp that allows us to determine an order of operations. This makes it possible for us to ‘heal’ the distributed copies of the database to a consistent state when the network connection is restored after a hiccup. While the database copies are disconnected, you still have the option to allow the database to run in a ‘split-brain’ mode. We have a demo up at http://www.geniedb.com/demo that shows just that. We would be happy to set up such a deployment for anyone interested.

  6. Paul Otto

    @SXW – “Yes, could you please clarify between what regions you are measuring to get 20ms latency? I assume this is round trip delay. Between San Francisco and Virginia is 2807 miles, that is about 15ms one way delay based on the speed of light. That means the round trip delay is at least 30ms. So what are you comparing?”

    I’m hoping that Google has abandoned ethernet in favor of a new “tachyon-net” — maybe when beta period is over we’ll see latency reported in negative numbers.

    • Sebastian Stadil

      Azure recently added the ability to launch Linux instances, which makes it closer to Google and Amazon in functionality. We haven’t had a chance to use it, however.

    • Sebastian Stadil

      We initially did this for internal purposes, and thought the community would be interested. I’d love to add Rackspace to the mix too.

  7. The EC2 numbers are not even close to the truth. I guess the author just made them up. Simple examples:

    1. Writes on ephemeral disk for EC2 — If you use any reasonable disk benchmark tool, it should be >=70MB/s per disk (meaning that if you use m1.large, which has two ephemeral disks, you’ll get 150MB/s after striping; see the sketch after this comment).

    2. EC2 boot speed: I guess the author simply doesn’t know how to boot up an instance from EBS. It takes at most 30 sec!

    3. The average latency between EC2 instances is much less than 80ms. In fact, only 1%-2% are as high as about 50ms; the rest are all below 5ms.

    With this, I doubt the credibility of the author, Sebastian Stadil, as well as his company Scalr.
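
    For reference, a minimal sketch of the striping setup described in point 1 (assuming the two ephemeral disks of an m1.large; device names vary by AMI and instance type):

        # Stripe the two ephemeral disks into a single RAID-0 device, then format and mount it.
        mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
        mkfs.ext4 /dev/md0
        mkdir -p /mnt/scratch && mount /dev/md0 /mnt/scratch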

    • Sebastian Stadil

      Please avoid ad hominem attacks.

      1. Perhaps we did something wrong then. Can you suggest a methodology to get the results you describe?

      2. We measured time from API call to login prompt on an m1.small.

      3. Do you mean between EC2 regions? Can you elaborate on how you get those results between us-east and us-west?

  8. GCE cloud is built upon Linux-KVM. Amazon EC2 is built upon Linux-Xen. Based on my own tests and real world implementations of these two hypervisors and their corresponding ecosystems, I’m going to make an educated guess and say that the performance difference noted between these two public clouds is in no small part related to this fundamental architectural difference.

  9. Sebastian,

    This is super helpful. We’re big fans of both systems and I think GCE has the potential to be the second major player in public clouds.

    Would you mind being a bit more specific on your disk benchmark numbers? Without knowing the block sizes you were using, these numbers aren’t really very useful. There isn’t any way to get actual IOPS numbers. I would very much appreciate it if you could clarify your testing methodology for disks.

    Best,

    –Randy

    • Sebastian Stadil

      Hi Randy, good to see you here. I completely agree with you that GCE has the potential to be a major player in public clouds.

      We removed all the technical details to avoid alienating GigaOm’s audience, but the comments seem like a good place for them.

      To measure disk I/O, we used `dd if=/dev/zero of=/` with a bs of 1024 and 4096.

      To measure network IO, we transferred 500MB files using three methods:
      – File served by Apache (HTTP overhead)
      – SCP file (SSH overhead)
      – dd from /dev/zero to netcat (no overhead).

      We did not warm up either the GCE disks or EBS volumes.

  10. Thanks for measuring, but your lack of a coherent methodology implies that others have to do the real work.

    Reporting averages without variance or stdev is pretty useless.

    But no worries, most devs aren’t really all that competent with testing and stats anyways.

    • Sebastian Stadil

      There will always be a call for more rigor, and since we’re not academics nor in the business of benchmarking clouds, this is what we settled on. We ran these benchmarks for internal purposes, then thought others would like to see them too.

      I also argue that GigaOm has a tech-savvy yet still general audience, and a full paper on the subject would not have been a match for it.

    • Sebastian Stadil

      We intentionally did not. We feel that this would be Amazon’s job, to give users a better experience. “Warmed up” volumes do indeed perform better.
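
      For anyone curious, “warming up” an EBS volume at the time simply meant touching every block once before benchmarking it, along these lines (the device name is a placeholder):

          # Read every block of the volume once so first-touch penalties don't skew the benchmark.
          dd if=/dev/xvdf of=/dev/null bs=1M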

    • Sebastian Stadil

      That’s a valid point, Jeff. Make sure you use cloud abstraction / cloud management software to reduce the risks thereof.

      Disclaimer: my company Scalr does this.

  11. ars_technician

    Good write-up. The only problem is that I get the impression that you are comparing the congestion between two interstates, one of which doesn’t have any public traffic on it yet.

    Of course GCE is going to have better throughput, the network is essentially empty at this point.

    • Sebastian Stadil

      Although I have no knowledge of Google network internals, I would assume they don’t have datacenters dedicated to GCE, but rather have a service like Borg define where it runs.

  12. “We got an average of 20ms for GCE and 86ms for AWS. GCE came out 4x faster.”
    Well no duh! AWS has us-east and us-west. GCE has us-east and us-central.

    What instance sizes and EBS settings did you use for your tests?
    Did you do your tests in us-east, a well-known full-to-capacity region, or us-west-2, which performs much better?

    And I wonder if following this article AWS will remove this http://aws.typepad.com/aws/2013/03/the-aws-report-sebastian-stadil-of-scalr.html ? ;)

    • Sebastian Stadil

      The AWS team has reached out to us for more information on our benchmarks; we’ve offered to have the conversation in the comments here, but they aren’t too keen on the idea it seems.

      So far that report is still online. Bets are off though. ;-)

  13. NuageChat

    Wait a second. You are testing us-east-1 (Virginia) to us-west-1 (San Francisco), so 86ms is about the latency one would anticipate.

    Where are the two Google regions that you tested? I’m pretty sure it’s not Virginia-San Francisco, unless Google has somehow acquired some alien wormhole technology from Area 51 to cut that time down to 20ms.

    Your data transfer results aren’t really fair, since latency plays a key role in data transfers. 86ms vs 20ms makes a big difference.

    Perhaps you should try testing us-west-1 (San Francisco) and us-west-2 (Oregon), which I think might be near 20ms, but I don’t know offhand. It would at least give a more accurate comparison.

  14. rschmitty20

    AWS supports multi-region images now.

    It’d be interesting if you threw in price here.

    The bandwidth sounds awesome, but is that because hardly anyone is using GCE yet, or because it will always be that fast? EC2 does not guarantee what your bandwidth will be; you are limited to sharing with your “neighbors.” I would imagine that as GCE gains steam this would even out.

    • Sebastian Stadil

      The difference with AWS image copy (a fantastic and long-awaited addition) is that you have to request it. On GCE, you don’t have to.

      This is significant, because you no longer need to worry about sync, nor are you stuck if a region goes down and image copy no longer works.
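
      To make the difference concrete, on AWS the copy is an explicit, per-region call along these lines (the IDs and name are placeholders), whereas a GCE image is launchable in any region as soon as it exists:

          # Copy an AMI from us-east-1 into us-west-2 before it can be launched there.
          aws ec2 copy-image --source-region us-east-1 --source-image-id ami-12345678 \
              --region us-west-2 --name my-app-image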

      • dlmaniac

        Automatic sync probably means a higher price. So it’d be better to make it an option, so people can go for the cheapest possible price at the expense of availability.

      • Sebastian Stadil

        In my experience, image storage costs are trivial. If you have a 2GB image synced across 5 regions, you are looking at $1.50/mo. The convenience and reduction in errors are more than worth it.

  15. I think the difference in throughput is because you’re comparing Google’s private backbone to Amazon sending traffic over the Internet to get between regions. I remember that James Hamilton, the guy who built out the Amazon network, explained that the “network is getting in the way” and that this is by design. Good news that Google gave us a better option.

    • Sebastian Stadil

      You are exactly right. This is a comparison of private backbone vs. public internet more than a comparison of AWS vs. GCE. But from a practitioner’s point of view, it doesn’t matter.

  16. John A.

    I used to have an east coast gigabit and west coast too on black fiber. And transfer rates were more like 600mbit, it all depends on the amount of load and ISP datacenters. So those speeds should vary. Would be nice if you placed the source of your tests so others can use those benchmarks to verify instead of just showing a google chart.

    • Jen Brannstrom

      I’ve been using Google’s storage servers for about 6 months now. It is lightning-fast, but we will only be able to deploy full use when they start providing proper invoices.
      Accounts dept. will just not accept their emailed “billing summary”.

      • Google has been providing absolutely useless billing receipts/invoices for years and appears to have no desire to change. Try to identify a Google Site Search from a receipt for example. If I staple together 3-4 pages and highlight sets of purchase times, search engine IDs, and service subscription titles, it becomes an accountable invoice/receipt. I complain about this every time I get a feedback email/link/form. It has changed only slightly over the last 5 years.

    • BLACK_MAN

      Hello. Could you please tell me more about the black fiber? Not familiar with the technical term. What kind of equipment terminated the circuit on your end? What kind of equipment and optics on your provider’s end?

      • Twirrim

        Black fiber = dark fiber, I’m guessing given context. Dark fiber is a dedicated network connection between two geographically separated points that doesn’t connect to the internet / internet backbone, so general internet congestion or network quality doesn’t have any impact on content sent through it.

        Honestly, I’m rather surprised Amazon hasn’t already got dark fiber between their regions.

        • I was under the impression that dark fiber is fiber that was laid and then never used. There was a large build out of fiber at one time and a lot of that went dark.

  17. marcossilvapereira

    Very informative writeup, but I missed a section about price. How do AWS and GCE compare on price? AWS has per-instance reserved instances. Does GCE have something similar?

  18. Great writeup, although I’d like to see much more detailed benchmarks at greater sample rates across a longer time period – that’d show hourly and daily variability, e.g. lunchtime Friday vs. early Tuesday morning.

    I don’t think Google do a good job at promoting GCE and their other AWS competitor products. Amazon continues to have the biggest market share because they just don’t stop improving things. I’d love for Google to do this “properly” and really attack AWS on every front. They have the infrastructure and engineering capabilities to really get some innovative releases and pricing out there, but it still feels like they’re not that concerned about it.

    • “I don’t think Google do a good job at promoting GCE”

      I’ll second that. This is not really my space, so I’m not always aware of developments, but I was completely unaware that Google had an offering here (years ago, I wondered why they didn’t).

      • Sebastian Stadil

        GCE is still in beta, and I wouldn’t make a big bang if I wasn’t going to let anyone in.

        Larry Page’s “more wood behind fewer arrows” strategy is really showing here, with Reader and more being shut down, but the best and brightest folks in cloud being hired to work on it. Folks such as Jay Judkowitz and William Vambenepe, Martin Buhr and Navneet Joneja, Allan Naim and Jessie Jiang…

        @Gareth: the Google Cloud Platform team realized that Google App Engine was a very all-or-nothing approach to cloud. As soon as you had the littlest requirement that didn’t fit it, you had to move to EC2. GCE is their response.

  19. Ben Whaley

    “I’m talking here about services that span multiple clouds, such as replicating a database from us-east-1 to us-west-1 for disaster recovery or latency-lowering purposes, not the multi-cloud management capabilities widely used in the enterprise. I believe that first kind of multi-cloud is a myth driven by the industry’s less tech-savvy folks. I’ve seen too many people attempt it unsuccessfully to recommend it: what usually happens is that the slave database falls behind the master, with an ever-increasing inconsistency window …”

    This statement is dependent on the type of database replication in question. Many modern eventually consistent data stores, such as Cassandra, make this a real possibility, even for production workloads.

      • Ajit Deshpande

        Also, multi-cloud DB replication works just fine as long as you are not “web-scale” — which most companies & enterprises are not!

        In my personal, direct experience — up to 5MB/s of traffic and up to 20K queries/second — multi-DC replication is just fine without any problem.

        AD

      • Sebastian Stadil

        @Ajit, I agree with you on the “web-scale” comment. I’ll add that in most cases, low-load databases don’t need replication over a wide area network, but if they do, they can probably get it to work, even if it might be a bit brittle.

      • Terrance

        Since Cassandra is a masterless data store (nodes know which nodes are responsible for which data), the node that receives the initial request can proxy it to the other datacenter. Because any node can accept writes for other nodes, and writes are ordered by timestamps, the distributed copies can be merged later, and the fact that many nodes can receive write requests allows for better throughput. Unlike MySQL, which is basically a single thread per database with a single receiver processing requests in order, Cassandra allows replication to be processed by many nodes in parallel.

        http://www.youtube.com/watch?v=u7nHyzFHqMA is a good video about Cassandra partitioning and replication.
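
        As a rough sketch of what that looks like in practice (CQL3-era syntax; the datacenter names are placeholders and must match your snitch configuration), a multi-datacenter keyspace is declared like this:

            cqlsh -e "CREATE KEYSPACE app
              WITH replication = {'class': 'NetworkTopologyStrategy',
                                  'us-east': 3, 'us-central': 3};"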

    • Depends on who you are leaning on for multi-cloud management capabilities. RightScale was born in 2006, pre-EC2, and they have deployed somewhere in the ballpark of 6 million instances. Their Professional Services team has a concentration in HA/DR and multi-cloud management, and has successfully deployed multi-cloud, including databases, countless times.
      It’s feasible and doable; you just need the right resources to deploy a resilient architecture.