
Summary:

Why is it that Heroku, Foursquare, Pinterest, Reddit, Instagram et al. are still so heavily dependent on Amazon’s aging and problematic US-East region? Very good question. Here are some potential answers.

Amazon Web Services
photo: Flickr/Will Merydith

Why do tech-savvy companies like Heroku, Pinterest, Airbnb, Instagram, Reddit, Flipboard, and Foursquare keep so much of their computing horsepower running on Amazon’s aging US-East infrastructure given its problematic track record? US-East experienced big problems again Monday, impacting those sites and more. The latest snafu comes after other outages in June and earlier.

Why they’re sticking with US-East, especially since Amazon itself preaches distribution of loads across availability zones and geographic regions, is the multimillion-dollar question that no one at these companies is addressing publicly. But there are pretty safe bets as to their reasons. For one thing, Ashburn, VA-based US-East came online in 2006 and is Amazon’s oldest and biggest data center (or set of data centers). That’s why a lot of big, legacy accounts run there. Moving applications and workloads is complicated and expensive given data transfer fees. Face it, inertia hits us all; take a look at your own closets and you’ll probably agree. Moving is just not easy. Or fun.
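To make the "distribute your load" advice concrete, here is a minimal sketch in Python of what spreading instances across availability zones buys you when one zone fails. It is a toy capacity model, not anything Amazon or the companies above publish: the zone names, the instance count, and both helper functions are assumptions made up for illustration.

```python
from collections import Counter
from itertools import cycle

# Hypothetical zone names, used only for illustration.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]


def place_round_robin(instance_count, zones):
    """Assign instances to zones in round-robin order.

    Returns a Counter mapping each zone to the number of instances placed there.
    """
    placement = Counter({zone: 0 for zone in zones})
    for _, zone in zip(range(instance_count), cycle(zones)):
        placement[zone] += 1
    return placement


def surviving_fraction(placement, failed_zone):
    """Fraction of instances still running if failed_zone goes dark."""
    total = sum(placement.values())
    if total == 0:
        return 0.0
    return (total - placement.get(failed_zone, 0)) / total


# Everything in one zone versus spread evenly over four zones.
single = place_round_robin(12, ZONES[:1])
spread = place_round_robin(12, ZONES)

print(surviving_fraction(single, "us-east-1a"))  # 0.0
print(surviving_fraction(spread, "us-east-1a"))  # 0.75
```

The sketch only counts instances. It says nothing about replicating data or paying to move it, which is exactly the part the rest of this story suggests is hard.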

Data gravity is one issue. “If you’ve been in US-East for a while, chances are you’ve built up a substantial amount of data in that region. It’s not always easy to move data around depending on how the applications are constructed,” said an industry exec who’s put a lot of workloads in Amazon and did not want to be identified.

In addition, the dirty little secret to the world at large is that many applications running on AWS “are really built with traditional data center architectures, so moving them around is akin to a data center migration, never an easy task in the best of circumstances,” he added. While most companies want to run applications and services in multiple venues, the complexity of doing so can be daunting, he said. He pointed to a post-mortem of an April 2011 Heroku outage as an example.

US-East by default

Vitaly Tavor, founder and vice president of products for Cloudyn, a company that helps customers best utilize Amazon services, said the deck is still stacked in US-East’s favor nine months after Amazon’s new Oregon data center was activated. For one thing, the AWS console directs customers to US-East by default. So if you don’t know better, your stuff is going to go there, he said.

The US-West 2 data center, in Oregon, is newer but also smaller than US-East. Tavor suspects that Amazon may tell very large customers not to move there. “Oregon is much smaller than US-East so if you’re a company of Heroku’s size and need to suddenly launch lots of instances, Oregon might be too small,” he said. And US-West 1, in California, is more expensive than either of the other two because of the region’s higher energy and other costs.

For the record, as of Tuesday morning, Amazon was still sorting out residual issues from the problem — which surfaced there at 10:30 a.m. PDT — according to its status page:

4:21 AM PDT We are continuing to work on restoring IO for the remainder of affected volumes. This will take effect over the next few hours. While this process continues, customers may notice increased volume IO latency. The re-mirroring will proceed through the rest of today.

I have reached out to several of the affected companies and to Amazon itself and will update this if and when they respond. Of course, Amazon competitors are having a field day. Check out Joyent’s mash note to Reddit.

  1. I’m pretty sure the EBS outage affected only 1 of the 4 Availability Zones in US East 1 Region. Amazon preaches cross AZ balancing and scaling which they don’t charge extra for. If your application can’t do that, then maybe you shouldn’t be using AWS in the first place.

    1. hm, i had read more AZs than that but i’ll go back and check... don’t they also counsel you to balance across geos as well?

      1. you’re right, one AZ in Ashburn

      2. Nope. They clearly state that their services are isolated per Region. You can’t use their Elastic Load Balancers and Auto-Scaling groups across Regions. If you wanted to go across Regions, you would need a 3rd party Global Load Balancer in place. Data replication would need to be dealt with as well.

  2. Well, using multiple zones is not exactly free: there’s some inter-zone fee, although it’s not too high.
    However all zones rely on a single control plane, so they are not exactly independent.
    Also, EBS volumes need to be replicated into at least two zones. When one zone goes down, and EBS volumes in other zones sense that there’s no available replica, they all start the replication process at the same time to other zones. It’s exactly that mechanism that brought us-east-1 to its knees during the previous outage.
