
Summary:

The popular web sites down today because of the Amazon Web Services outage are getting most of the attention, but they were only a handful among the hundred-plus that were affected, including at least three popular platform-as-a-service providers — Heroku, Engine Yard and DotCloud.


Most of the attention around the Amazon Web Services outage focused on the more popular sites. But they were only a handful among the hundred-plus that were affected, including at least three popular platform-as-a-service providers: Heroku, Engine Yard and DotCloud. That’s because many Platform-as-a-Service (PaaS) offerings are hosted on AWS, essentially adding a developer-friendly layer of abstraction on top of the AWS infrastructure to make writing and deploying applications even easier than with AWS alone. Of course, the downside is that as goes AWS, so goes your PaaS provider.

And that’s exactly what happened today with two very popular PaaS offerings, Heroku and Engine Yard, and the up-and-coming DotCloud. InformationWeek detailed the Engine Yard situation, which was mitigated by its wise decision to begin using a multi-region failover strategy that encompasses data centers outside the US-EAST region that was affected today. Heroku has been issuing status updates all day long as it tries to get its service back up and running and AWS makes improvements on its end. In a very informative blog entry, DotCloud detailed how one goes about achieving maximum availability with AWS. It also talked about what can and did go wrong, alluding to plans for preventing similar problems for DotCloud users in the future.

As you might have noted in the myriad headlines generated by this outage, though, PaaS providers are hardly the most noteworthy services that were down today. An impromptu site called http://ec2disabled.com/ has compiled a list of sites (145 as of 4:40 p.m. PDT) that were affected by the outage, ranging from About.me to Zencoder. (A big hat tip to Reuven Cohen for pointing me to this URL.)

As for the status at ground zero, AWS’s US-EAST region, things seem to be getting better. The last update about the EC2 and Elastic Block Store services, at 1:48 p.m. PDT, read:

1:48 PM PDT A single Availability Zone in the US-EAST-1 Region continues to experience problems launching EBS backed instances or creating volumes. All other Availability Zones are operating normally. Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.

Relational Database Service customers have something to smile about, too:

2:35 PM PDT We have restored access to the majority of RDS Multi AZ instances and continue to work on the remaining affected instances. A single Availability Zone in the US-EAST-1 region continues to experience problems for launching new RDS database instances. All other Availability Zones are operating normally. Customers with snapshots/backups of their instances in the affected Availability zone can restore them into another zone. We recommend that customers do not target a specific Availability Zone when creating or restoring new RDS database instances. We have updated our service to avoid placing any RDS instances in the impaired zone for untargeted requests.
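The advice in both status updates (restore into another zone, and otherwise don't pin a zone at all) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the zone names and the parameter dictionary are hypothetical, and nothing here calls a real AWS API.

```python
from typing import Optional, Sequence, Set


def pick_restore_zone(zones: Sequence[str], impaired: Set[str]) -> Optional[str]:
    """Return the first zone not known to be impaired, or None if all are."""
    for zone in zones:
        if zone not in impaired:
            return zone
    return None


def launch_params(image_id: str, zone: Optional[str] = None) -> dict:
    """Build launch parameters as a plain dict (hypothetical shape).

    The zone key is left out when no zone is given, so placement is
    left to the provider, as the status updates recommend.
    """
    params = {"image_id": image_id}
    if zone is not None:
        params["zone"] = zone
    return params


# Example with made-up zone names: restore away from the impaired zone.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
restore_zone = pick_restore_zone(zones, impaired={"us-east-1a"})
```

A restore from a snapshot would pass `restore_zone` explicitly, while an ordinary launch would call `launch_params` without a zone.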

As Stacey noted this morning, though, the response from the affected sites has been pretty good-natured, probably because, as Quora astutely pointed out in its error message: “we wouldn’t be where we are today without EC2.” I think many sites feel the same way, and they won’t be abandoning AWS anytime soon, if only because there aren’t many, if any, better options in terms of availability. But, like DotCloud, they’ll start thinking about more advanced failover options if they really want their customers to take them seriously going forward.


  1. Good summary, Om.

    You can’t count on Amazon for your disaster recovery, nor perhaps your PaaS provider either. That’s not to say quit using Amazon or PaaS’s, just that your customers will make you responsible. There have been mechanisms available for a while now that would’ve mitigated a lot of these troubles. Companies like Netflix have used them and been fine.

    More on Smoothspan:

    http://smoothspan.wordpress.com/2011/04/21/what-to-do-when-your-cloud-is-down/

    Cheers,

    BW

    1. Thanks Bob. Though Derrick Harris wrote the post. Loved your post.

  2. SimpleGeo didn’t go down. Our apparently unique architecture assumes core AWS services can, and will, fail.

    1. Joe

      Can you share what you did to make the AWS infrastructure work for you guys?

  3. ElasticHosts cloud servers Friday, April 22, 2011

    For all customers affected by EC2 downtime, I would like to recommend ElasticHosts as an alternative cloud service (www.elastichosts.com) – we offer a 5 day free trial for our cloud servers in US or UK, which is likely enough at least to bridge the gap.

  4. If security and reliability are important to your customers, you should still consider dedicated hardware. It takes quite a lot of extra work, but in our case we are happy, including in terms of availability and costs.

  5. No infrastructure is bulletproof. I think the key question for services that host with AWS is whether they have a failover to another cloud. One of the reasons our LongJump PaaS (http://longjump.com) remains hosted on a managed provider rather than purely in an IaaS is that it is still possible for us to switch over to another set of servers. So even though the possibility exists for a server-related shutdown, we can at least recover on our own and not wait for the “Cloud to Clear.”

    But I should note that we’re not bad-mouthing any IaaS. We use them all the time for development and testing. But since our PaaS is our bread and butter, we just tend to be a bit more traditional on the server side. Hosting on private servers, while more of an investment and less elastic, is also the least risky.

  6. Scott McDonald Tuesday, April 26, 2011

    Good points – also note it’s not that hard to fail over to another cloud site. I recommend using 2 different cloud vendors (EC2 and Linode.com): set up MySQL replication between the instances and automate rsync over SSH for web directories, so that both sites are always hot, then use DNS failover techniques to automatically redirect traffic between them.

    But I’m biased, as I offer DNS failover services anyone can afford at dnshat.com, so take this with a grain of salt. Still, I’ve seen this type of automated failover setup work great for many clients using traditional LAMP stacks.
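The decision at the heart of the DNS failover setup described in this comment can be sketched roughly as follows. It is a minimal illustration, not any provider's actual implementation: the hostnames are placeholders, the health check is a bare TCP connect, and the step that actually updates the DNS record is left out because it depends entirely on the DNS service in use.

```python
import socket
from typing import Callable, Optional, Sequence


def tcp_healthy(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Treat a site as healthy if a TCP connection to it succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_active(
    sites: Sequence[str],
    is_healthy: Callable[[str], bool] = tcp_healthy,
) -> Optional[str]:
    """Return the first healthy site in priority order, or None if all fail."""
    for host in sites:
        if is_healthy(host):
            return host
    return None


# Placeholder hostnames: the primary on one vendor, a hot standby on another.
SITES = ["primary.example.com", "standby.example.net"]
```

A monitor would call `choose_active(SITES)` on a schedule and repoint the DNS record whenever the result changes. Keeping the record's TTL short matters here, since clients keep using a cached address until it expires.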
