Cloud Platforms Heroku, DotCloud & Engine Yard Hit Hard By Amazon Outage

Most of the attention around the Amazon Web Services outage focused on the more popular sites. But they were only a handful among the hundred-plus that were affected, including at least three popular platform-as-a-service providers: Heroku, Engine Yard and DotCloud. That’s because many Platform-as-a-Service (PaaS) offerings are hosted on AWS, essentially adding a developer-friendly layer of abstraction over the AWS infrastructure to make writing and deploying applications even easier than with AWS alone. Of course, the downside is that as goes AWS, so goes your PaaS provider.

And that’s exactly what happened today with two very popular PaaS offerings, Heroku and Engine Yard, and the up-and-coming DotCloud. InformationWeek detailed the Engine Yard situation, which was mitigated by its wise decision to begin using a multi-region failover strategy that encompasses data centers outside the US-EAST region that was affected today. Heroku has been issuing status updates all day long as it tries to get its service back up and running and AWS makes improvements on its end. In a very informative blog entry, DotCloud detailed how one goes about achieving maximum availability with AWS. It also talked about what can and did go wrong, and alluded to plans for preventing similar problems for DotCloud users in the future.

As you might have noted in the myriad headlines generated by this outage, though, PaaS providers are hardly the most noteworthy services that were down today. An impromptu site has compiled a list of sites (145 as of 4:40 p.m. PDT) that were affected by the outage, Zencoder among them. (A big hat tip to Ruven Cohen for pointing me to this URL.)

As for the status at ground zero, AWS’s US-EAST region, things seem to be getting better. The last update about the EC2 and Elastic Block Store services, posted at 1:48 p.m. PDT, read:

1:48 PM PDT A single Availability Zone in the US-EAST-1 Region continues to experience problems launching EBS backed instances or creating volumes. All other Availability Zones are operating normally. Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.

Relational Database Service customers have something to smile about, too:

2:35 PM PDT We have restored access to the majority of RDS Multi AZ instances and continue to work on the remaining affected instances. A single Availability Zone in the US-EAST-1 region continues to experience problems for launching new RDS database instances. All other Availability Zones are operating normally. Customers with snapshots/backups of their instances in the affected Availability zone can restore them into another zone. We recommend that customers do not target a specific Availability Zone when creating or restoring new RDS database instances. We have updated our service to avoid placing any RDS instances in the impaired zone for untargeted requests.

As Stacey noted this morning, though, the response from the affected sites has been pretty good-natured, probably because, as Quora astutely pointed out in its error message: “we wouldn’t be where we are today without EC2.” I think many sites feel the same way, and they won’t be abandoning AWS anytime soon, if only because there aren’t many, if any, better options in terms of availability. But, like DotCloud, they’ll start thinking about some advanced failover options if they really want their customers to take them seriously going forward.
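For readers wondering what a failover option looks like at its simplest, here is a minimal sketch in Python. It is not how Engine Yard, DotCloud or any other provider mentioned here actually does it. The function name, the region priority list and the simulated health results are all assumptions made up for illustration; a real setup would replace the simulated check with an actual probe and would also have to deal with data replication, which this sketch ignores.

```python
from typing import Callable, Optional, Sequence


def pick_active_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in priority order, that passes its health check.

    Returns None when no region is healthy.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A probe that raises is treated the same as an unhealthy region.
            continue
    return None


# Hypothetical priority list: the primary region first, then the standbys.
REGIONS = ["us-east-1", "us-west-1", "eu-west-1"]

# Simulated health results standing in for real probes (an assumption for
# this sketch): the primary is down, the standbys are up.
simulated_status = {"us-east-1": False, "us-west-1": True, "eu-west-1": True}

active = pick_active_region(REGIONS, lambda region: simulated_status[region])
print(active)  # prints "us-west-1"
```

The point of keeping the selection logic separate from the health check is that the same function works whether the check is an HTTP probe, a DNS monitor or a manual switch.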

8 Responses to “Cloud Platforms Heroku, DotCloud & Engine Yard Hit Hard By Amazon Outage”

  1. Good points. Also note it’s not that hard to fail over to another cloud site. I recommend using two different cloud vendors (EC2 and a second provider), setting up MySQL replication between the instances, and automating rsync over SSH for web directories so that both sites are always hot. Then use DNS failover techniques to automatically redirect traffic between them.

    But I’m biased, as I offer DNS failover services anyone can afford, so take this with a grain of salt. Still, I’ve seen this type of automated failover setup work great for many clients using traditional LAMP stacks.

  2. No infrastructure is bulletproof. I think the key question for services that host with AWS is whether they have a failover to another cloud. One of the reasons our LongJump PaaS remains hosted on a managed provider rather than purely in an IaaS is that it is still possible for us to switch over to another set of servers. So even though the possibility exists for a server-related shutdown, we can at least recover on our own and not wait for the “Cloud to Clear.”

    But I should note that we’re not bad-mouthing any IaaS. We use them all the time for development and testing. But since our PaaS is our bread and butter, we just tend to be a bit more traditional on the server side. Hosting on private servers, while more of an investment and less elastic, is also the least risky.

  3. If security and reliability are important to your customers, you should still consider dedicated hardware. It takes quite a lot of extra work, but in our case we are happy, also in terms of availability and costs.

  4. Good summary, Om.

    You can’t count on Amazon for your disaster recovery, or perhaps on your PaaS provider either. That’s not to say you should quit using Amazon or PaaS offerings, just that your customers will hold you responsible. There have been mechanisms available for a while now that would’ve mitigated a lot of these troubles. Companies like Netflix have used them and been fine.

    More on Smoothspan: