A significant Amazon Web Services outage, which took down popular sites including Heroku for hours late Thursday, shows the risk of concentrating too many workloads in one data center. While this outage occurred in Amazon’s cloud, it wasn’t just a cloud-specific problem. It shows that building in redundancy is critical, whether your app runs in your own data center or in someone else’s cloud.
In short, AWS users should make sure their workloads run across multiple AWS regions to prevent future snafus.
As Om reported earlier, Amazon attributed the failure to a power outage affecting its U.S. East data center in Virginia. That makes sense: U.S. East is Amazon’s oldest and biggest data center. It suffered another major outage in April 2011 and was also beset by performance problems as Amazon rebooted thousands of EC2 instances in December.
Amazon is notoriously tight-lipped about its data centers, but in March Accenture analyst Huan Liu used his own techniques to come up with pretty impressive stats about the inner workings of AWS including the relative size of its data centers. U.S. East is the largest by far (see Huan Liu’s chart below).
This outage, which first showed up on the Amazon Web Services dashboard at 8:50 p.m. Pacific time and was declared resolved at 3:26 a.m. Pacific time, drew lots of headlines and posturing (cloud competitors leapt onto Twitter to state that their sites were up and running), but cloud experts warned against overreaction. (AWS VP and CTO Werner Vogels will speak at next week’s GigaOM Structure conference.)
This is a tempest in a teapot, said Carl Brooks, analyst at Tier1 Research. “AWS outages are still magnified out of proportion to their severity. It doesn’t help their credibility with the paleoconservative enterprise paranoid who will use this as an excuse to buy more absurdly overpriced IT from the usual suspects.”
In other words, take a deep breath. And make sure you design your AWS workloads to run across geographies.
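The multi-region advice above can be sketched in code. The following is a minimal illustrative sketch, not Amazon’s own API: the region names are real AWS regions, but the `is_healthy` check is a stand-in for whatever health probe an application actually uses (an ELB health check, a ping to a status endpoint, etc.).

```python
# Hypothetical sketch of region failover: prefer the primary region,
# but fall back to a second region when the primary is unavailable.
# "is_healthy" is an assumed application-supplied health check.

REGIONS = ["us-east-1", "us-west-2"]  # primary first, then failover

def pick_region(is_healthy):
    """Return the first healthy region in preference order, or None."""
    for region in REGIONS:
        if is_healthy(region):
            return region
    return None  # every region is down

# Simulate Thursday's scenario: us-east-1 is out, traffic shifts west.
down = {"us-east-1"}
print(pick_region(lambda region: region not in down))  # -> us-west-2
```

In practice this decision is usually made by DNS (for example, weighted or failover routing) rather than application code, but the principle is the same: no single region should be a hard dependency.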