
Update: A truck driver, because of a medical condition, drove into a power transformer in San Antonio, Texas, this evening, causing it to explode. The explosion caused a major power disruption, and the power company cut power in response, which ultimately took down the Dallas/Fort Worth-based data center of Rackspace, our hosting company. Rackspace is based in San Antonio. This is the second time in less than a week that they have had power issues. Rackspace made the following announcement:

Without notifying us the utility providers cut power, and at that exact moment we were 15 minutes into cycling up the data center’s chillers. Our back up generators kicked in instantaneously, but the transfer to backup power triggered the chillers to stop cycling and then to begin cycling back up again—a process that would take on average 30 minutes. Those additional 30 minutes without chillers meant temperatures would rise to levels that could irreparably damage customers’ servers and devices. We made the decision to gradually pull servers offline before that would happen. And I know we made the right decision, even if it was a hard one to make.

Even though we are mostly hosted on WordPress.com, certain parts of the site are served from the Rackspace infrastructure. The outage prevented all our network sites from loading properly. Our email servers went down as well.

Everything seems to be back to normal, but it leaves me with one simple observation: our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.


  1. Lew Moorman, SVP Strategy, Rackspace Monday, November 12, 2007

    Om, we let you and many others down tonight. Bad luck or not, we failed to deliver what we promise. We also learned a lot about needing to communicate more in real time with customers. We are determined to earn back the trust lost tonight. We hope our customers, including you, give us that chance.

    lew

  2. Data Centers are expected to have redundant power sources and backup devices. A truck should not be able to knock off a data center – otherwise it is not designed or planned well enough!

  3. Well, our team was engaging key investors this Sunday/Monday and was also in the middle of our biggest outreach program since launch. And Rackspace let us down on Sunday morning for four hours (no server, email, nothing…emails bounced back! We basically didn’t exist!). And we were not even notified when it happened!

    And then, after 24 hours of me (CEO) explaining the situation to countless people…and assuring them that it was a rare one-off circumstance that would never happen again…IT HAPPENS AGAIN. Our server is still down right now.

    In all seriousness, this could destroy a business. Rackspace’s whole “zero downtime” guarantee has actually been almost 10 hours of downtime in the past 48 hours (not to mention GREAT costs to the credibility and revenues of many businesses out there including my team).

    What corners have they cut with back-up systems, generators, etc.!? Truly destructive.

  4. Well, get redundancy in data centers. The problem is that redundancy is nontrivial to implement on both the software side and the interconnection side, and it will cost. How much is your business worth to you? If you can’t do it properly, or it costs too much to do it yourself, host the site with people who have implemented redundancy for you: Google, Amazon Web Services, et al. As for the backup power supplies, if you don’t test, chances are the backup isn’t as redundant as you thought it was: batteries die, breakers don’t break, switches fail.

  5. Rackspace has always been, and still is, an extremely hyped service. Scratch the surface at Rackspace and there is no quality. If you have 2 servers then Rackspace is OK; otherwise their so-called support is not worth it. And now this amazing failure!

  6. James Galvin » Blog Archive » Rackspace Outage Tuesday, November 13, 2007

    [...] was covered by Laughing Squid, and made it onto a lot of big tech news sites such as TechCrunch, GigaOm, Valleywag, and O’Reilly Radar. 37 Signals and other well known web companies got wiped off [...]

  7. I’ve worked on and off with Rackspace for almost 7 years and true to their claim I’ve never faced serious downtime issues. They also have sat patiently while addressing problems during server migrations, etc. with my IT staff.

    Personally, I couldn’t imagine the embarrassment suffered by a CEO attempting to showcase their online business to investors only to find their server’s gone MIA. However, I also know that Rackspace’s 100% uptime guarantee comes with a solid SLA, one that in times like last week they will make good on.

    I wouldn’t let years of trustworthy service erode so quickly.

  8. Rackspace has shown with these outages that they are a marketing gimmick on steroids. A single truck hitting a power pole and taking out their data center shows their lack of redundancy planning.

  9. You can’t possibly expect 100% uptime for a single location, regardless of the redundancy built into that infrastructure.

    This is why multi-site architectures (failover or active-active) are used by everyone for whom downtime really matters. And it is also why the 100% guarantee for Rackspace is only 100% guaranteed to ensure you will have SLA refunds.

  10. Christian Schlatter Tuesday, November 13, 2007

    Doesn’t RackSpace have UPS, diesel engines, and stuff … power outages can happen all the time. We’re testing this scenario once a month in our server rooms.
