39 Comments

Update: A truck driver, because of a medical condition, drove into a power transformer in San Antonio, Texas, this evening, causing it to explode. That explosion caused a major power disruption, and the power company cut power in response, which ultimately took down the Dallas/Fort Worth-based data center of Rackspace, our hosting company. Rackspace is based in San Antonio. This is the second time in less than a week that they have had power issues. Rackspace made the following announcement:

Without notifying us the utility providers cut power, and at that exact moment we were 15 minutes into cycling up the data center’s chillers. Our back up generators kicked in instantaneously, but the transfer to backup power triggered the chillers to stop cycling and then to begin cycling back up again—a process that would take on average 30 minutes. Those additional 30 minutes without chillers meant temperatures would rise to levels that could irreparably damage customers’ servers and devices. We made the decision to gradually pull servers offline before that would happen. And I know we made the right decision, even if it was a hard one to make.

Even though we are mostly hosted on WordPress.com, certain parts of the site are coming off of the RackSpace infrastructure. This prevented all our network sites from loading properly. Our email servers went down as well.

Everything seems to be back to normal, but it leaves me with one simple observation: our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.

  1. Om, we let you and many others down tonight. Bad luck or not, we failed to deliver what we promise. We also learned a lot about needing to communicate more in real time with customers. We are determined to earn back the trust lost tonight. We hope our customers, including you, give us that chance.

    lew

  2. Data centers are expected to have redundant power sources and backup devices. A truck should not be able to knock out a data center – otherwise it is not designed or planned well enough!

  3. Well, our team was engaging key investors this Sunday/Monday and was also in the middle of our biggest outreach program since launch. And Rackspace let us down on Sunday morning for four hours (no server, email, nothing…emails bounced back! We basically didn’t exist!). And we were not even notified when it happened!

    And then, after 24 hours of me (CEO) explaining the situation to countless people…and assuring them that it was a rare one-off circumstance that would never happen again…IT HAPPENS AGAIN. Our server is still down right now.

    In all seriousness, this could destroy a business. Rackspace’s whole “zero downtime” guarantee has actually been almost 10 hours of downtime in the past 48 hours (not to mention GREAT costs to the credibility and revenues of many businesses out there including my team).

    What corners have they cut with back-up systems, generators, etc.!? Truly destructive.

  4. Well, get redundancy in data centers. The problem is that redundancy is nontrivial to implement on both the software side and the interconnection side, and it will cost. How much is your business worth to you? If you can’t do it properly, or it costs too much to do it yourself, host the site with people who have implemented redundancy for you – Google or Amazon Web Services et al. As for the backup power supplies, if you don’t test, chances are the backup isn’t as redundant as you thought it was – batteries die, breakers don’t break, switches fail.

  5. Rackspace has always been, and still is, an extremely hyped service. Scratch the surface at Rackspace and there is no quality. If you have 2 servers then Rackspace is OK; otherwise their so-called support is not worth it. And now this amazing failure!

  6. [...] was covered by Laughing Squid, and made it onto a lot of big tech news sites such as TechCrunch, GigaOm, Valleywag, and O’Reilly Radar. 37 Signals and other well known web companies got wiped off [...]

  7. I’ve worked on and off with Rackspace for almost 7 years and true to their claim I’ve never faced serious downtime issues. They also have sat patiently while addressing problems during server migrations, etc. with my IT staff.

    Personally I couldn’t imagine the embarrassment suffered by a CEO attempting to showcase their online business to investors only to find their server’s gone MIA. However, I also know that Rackspace’s 100% uptime guarantee comes with a solid SLA, one that in times like last week they will make good on.

    I wouldn’t let years of trustworthy service erode so quickly.

  8. Rackspace has shown with these outages that they are a marketing gimmick on steroids. A single truck hitting a power pole and taking out their data center shows their lack of redundancy planning.

  9. You can’t possibly expect 100% uptime for a single location, regardless of the redundancy built into that infrastructure.

    This is why multi-site architectures (failover or active-active) are used by everyone for whom downtime really matters. And it is also why the 100% guarantee from Rackspace is only 100% guaranteed to ensure you will have SLA refunds.

  10. Doesn’t RackSpace have UPS, diesel engines, and stuff? Power outages can happen all the time. We’re testing this scenario once a month in our server rooms.

  11. Om, did you follow the outage ordeal Cynthia Brumfield and others underwent with Navisite? Talk about fragility, and talk about killing a business.
    http://www.ipdemocracy.com/archives/002761rackspace_twohour_unforeseen_outage_is_nothing.php

  12. I was also down for 3 hours last night.

    They shut down our servers due to the heat at the datacenter. From what we understand:
    In the second incident at approximately 6:30 PM CST Monday, a vehicle struck and brought down the transformer feeding power to the DFW data center. It immediately disrupted power to the entire data center and our emergency generators kicked in and operated as intended. When we transferred power to our secondary utility power system, the data center’s chilling units were cycled back up. At this time, however, the utility provider shut down power in order to allow emergency rescue teams safe access to the accident victim. This repeated cycling of the chillers resulted in increasing temperatures within the data center. As a precautionary measure we decided to take some customers’ servers offline. These servers are now back up, as are the chillers.

    So it seems the redundant systems worked, power and all, but the chillers failed when they had to cycle them multiple times because of the accident victim.

    Although all of our servers and our image suffered, I can’t say enough good things about Rackspace and what they’ve done for us. I mean, with all my experiences with datacenters (esp. The Planet), they handled everything as well as I could ask for. They’ve gone above and beyond with any support request my team and I have had, and they are simply… Fanatical, as much as I can expect them to be.

  13. RackSpace Outage Hits Home

    This story has been submitted to Stirrdup. Your support can help it become hot.

  14. Mari,

    yes i did. i know she went through some tough times. we got lucky to be down for a little while, in comparison

  15. We have hosted with Rackspace for some years now, and in my experience they have been growing wayy too fast, so the experienced, high quality administrators from 2 years back are just not accessible anymore. Instead the administrators that are supporting you have very superficial knowledge of the systems they are supposed to manage. In times of trouble, the B-team (as we call them) are not very reliable and in some cases they just panic.

  16. Amazing… we had a site launch yesterday, it went down just an hour after it was launched. (With a really happy client seeing it going down)

    We do have two dedicated servers in there. The funny thing is that both servers ended up having fried hard drives, and Rackspace performed a restore on one of those with a faulty backup file… I mean, it could be safer to launch from my computer at home!!!!

  17. Can I just ask, as a human being, anyone know if the driver is OK? I see all this stress over downtime – but a man is involved in an explosion and I haven’t found one report as to his state of health!

    Unless I’m missing something huge here, it makes me sad people now care more about virtual products than physical people.

  18. Rackspace’s “zero downtime” is a lie, as is their fanatical support that is outright terrible. My neighboer told me to call ntt/verio. I have already taken my services over to them. My advice is to call them – NTT/Verio at 866-341-7867 and ask for Bruno.

  19. I have to honestly say that web site hosting (Rackspace et al.) hardly counts as Internet infrastructure IMHO… That said, the leaves (edges) of services are lined with single points of failure (services that are not redundant). But none of those services could reasonably count as infrastructure IMHO…

  20. I see all this negative hype that RackSpace is getting for a power outage caused by a guy (supposedly) having a heart attack at the time of the crash.

    I start thinking “How different is this incident than your household electricity shutting off when a lightning storm is in your neighborhood?” Sure, you get upset and immediately call the electric company because the outage disrupted your favorite TV show and you’ve been waiting for this episode for over a month.

    What are you going to do now, cancel service tomorrow and hook up with another electric service? Will that guarantee perfect service in a perfect world? Grow up!! Sh*t happens.

    When you get a flat tire, do you blame the highway department for letting debris get on the roads or do you jump right in and sue the tire manufacturer? Will that fix the flat? Give me a break.

  21. It’s like a butterfly flapping its wings and causing a tornado, only in the online world we can trace it to the event that started the ripple.

  22. I’ve been with Rackspace nearly 6 years and this is a first for me. Even still, I only lost one server (my other is with them in San Antonio) for only about an hour last night. Rackspace was responsive and things were back online reasonably quickly.

  23. The outage occurred in Dallas, not San Antonio (it’s the second sentence in the article.) Rackspace is based out of San Antonio and has data centers all over the place.

  24. Uptime: Serious Business Indeed!

    Link: RackSpace Outage Hits Home - GigaOM. No need to pile on here, but clearly this is an issue that is going to become more and more important going forward. It is not clear whether this could have happened

  25. Suddenly Leigh Anne wants his 99.999% advertisement removed :)
    Anybody who has worked in a datacenter knows that there is only so much you can get redundant without the cost rising.

  26. Matt: When I spoke to my account manager at Rackspace, I asked that very question. From what I understand, the driver is doing fine.

  27. This happened in Dallas not San Antonio!!

  28. Did anybody actually read Rackspace’s comments on what happened? The truck did not cause the outage. It was the power company.

    http://www.rackspace.com/information/announcements/datacenter.php

    6:30 PM CST Monday, a vehicle struck and brought down the transformer feeding power to the DFW data center. It immediately disrupted power to the entire data center and our emergency generators kicked in and operated as intended. When we transferred power to our secondary utility power system, the data center’s chilling units were cycled back up. At this time, however, the utility provider shut down power in order to allow emergency rescue teams safe access to the accident victim.

  29. A Voice of Reason Wednesday, November 14, 2007

    Dear Matt/Jim/Bruno,

    I’m truly impressed in the trust of your neighbor’s recommendation and your ability to negotiate a contract and migrate your services over to Verio in less than 24 hours.

    Next time you try to slam a competitor be sure you don’t leave your name and company URL in your signature file. Your post is full of lies (and spelling errors to boot).

    Nice try.

  30. Graham Weston, Chairman of Rackspace Wednesday, November 14, 2007

    Lew hit it right on the head Om. We let you and many of our customers down, and for that I am sorry. We will continually update our website here http://www.rackspace.com/information/announcements/datacenter.php so that customers can know what we are doing to fix the problem.

  31. We have hosted with Rackspace for 3 years and they have been fantastic. This outage really hit us hard though. It nuked the boot drive on a RAID array of our database server and our 200 customers were offline for 21 hours. We worked all night and all day to restore the DB environment. It has definitely shaken the trust of some of our customers. This could put a fragile company out of business.

  32. I need to add that placing all your trust in one place is dangerous. No one can predict or defend against every scenario. Rackspace is head and shoulders above any other hosting company I have used but that should never replace thorough and detailed disaster planning and testing. I am using this situation as an incentive to do just that.

  33. [...] looks like this is what happened with our host last night. They did fail over to the generator eventually, but they couldn’t do it instantly [...]

  34. [...] fall they suffered two outages due to power issues — the most colorful one was caused when a medically-incapacitated truck driver drove into a power transformer outside their data center. Today, 37signals suffered a two-hour Rackspace hardware outage in the [...]

  35. I was a big fan of Rackspace, always telling anyone who was happy to listen, and even those not willing to ;-))

    But November was the first loss of that 150% trust, and this week is the end of it. Our server was potentially compromised, so on the advice of a Rackspace engineer we decided to rebuild it.

    The rebuild took far too long, over a day, and then we realised that we had no backup after the 16th of January: 5 days with no backup, and yes, we do have Managed Backup with Rackspace.

    So now I am left with a server which is partially restored. Emails are back online, but we have lost 7 days of them, which is significant.
    And on top of that we have lost one very precious directory: the data was a reference with no other backup or copy, because it was confidential and was supposedly backed up.

    My question to Rackspace is how a Managed Backup problem can remain unnoticed for 5 days. I have told them this of course. How come there is no alert defined if the volume backed up is suddenly less than 50% of the normal volume?

    All this to say that I am actively looking at the moment for another host for a dedicated server, as my level of trust has reached bottom.

    I have actively defended Rackspace to our board of directors, but this time I can’t see what excuse I can find for this.

    Sorry guys at Rackspace, but not good enough ;-(

    Pascal

    PS: I am not working for any competitor of Rackspace, and I do not have any friend, family member or acquaintance at any competitor or related company. I say that in case someone thinks it might be the case.

  36. Oh, I forgot to mention that although I have a 2-week retention managed backup, it seems that 2 weeks in the Rackspace time dimension is only 10 days, as they cannot find any backup prior to the 14th. Considering that yesterday was the 23rd, I guess 14 days back would mean what? The 9th of January.
    So another mystery and another reason for me to worry and get a few more grey hairs, which suit me, but still…

    I will keep you posted about progress

    Pascal

  37. Seems that rackspace made a mistake and it was a costly one. If they have been a good company for so long though I think closing the book on them may not necessarily be the best solution. Mistakes happen I guess is all I am saying. Thanks for posting this article!

  38. Marla Brady Monday, June 9, 2008

    We have had several issues with Rackspace, including this outage. It was pretty much the last straw for us. We moved to Server Intellect and never looked back. I can and will always understand outages that are BEYOND the control of the host, but when you’re actively paying for backup services only to find out when disaster strikes that you have no backups, that is a problem.

  39. [...] GigaOM, Data Center Knowledge, Valleywag, [...]

