As I write this, Amazon’s S3 storage servers have been unreachable for 90 minutes. As in February, the last time this happened, the outage shows up as chunks of Web 2.0 dropping off: for many people the most visible sign of trouble was the sudden vanishing of pictures from Twitter. Poke around a bit, though, and you can find plenty of other services that are gasping for air right now. And that’s not counting those of us who use S3 personally, for things like backups via Jungle Disk or the equivalent.

Amazon learned from the last outage that transparency is a must. If you visit the Service Health Dashboard, you can see that they know about the outage and are “pursuing corrective action” – though they have not yet announced an ETA for a fix. At least we know they’re working on it, though that’s cold comfort for startups that built their businesses around S3. Fortunately Amazon’s EC2 cloud is unaffected, so we’re not seeing swathes of servers vanish from the net.

Amazon does offer an SLA for the S3 service, guaranteeing 99.9% uptime or part of your money back. With 0.1% of a 30-day month being about 43 minutes, a 90-minute outage means they owe people money. The requirements for claiming a refund, though, are onerous enough that no one except large users will bother (hey, Amazon, how about an automatic refund when you know your servers are down?).
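
The arithmetic is easy to check. Here is a minimal sketch, assuming a 30-day month and treating the 90-minute outage as the only downtime in the billing period. The function name is made up for illustration, and the sketch makes no claim about how Amazon formally measures uptime under the SLA.

```python
def downtime_budget_minutes(uptime_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month that still satisfy the uptime percentage."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - uptime_pct) / 100.0


budget = downtime_budget_minutes(99.9)
outage_minutes = 90  # duration reported at the time of writing
sla_breached = outage_minutes > budget

print(f"Allowed: {budget:.1f} min, observed: {outage_minutes} min, breached: {sla_breached}")
```

Under these assumptions the outage has already used up roughly twice the monthly allowance.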

With two relatively serious outages in the space of 6 months, some will ask: why depend on S3? The answer is simple: the rates are hard to beat, especially for a service that doesn’t require any sysadmin budget. The fact remains that no other giant has started offering commodity storage at similarly attractive prices, though it’s obvious that Google or even Microsoft could get into the game if they wanted to.

As long as Amazon has a virtual monopoly on cheap, distributed, fast, API-accessible storage, web startups will continue to depend on it. If you need to offer your customers better than 99.9% uptime for some reason, though, it’s clear that you need a backup plan for those times when the Amazon infrastructure is having issues. Fortunately, most of us can live without uptime above that level.
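
One shape such a backup plan could take is a second copy of your files at another provider, plus a switch in your application that decides which host to serve from. The sketch below is only an illustration under stated assumptions: the hostnames are placeholders, `pick_base_url` and `asset_url` are hypothetical helpers, and the health check is passed in as a function because how you detect an outage (a probe request, a manual flag) is up to you.

```python
from typing import Callable, Sequence

# Placeholder hostnames: primary storage first, backup copy second.
PROVIDERS = (
    "https://pictures.example.com",
    "https://picturebackup.example.com",
)


def pick_base_url(providers: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first provider that passes the health check.

    If none passes, fall back to the last provider in the list.
    """
    for base in providers:
        if is_healthy(base):
            return base
    return providers[-1]


def asset_url(path: str, providers: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Build the public URL for a stored file on whichever provider is up."""
    base = pick_base_url(providers, is_healthy)
    return f"{base.rstrip('/')}/{path.lstrip('/')}"


# Simulate an outage of the primary provider.
down = {"https://pictures.example.com"}
print(asset_url("/img/1.jpg", PROVIDERS, lambda base: base not in down))
```

This only covers reads. New uploads made during an outage would still have to be copied back to the primary before switching over again.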

  1. How about switching to nirvanix.com? Does anyone have some experience with them?

  2. I’d be curious as to what others may use for a backup solution. I think this is the kick I need to go find something that I can “switch” to during times like this. We aren’t huge by any means, but we do pay around $100 per month to Amazon. I’d be willing to pay some more to someone else in order to have a similar setup if needed.

    So, I could have pictures.example.com point to Amazon and picturebackup.example.com point to this other provider.

    So, does anyone know who this “other” provider could be?

  3. It’s safe to presume that S3 and related services are going through a phase of rapid expansion – hitting walls as they push out further and introducing defects as their environment changes. All new systems go through a similar adolescent phase so I’m not extrapolating this into a trend, yet.

    I agree SLAs are not useful, as they don’t present expectations accurately and the recourse is always insignificant in comparison to the loss incurred. Businesses make decisions based on reputation, e.g. “how good are you” is more useful than “how good do you promise you’ll be”.

    Anyway, S3 et al better be scrambling to tweak their architecture or I’m switching to Plan B.

  4. Right, but what is “Plan B”??

  5. We did a lot of testing with S3 vs Nirvanix vs internal and overall Nirvanix scored quite well. They were a viable option.

    We went with the big name, however, because S3 had the performance edge and a stable API. Nirvanix was relatively new (still in beta) and seemingly still making a lot of changes (enhancements) to their service. Nirvanix did, however, have more features that were useful to us. They seemed to have an issue with URL redirection that was causing delays (a big issue if you are serving up a lot of small files). We also could not get their desktop client working on any of our systems (it seemed to be not well tested / distributed). We felt that if they loaded up quickly with customers, we might be in for a rough ride as they tried to scale (only a perception).

  6. You called it: Nirvanix, or we have the option to scale internally too.

  7. In looking at Nirvanix, I think we might as well host our files there as a backup at the very least. We could then just “throw a switch” at a time like this to start serving files from there.

    Our main cost is with transfer (not storage), so I think this would be very cost effective.

    I guess we would need to make sure all new uploads went to the provider we “threw the switch to” and then make sure everything is in sync before going back to S3.

  8. Our service (PhotoShare, a photo sharing application for iPhone) also uses S3, and we had a very hectic Sunday. Here is a quick summary.

    Amazon S3 outage affected PhotoShare too

  9. With regard to Nirvanix, their service works great and they offer customer service, so when I have a problem I can call them up, unlike S3, where they respond to customers via a blog. Very customer-friendly, right?

    Doesn’t it seem like S3 goes down every few months? If you’re a startup, that could cost you major $$, and any hiccups in an early-stage company can be huge.

  10. Example:

    “As part of the constant effort to provide customers with cutting-edge features and technologies, the Nirvanix team continues to expand its global Storage Delivery Network (SDN), and is making great strides towards simplifying cloud storage. With this in mind, Nirvanix has a new release scheduled to be put into production on 7/23/08 (Wednesday). The time window for the update is 10:00 am – 4:00 pm Pacific Standard Time.”

    “Please be aware that during this designated upgrade window you may experience intermittent slowness, or delayed system response within our network. We know that the day-to-day operations of your business are critical, so we have set aside this timeframe in an effort to reduce any potential effects on your processes and procedures.”

