48 Comments

Updated: A few weeks ago, I had a healthy and civilized debate with the gents from 37Signals (followed subsequently with a podcast with questionable sound quality) on the issue of scale and scalability. I laid out my case, they responded with theirs and so we went.

Over the past 48 hours, there have been reports of Web 2.0 outages. Six Apart, one of the biggest blog service providers, is experiencing serious downtime, which has left quite a few influential bloggers in tears of rage. (Maybe that’s why it hasn’t made it to Memeorandum and Tailrank as yet?)

The TypePad application is currently unavailable, which means that users will not be able to log in, and visitors to weblogs will not be able to post comments. We are working to bring TypePad back online as soon as possible.

“Plus they’ve put up people’s blogs from last Friday, implying that this is when they last backed up – could they truly be this incompetent?” writes Russell Buckley, the mobile guru who co-authors Mobhappy with Carlo Longino. The company says it is deploying the back-up copies from 2 days ago, but Buckley disagrees and says, “This is not true. Our last post is from last Friday. That’s 7 days of data lost and I assume not backed up.”

TypePad has been growing so rapidly that it is finding out the hard way that scale and scalability matter. Are they the only ones?

Not really. Over the past few days, Bloglines, Feedster and WordPress.com have been behaving like a temperamental three-year-old with the flu.

(GrabPERF is a great site to keep tabs on the performance of these services.)

Why, even Photo Matt was offline. Here is what the Bloglines team had to say:

We’re not going to beat around the bush about this. Bloglines performance has sucked eggs lately. Why? In short, Bloglines has been busting at the seams like the Incredible Hulk. All of us here at Bloglines have been foregoing sleep and social lives over the past several months to keep Bloglines running and preparing for our move to a new access center (with bigger britches and a very elastic waistline).

Uptime data courtesy of GrabPERF.

Niall Kennedy says he will be parked outside the Six Apart offices and will bring regular updates on the big-blog meltdown. Bring him doggie bags, people. Here is a link to his interview with Anil Dash.

  1. [...] Note to Om: we have not been Memeorandum’ed because we can’t publish on our blogs! [...]

  2. Yeah, Typepad has basically shut down the business2blog today, and set it back to last Friday. This is extremely frustrating and completely unacceptable.

    Six Apart may be a “Web 2.0” company, but it is facing some very Web 1.0 problems. Sort of reminds me of eBay’s early outages. Assuming that Six Apart is built on cheap off-the-shelf servers, I guess now we’ll see whether cheap scales.

    In the meantime, anyone out there know which hosted blogging service has the most industrial-strength offering, or is this pretty much what we are stuck with?

  3. Great bit of information. There are two sets of numbers.

    Whenever average seek time is high, it means one of two things: the product is designed poorly, or the vendor does not have enough bandwidth (this one is easy to fix: $$$). On the other hand, if availability is below 99%, then most probably the service has a lot of bottlenecks in the design. Best advice: look for an alternate vendor.
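
To put the 99% threshold mentioned above in perspective, here is a minimal sketch that converts an availability percentage into hours of downtime. The 30-day window and the function name `downtime_hours` are assumptions chosen for illustration; they do not come from GrabPERF or any of the services discussed.

```python
HOURS_PER_DAY = 24


def downtime_hours(availability_pct: float, days: float = 30) -> float:
    """Hours of downtime implied by an availability percentage over `days`."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * days * HOURS_PER_DAY


if __name__ == "__main__":
    # Compare a few common availability targets over an assumed 30-day month.
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% over 30 days: {downtime_hours(pct):.2f} hours down")
```

Under these assumptions, 99% availability still allows a little over seven hours of downtime in a 30-day month, which is why a figure below 99% reads as a warning sign.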

  4. I would offer the problem isn’t that growth wasn’t considered from the start, rather that it hasn’t been *reconsidered* by some of these services frequently enough.

    Dare Obasanjo from Microsoft talks about how even the big guys have these issues from the position of a company that launched MSN Spaces and grew it to 3x LiveJournal in 1 year:

    “The fact is that everyone has scalability issues, no one can deal with their service going from zero to a few million users without revisiting almost every aspect of their design and architecture.”

  5. del.icio.us was down for about 5 hours the night before last also… I’m surprised nobody has talked about that. It crippled me since just about all of my bookmarks are there.

  6. I was going to mention the del.icio.us outage too — they said it would only be an hour, but it was several times that long. Something about moving servers or something — but it made me realize how much I use the damn thing.

  7. Typepad lost a few entries I had written as well, bigtime suckage.

    There’s a difference between a service being down for a little while and them actually losing your data. Absolutely ridiculous.

    Anil, you’re probably reading this, what are you guys going to do about the data loss?

  8. There are 2 reasons why I run my own blogging server (I use Blojsom): 1) because I am picky about where MY essays reside (my posts are mine and I don’t like them being somewhere else), and 2) because I have control of the information, etc.; if something goes wrong, well, it is my own fault.

    ceo

  9. Jason says “The fact is that everyone has scalability issues, no one can deal with their service going from zero to a few million users without revisiting almost every aspect of their design and architecture.” Sorry, I have to disagree. There are a lot of load and stress testing tools to check your scalability. Rewrite many times before you launch the product, not afterwards.

  10. Well, to be fair, Earthlink DSL for California was down all of yesterday. I’d rather have my browser give me a 404 error at a web 2.0 site than just show me the admin panel of my modem for 16 hours.

  11. “There are a lot of load and stress testing tools to check your scalability. Rewrite many times before you launch the product, not afterwards.”

    Sure, I can simulate 10,000,000 users using my product and break it, but it may take me 5 years to get that customer base. Why spend the time NOW to worry about something that *may* happen later? That takes time away from me improving my product in ways that benefit my customers NOW.

    Anyhow, I suspect we’ll just disagree eternally.

  12. Funny how the del.icio.us outage hit just a few short days after Yahoo bought them. If it was a server migration, you’d think they could notify users.

  13. First of all, I think it can be confirmed that typepad is not running backups that are 2 days old, but rather from 12/9.

    The other thing that I am finding especially irritating is that they should have some kind of banner notifying people viewing the blogs that they are looking at content that is not current. I had far too many emails this morning saying “hey your blog is a week old”.

    It is interesting to consider that the weak link in web 2.0 is in fact the hosting providers. When typepad goes down nobody can post which means memeorandum can’t accurately portray information amplitude.

    I’m done with Typepad, the performance has really sucked for the last 3 months but it’s another thing altogether when they essentially go offline and potentially lose my content. It’s been said that everyone has scalability problems, but the corollary to that is that scalability problems have been solved before so why are we still having them in these consumer services?

  14. jeff

    you bring up good points, but these problems are not exclusive to typepad. i know how frustrating this can be, but i think most other hosted services are going to have these issues as well.

    have you checked out squarespace.com

  15. I Can Post Today … Can You?

    And, yep, my posts since last Friday are still up. Are yours?

  16. The weak link in the whole web is the hosting providers, period. Redundancy is expensive, yes, but common sense needs to kick in at some point. It appears that they have one disk storage sub-system that went down last night. ONE???? How about an inexpensive clustering solution? How about a warm spare with a two-hour old snapshot? Not even talking about multiple data centers, which anyone who’s serious about uptime is going to be looking at. With the fees they’ve refunded already from the problems in November, they could probably purchase a mid-range SAN solution. I don’t pretend to know SA’s business metrics, but this has got to be costing them big time, both in dollars and in trust. Architecting a seriously reliable hosting model is HARD, believe me, but the paybacks are worth it.
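
The case for a warm spare or a cluster can be sketched with a little arithmetic. The example below assumes replicas fail independently and that failover is instantaneous, both of which are simplifications; the 99% single-system figure is hypothetical and is not Six Apart's actual number.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent replicas when any one is enough.

    `single` is the availability of one replica as a fraction (0.0 to 1.0).
    The service is down only when every replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0.0 and 1.0")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    # Hypothetical storage subsystem that is up 99% of the time on its own.
    for n in (1, 2, 3):
        print(f"{n} copies: {redundant_availability(0.99, n):.6f}")
```

In practice, correlated failures (shared power, shared network, the same bad maintenance procedure) make the real gain smaller than this idealized model suggests.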

  17. Seems you brought up a fairly timely topic recently, Om.

    I think the key in the capacity planning debate can be resolved through each individual service provider analyzing the following factors:

    1) Customer expectations
    What level of service reliability is expected by your customers? This is driven around factors such as if they pay for the service, how much they rely on it throughout the day, etc.

    2) Financial position and strategy
    Some smaller companies may like to stay lean and not risk unnecessary infrastructure investments

    3) What damage will outages do to customers’ brand loyalty

    4) If the site is down, do you lose revenue for every minute down? (eCommerce sites)

    There really is no black and white answer to this, but anyone would have to admit, frequent or extended service outages are totally unacceptable to most users.

    Of course, there are mathematical models and other somewhat complex exercises available from operations management science to use in these scenarios, but my guess is they are rarely relied upon.

  18. Hmmm I am seeing a lot of complaining and criticizing going on in here. Being someone who has actually had to run both an ISP and all the servers for said ISP the issue of scalability is an extremely complex problem to address.

    I see that someone here has the idea that simulating a “few million users” is something that is supposedly easy or simple. Easy is a highly subjective term depending on what kind of resources, time, and most importantly support for management one has. It is possible to break any web service or site if you throw a large enough load at it, period.

    What you are trying to do instead is run a system that has its load evenly balanced between all the parts that comprise it. Too much dependence on one part’s performance, and it doesn’t matter how good your other parts are if something breaks. See, the big issue is that these services are complex entities and usually all it takes is just one piece breaking to bring the rest crashing down. You could have a disk failure, or your DB software could tank, or the DB could not be able to talk to the SAN, or your web server could have problems talking to its backend. Or your load balancer could go tits up. Or your ingress router could not be able to handle the sustained packets per second of traffic. Or a backup process could be blocking write access to a critical file.

    All of these problems would result in the same kind of problem in the end: site unavailable. Safeguarding against any one of the potential problems is possible with time and money and people. Making sure all components are safeguarded is harder and more involved obviously.
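
A rough way to see why safeguarding every component matters: when a request has to pass through every part of the stack, the availabilities multiply. The sketch below assumes independent failures, and the component names and the 99.9% figures are hypothetical values chosen only for illustration.

```python
from math import prod


def chain_availability(availabilities) -> float:
    """Availability of a stack in which every component must be up.

    `availabilities` is an iterable of fractions (0.0 to 1.0), one per
    component. Failures are assumed to be independent of one another.
    """
    values = list(availabilities)
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("each availability must be between 0.0 and 1.0")
    return prod(values)


if __name__ == "__main__":
    # Hypothetical stack: each part is up 99.9% of the time on its own.
    stack = {
        "ingress router": 0.999,
        "load balancer": 0.999,
        "web server": 0.999,
        "database": 0.999,
        "SAN": 0.999,
    }
    total = chain_availability(stack.values())
    print(f"end-to-end availability: {total:.4%}")
```

Five components at 99.9% each already pull the end-to-end figure down to roughly 99.5%, so the weakest unprotected link tends to dominate.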

    However, the biggest issue is that testing failure modes is a massive pain in the ass. What are you going to do to simulate, say, 10 million customers hitting your site? Typically, the amount of equipment required to do testing on that scale costs as much as or sometimes even more than the system you are implementing! Hell, just look at how expensive SmartBits test equipment is for massive traffic loading.

    This doesn’t even go into the issue of whether or not a test was performed in conditions that would happen in the real world. For example, your server might take 10 million customers hitting its front page, but what if that was 10 million people hitting the site and all looking for something different? Totally changes what subsystems are stressed and to what levels.

    Typically, the best many folks can do is build it as best as they think they can and then throw it out into the real world and see what happens and then do tweaks to it as they learn how the system reacts to different inputs. The number of people out there who understand how to build a website capable of scaling to massive loads is small because it all depends on the kind of site and the services it provides and relies on and one has to become a subject matter expert in all the parts of the site so as to best design it.

  19. Just to be clear, despite displaying cached blog pages, there is no data lost on TypePad. We’ll be republishing the pages to bring them current now that the service is back up.

  20. Have things really not improved that much since Ebay was having all of their scalability issues five or so years ago? That’s kind of surprising to me.

  21. The biggest issue in my mind is whether Web 2.0 companies are going to survive their success or not. SixApart has known it had reliability issues for months, and it’s had funding to hire some top people to build a world-class infrastructure. Has it done so?

    Maybe… and maybe the company needs to hire its own Meg Whitman, someone who has experience running a much bigger company, to help them grow SixApart?

  22. Putting up my URL at this time is ironic – it’s on 6A TypePad. It is true that I can reach TypePad. It is not true to say that means sites have been updated when you hit ‘View Website.’ 6A explains this and that’s fair enough.

    The issue for me is this has been the case since around 0600 CET – that’s 9pm PT I think. Yet there was virtually no online coverage of the issue until around 1700 CET when The Register posted a quick thingy on it. A number of us in Europe were left scratching our heads with little clue as to what has been going on. We still don’t know.

    On the so-called blogosphere? Zippo. Nada. Nix. Nothing.

    MSM? Zippo again.

    So tell me this. Just how influential is this media? Really. Truly. Honestly.

    And on scalability – you’re right Om. I saw the red light when Canter and Ismail started talking about Structured Blogging a few days ago. In hindsight, I wish I’d listened a little more closely to the ‘oh-oh’ antennae.

    Anyone want to speculate how much this has shifted any ‘tide’ towards OSS back towards MSFT?

  23. TypePad goes awol

    TypePad has been down all day today. The updates on Everything TypePad were, of course, down with the rest of the system. I’m amazed that, right after a huge move to a new data centre, ‘During routine maintenance of our

  24. Tap Tap…Is This Thing On? — Typepad Fails Us, Again

    The last 24 hours have been massively disappointing as a Typepad user. Let me be clear…the Six Apart team has been very open and apologetic about the issues at play. From the most recent update: We want you to know that

  25. Om…kudos for bringing this issue to the forefront. Like hundreds of thousands of other bloggers, I was out of commish today and frankly, I’m livid.

    Apologies and contrition are nice, but that’s simply not acceptable given the magnitude of the outage. NOT GOOD ENOUGH…

    http://woodrow.typepad.com/the_ponderings_of_woodrow/2005/12/tap_tapis_this_.html

  26. Tough situation – I can see both sides of the argument regarding scaling.

    http://www.howradical.com/articles/2005/12/16/scaling

  27. There is no ‘two-sides’ to scalability. You either scale or you don’t.

  28. Okay so Sixapart don’t back up, but does Google? Any sad bloggers checked the cache, or maybe archive.org?

  29. Current Issues with TypePad Posted by Michael Sippey in On Typepad website

    “During routine maintenance of our network and storage systems last night, we experienced an issue with our primary disk system where data from published blogs are stored.”

    I’m guessing here, but it looks like the data store is not distributed. Every time something happens to that primary disk system, TypePad most probably will go down. This also might lead to scalability issues down the road if they acquire more customers.

  30. I have a modest proposal.

    Each time you, Om, or other prominent bloggers use that frickin’ empty phrase “Web 2.0,” you should donate $20 to a non-denominational, non-partisan charity. No limits. Use the phrase 4 times in one blog entry, owe $80. (but, in a moment of kindness towards you, I won’t hold you responsible for mentions in resyndication and such)

    The outages have absolutely *NOTHING* to do with anything Web 2.0’ish, Web 1.9’ish, Web 2.1713’ish, etc. A few *very visible* companies have had outages recently.

    I know, I know, that makes for a much less juicy title: “Many popular companies have recently experienced outages”… but it makes me retch a whole lot less than reading yet another “Web 2.0” headline.

    I mean, seriously, what next? “Web 2.0 causes divorces.” “Web 2.0 responsible for illiteracy in Namibia.” “Politicians ignore Web 2.0 issues in recent debate.”

    For crying out loud… can we just talk about companies on their own merits and stop trying to classify and rename and illogically group stuff?

    MUCH thanks in advance!!!

  31. Why spend the time NOW to worry about something that may happen later? That takes time away from me improving my product in ways that benefit my customers NOW

    Because it may well kill your business. Same way you plan cashflow. Typepad not taking scalability seriously and having a bulky and inefficient app – not to mention lacking the skills required to run a large-scale hosted service – will lose them more and more customers in the long run.

    As I mentioned in the other post, startup businesses have to be tight with server and co-lo providers because it is an essential part of any web business.

    Also, getting to 99.9999% reliability is more of an issue of skill rather than money. Do you think Microsoft solved the Spaces problems by throwing lots of money at it, or did they use their experience in running large-scale infrastructure to sort the problems out?

    Feedlounge is such a good example of this, poorly tested and poorly planned and now almost a year later still no product.

    Benchmark your applications early and plan your expansion based on that. Using ‘ab’ I can measure how many resources an app requires and plan based on that in less than an hour: keep pushing up the number of concurrent connections until your server maxes out, then work out how many hits on average come from each user per day and work out your peak times to arrive at a total number of hits per minute. Divide this by your ab results and you get an idea of how many servers you need. Use round-robin DNS, spread out your databases and replicate, then replicate again to ‘warm’ servers. If your app is maxing out a server with only 10 concurrent hits, then it is time to re-evaluate your architecture and how your application has been put together. Response times should be
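
The back-of-the-envelope estimate described above can be sketched as follows. The per-server limit would come from a benchmark run (for example with ApacheBench, where `-c` sets the concurrency and `-n` the number of requests); everything else here is an assumption for illustration, including the function name `servers_needed`, the `peak_factor` multiplier for busy periods, and the `headroom` safety margin, none of which appear in the comment itself.

```python
import math

MINUTES_PER_DAY = 24 * 60


def servers_needed(
    users: int,
    hits_per_user_per_day: float,
    peak_factor: float,
    server_max_hits_per_min: float,
    headroom: float = 0.7,
) -> int:
    """Estimate the server count from benchmark results and expected traffic.

    peak_factor: how much busier the peak minute is than the daily average
        (assumed value; measure it from real traffic logs where possible).
    server_max_hits_per_min: the point where one server maxed out in the
        benchmark, expressed in hits per minute.
    headroom: fraction of that maximum you are willing to run at (assumed).
    """
    if server_max_hits_per_min <= 0 or not 0 < headroom <= 1:
        raise ValueError("server capacity and headroom must be positive")
    average_hits_per_min = users * hits_per_user_per_day / MINUTES_PER_DAY
    peak_hits_per_min = average_hits_per_min * peak_factor
    usable_capacity = server_max_hits_per_min * headroom
    # Always provision at least one server, and round partial servers up.
    return max(1, math.ceil(peak_hits_per_min / usable_capacity))


if __name__ == "__main__":
    # Hypothetical inputs: 100,000 users, 50 hits each per day, peaks at 3x
    # the average, and a server that maxed out at 6,000 hits per minute.
    print(servers_needed(100_000, 50, 3.0, 6_000))
```

This only sizes the web tier under a uniform load; it says nothing about database or storage bottlenecks, which need their own measurements.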

  32. TypePad takes a break … again

    Six Apart’s problems are getting bigger than a spat at LesBlogs.
    They’ve grown bigger and bigger and their high profile means TypePad’s frequent dirtnaps have moved beyond the echo chamber and are now all over the pages of Forbes.

  33. Saturday Morning Maintenance
    “Flickr is having a massage.”
    http://blog.flickr.com/flickrblog/2005/12/saturday_mornin.html

  34. Is this a trend?

    “Due to some last minute server issues / unfortunate catastrophic failures / tsunami / hurricane / earthquake / bad sushi / and plain old shit hitting the fan a few of our boxes have had ‘unreliable’ service in the past few hours. We’re working on it, and should have things back to normal before too long. Yuck.”
    http://www.metroblogging.com/news/old/2005/12/server_outtage.phtml

  35. [...] There’s been no sign of a mass migration from either service. On the other hand, Six Apart isn’t alone with system failure this week. Om Malik points to a number of on-line services that disappointed users lately. Over past few days Bloglines, Feedster and WordPress.com have been behaving like a temperamental three-year-old with a flu. (GrabPref is a great site to keep tabs on the performance of these service.) [...]

  36. Buzznet.com is down for maintenance…

  37. [...] Om Malik has written an article called The Web 2.0 Hit By Outages. The large-scale TypePad failure is as mara reported, but Bloglines also showed degraded performance in the latter half of this week (indeed, its RSS sometimes failed to respond). wordpress.com, besides going down spectacularly in early November, also had a small outage last week. digg.com has also been behaving oddly at times lately. [...]

  38. I wonder how many people here bitching about outage are on free accounts? And how many have never had to reboot a live system because something got snarled up? Stuff happens, people – get over yourselves. No-one died, fer gossakes.

  39. Because it may well kill your business.

    No, it won’t. Online services *never* go out of business because of system scalability issues. They go out of business ALL THE TIME from not getting enough customers and not making money.

    It crippled me since just about all of my bookmarks are there.

    Would you listen to yourself? “Crippled” because your social bookmarks are down for a few hours???

  40. [...] Finally, Om Malik has a nice article on the Web 2.0 leader’s uptimes. Surprisingly, blog host startup WordPress.com (based off of the WordPress.org software) was second only to Typepad in the most downtime. Other notable Web 2.0 startups with abnormal downtimes were Feedster (88.13% availability) and Blogdigger (89.23% availability). [...]

  41. Web 2.0 Meltdown?

    Over the past few days Web 2.0 technologies have received a lot of press, and not all of it is good. Are Web 2.0 upstarts skimping on scalability in hopes to get their product out the door faster? I’ll take a look at some of the big names in the…

  42. [...] GrabPERF Link Love December 16, 2005 Om Malik gives GrabPERF some link love today. [here]   The GrabPERF datacenter is currently lit up like a Christmas tree.   Technorati Tags : GrabPERF, Web+performance, Om+Malik [...]

  43. [...] Yesterday, Om Malik gave GrabPERF some link love. [here] [...]

  44. [...] After the big blog blackout of December 2005, many people are looking for options. WordPress.com is adding nearly 500 new blogs a day, and more recently Mr. RSS Dave Winer has started to play in that sandbox. But that’s not all. WP also has snagged a deal with Yahoo for hosting, much like Six Apart. The much-awaited WordPress 2.0 is finally coming out from under the covers, and is chock-a-block with Ajaxy goodness. (Good timing, don’t you think Jeff!) Also just announced, pMachine’s ExpressionEngine 1.4, a premium product I have become a fan of lately. pMachine folks are also offering a new free version called ExpressionEngine Core. One thing, which I would like to point out: TypePad, despite last week’s problems, remains one of the easiest and simplest to use tools out in the market. Those who were affected by last week’s outage could easily take their data and port it to MT installed on Yahoo servers. (Pretty reliable, I presume!) [...]

  45. Why would anyone in business and fed up with the TP service choose to move it to what amounts to another ISP? Surely the better alternative would be to think about this as an opportunity to evaluate the landscape and think about their aspirations around this medium.

  46. I think the blogosphere is growing at a high speed and nothing is melting down. It’s becoming an alternative source of media. Whichever one prefers.

  47. I was going to mention the del.icio.us outage too — they said it would only be an hour, but it was several times that long. Something about moving servers or something — but it made me realize how much I use the damn thing.

  48. One of the issues Web 2.0 companies have in building solutions on the cheap is that they don’t plan for real scalability of their infrastructure. Which means that they melt down whenever their traffic/audience grows faster than their ability to add servers/gears.

