In the last few weeks, lengthy outages at Facebook and Foursquare left site users anxious and industry watchers curious about how web applications will maintain greater uptime. The more we rely on these services, the less we tolerate any interruptions, even if the service is free.

With such large public audiences to keep informed, most web companies publish detailed post-mortems on site interruptions. These are usually an attempt to appease impatient customers and share a lesson the company learned. The write-ups by Facebook and Foursquare provide a few business and technical lessons that we should note.

On the business side:

Customers expect companies to publish a post-mortem after something goes wrong. Transparency is a winning strategy, and companies that issue a quick, genuine apology and explanation earn the trust of their users and retain them through good times and bad.

Maintaining a status blog goes a long way. Foursquare launched theirs, as a Tumblr microblog, after their recent outage. Between that and a Twitter support address, they feel they have enough coverage on customer communications channels. The company’s responsiveness on these channels helps maintain customer loyalty.

On the technical side:

Pushing performance with caching always leaves you a bit exposed. Web companies use all types of caching strategies to keep data in fast memory as opposed to slower disk. But having a second copy in cache adds an extra variable that can catch you by surprise. In the case of Facebook’s outage, they had a system that tried to repair inconsistencies between the persistent storage and cache, but this health check itself failed, causing more harm than good. Caching isn’t going away, but the rise of flash-based storage solutions offers an opportunity to simplify the tiers.
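
As a rough illustration of the repair idea described above, here is a minimal Python sketch of a cache that is reconciled against its persistent store. It is not Facebook's system: the `RepairingCache` class, the plain dictionaries standing in for cache and storage, and the `max_repairs_per_pass` budget are all assumptions made for this example. The budget is one possible way to keep a repair pass from hammering the store when many entries look wrong at once.

```python
from dataclasses import dataclass, field


@dataclass
class RepairingCache:
    """Toy cache in front of a persistent store (both are plain dicts here)."""

    store: dict
    cache: dict = field(default_factory=dict)
    max_repairs_per_pass: int = 2  # hypothetical budget, not a real-world value

    def get(self, key):
        # Cache-aside read: serve from the cache, fall back to the store on a miss.
        # A key missing from both raises KeyError.
        if key not in self.cache:
            self.cache[key] = self.store[key]
        return self.cache[key]

    def repair_pass(self):
        """Fix cached entries that disagree with the store, up to the budget.

        Returns the list of keys repaired in this pass.
        """
        repaired = []
        for key in sorted(self.cache):  # sorted() copies the keys, so deletion is safe
            if len(repaired) >= self.max_repairs_per_pass:
                break  # budget spent; remaining entries wait for the next pass
            if key not in self.store:
                del self.cache[key]  # entry no longer exists in the store
                repaired.append(key)
            elif self.cache[key] != self.store[key]:
                self.cache[key] = self.store[key]  # the store is the source of truth
                repaired.append(key)
        return repaired
```

With three stale entries and a budget of two, the first call to `repair_pass()` fixes two keys and the second call fixes the third, so the store sees a bounded amount of repair traffic per pass.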

Sharding stinks, but it is a necessary evil. In the case of Foursquare, the primary issue was an overloaded database instance. Companies running large sites spread large data sets across many smaller databases, a process called sharding. Sometimes they will split data by user ID: A through E in database 1, F through J in the next. When Foursquare found that one instance was overloaded with check-ins and attempted to move some of that instance's load elsewhere, the entire site was adversely affected.
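
The range-based split described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Foursquare's actual scheme: the letter ranges extend the article's A through E example, and the shard names (`db1` to `db5`) and the functions `shard_for` and `load_by_shard` are hypothetical.

```python
from collections import Counter

# Hypothetical shard map: (first letter low, first letter high, shard name).
SHARD_RANGES = [
    ("a", "e", "db1"),
    ("f", "j", "db2"),
    ("k", "o", "db3"),
    ("p", "t", "db4"),
    ("u", "z", "db5"),
]


def shard_for(user_id: str) -> str:
    """Return the shard whose letter range covers the first character of user_id."""
    first = user_id[0].lower()  # an empty user_id raises IndexError
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard covers user ID {user_id!r}")


def load_by_shard(check_ins):
    """Count check-ins per shard, given an iterable of user IDs (one per check-in)."""
    return Counter(shard_for(user_id) for user_id in check_ins)
```

Feeding a list of check-ins through `load_by_shard` makes the weakness of fixed ranges visible: if most active users happen to fall in one range, that shard carries most of the load while the others sit idle.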

Foursquare uses MongoDB, a “document database” that also falls into the NoSQL category. One of the themes behind the NoSQL movement is scale, and while this kind of event should not be misinterpreted, it does raise the question of what might be needed to improve newer datastores. For those who want to dig one level deeper, there’s a post on the MongoDB user group from 10gen, the company behind MongoDB.

It is great to see leaders like Facebook and Foursquare devote attention to these post-mortem write-ups. The collaborative nature of supporting companies helps as well. No doubt the web has a long way to go before it matures, but these folks are showing us how to get there.

Gary Orenstein is the host of The Cloud Computing Show.

Photo courtesy of Flickr user tanakawho.

    1. Here’s my take:

      Facebook is HUGE and is expected to have issues from time to time.

      Foursquare is small and is NOT expected to have massive outages like the one they had.

      1. An argument against your theory is that because Facebook is huge, it can afford to hire the best and brightest and should not have massive outages. Foursquare is small and can’t keep up with its sudden popularity.

    3. Great article with tips around how to architect these services. However, I think a key point being missed here is TESTING.

      Testing solutions exist today that can help these services prepare for demand spikes coming from an ever increasing variety of end user platforms. The testing solutions are agile and can help both proactively identify bottlenecks, as well as help app and service providers more quickly recover from production issues.

      A video of one such solution is shown here in the context of scale testing a CouchDB application: http://bit.ly/bi6xpO
