Architecting for Failure

In my last post, I outlined how difficult, and in practice nearly impossible, it is to provide true 99.999 percent uptime. And with some of the largest sites on the planet failing to deliver so-called “five nines,” is it even worth pursuing? I don’t think so, which is why I would like to propose an alternative called “architecting for failure.”
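
Five nines sounds modest until you do the arithmetic. Here is a quick back-of-the-envelope sketch (my own illustration, not anything from a real SLA tool) of how little downtime each level of “nines” actually allows per year:

```python
# How much downtime per year does each availability level permit?
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def allowed_downtime_seconds(availability: float) -> float:
    """Seconds of downtime per year allowed at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability)

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    minutes = allowed_downtime_seconds(availability) / 60
    print(f"{label}: {minutes:.1f} minutes/year")
```

Five nines leaves you barely five minutes a year for every failure, upgrade and mistake combined.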

Approaching availability at the architectural level is critical. Most designs are based on a multitenant architecture, which means they are oriented toward the convenience of the software vendor, not the customer. A distributed-tenant architecture, which allocates resources to customers according to their needs, is far more beneficial to the user. I believe that deciding how to support your customers is more important than deciding how to support yourself.

In the world of providing Web-based applications and related services, servers break down, disk drives fail, and equipment blows up. As a practical matter, you can invest heavily in trying to maintain a service level of 99.999 percent uptime, or you can plan for failure and invest in restoration time and resiliency. I believe in the latter.

In my opinion, there are seven critical areas of architectural preparedness that matter more than promises of 99.999 percent uptime:

  1. Scaling with new customer growth: Applications, users and accounts are all critical dependencies for architecting how you scale. What are you doing to meet the demand new customers put on your system?
  2. Scaling with existing customers: As existing customers add data, files and traffic, how do you rise to meet their demands? Many Web application providers scale for new users as they provision. But scaling your existing users as they add data is just as critical.
  3. Code distribution and updates: I have witnessed many decisions that are engineered toward the benefits of the vendor, at the expense of the customer. Have a system for pushing out updates that is beneficial to the customer, even if it is labor intensive.
  4. Resiliency and failover: How dependent is your solution on single points of failure that result in customer outages? Engineer your infrastructure so that a single point of failure does not impact all of your users or customers.
  5. Archival and restoration: Many developers treat backup as a matter of zipping up files. While that is fine for your code, it is not a plan for customer data, especially as your customers scale. If that is your plan, how easy is it to restore data as customers grow? How do you retrieve important information in the event of an outage?
  6. Queries and code contributions: Rapid development tools in the hands of sloppy coders (myself included) can lead to unstable production solutions. What do you do about it? Inefficient database queries and poorly implemented client-side scripting are big contributors to availability problems in Web-based applications.
  7. Cultural adoption, because it’s never over: It’s very easy to become complacent and lose your edge. It’s also easy to buy into industry best practices without validating your decisions. This reminds me of when a DNS outage caused the first system-wide outage in our nine-year history. DNS was always there, which made it easy to overlook. I would love to give that hour back to our customers. I can only encourage you to stay vigilant, learn from every disruption, and implement better solutions to manage the next one.
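
Point 4 above can be sketched in a few lines. This is a minimal illustration, not production code, and the names (`REPLICAS`, `fetch_from`) are hypothetical: instead of depending on a single endpoint, try each replica in turn so one failed machine does not become a customer outage.

```python
# Minimal failover sketch: try each replica until one answers,
# so no single point of failure takes down every request.
REPLICAS = ["app-1.example.com", "app-2.example.com", "app-3.example.com"]

def fetch_from(host: str) -> str:
    """Placeholder for a real network call; raises on failure."""
    raise ConnectionError(f"{host} is down")

def fetch_with_failover(replicas, fetch=fetch_from) -> str:
    """Return the first successful response, failing over on errors."""
    errors = []
    for host in replicas:
        try:
            return fetch(host)
        except ConnectionError as exc:
            errors.append(exc)  # record the failure, move to the next replica
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors}")
```

The design choice is the point: the caller plans on individual hosts failing and pays a small retry cost, rather than betting everything on one machine staying up.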

I call it “architecting for failure” because it’s based on the principle that failures happen and you need to deal with them. By designing for failure, you can provide a service that is of far greater benefit to your customers and users.

What’s better in your mind: 10,000 systems with one minute of outage each, or one system with a 10,000-second outage? (10,000 seconds is almost three hours.) I choose the first scenario because of the reduced impact on any individual customer’s availability. I believe we can do better, and customers will shy away from mass adoption until we do.
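
For the skeptical, here is the arithmetic behind that comparison (a quick illustrative calculation, nothing more):

```python
# Comparing the two outage scenarios from the paragraph above.

# Scenario A: 10,000 systems, each down for 1 minute
a_worst_outage_min = 1            # worst outage any one customer experiences
a_total_outage_min = 10_000 * 1   # total outage-minutes across the fleet

# Scenario B: 1 system down for 10,000 seconds
b_worst_outage_min = 10_000 / 60  # about 167 minutes, almost three hours
b_total_outage_min = 10_000 / 60

# Scenario A racks up more total outage-minutes overall, but no single
# customer is down for longer than a minute.
print(f"A: worst {a_worst_outage_min} min, total {a_total_outage_min} min")
print(f"B: worst {b_worst_outage_min:.0f} min, total {b_total_outage_min:.0f} min")
```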

What do you think?
