
Summary:

James Urquhart continues his look into whether companies sacrifice stability by designing systems that value adaptability over strict top-down command and control. This is called the stability-resiliency tradeoff and, he argues, many complex systems benefit from adaptability.

What is the “stability-resiliency tradeoff,” and does it really exist? This — the third post in my series on devops, complex systems, anti-fragility and cloud computing — will focus on that question, the one that prompted Phil Jaenke to write the blog post that inspired this series.

Phil is a friend and he has an extensive background in systems and data center administration. His history is one of surviving the IT operations battles of the last decade or so. He cringes every time I mention concepts like continuous deployment, devops and — especially — the stability-resiliency tradeoff.

What is the stability-resiliency tradeoff?

The best description of the stability-resiliency tradeoff I can find is from C.S. Holling and Gary K. Meffe in a 1996 paper entitled “Command and Control and the Pathology of Natural Resource Management”:

“We call the result ‘the pathology of natural resource management’ (Holling 1986; Holling 1995), a simple but far-reaching observation defined here as follows: when the range of natural variation in a system is reduced, the system loses resilience. That is, a system in which natural levels of variation have been reduced through command-and-control activities will be less resilient than an unaltered system when subsequently faced with external perturbations, either of a natural (storms, fires, floods) or human-induced (social or institutional) origin. We believe this principle applies beyond ecosystems and is particularly relevant at the intersection of ecological, social, and economic systems.”

In other words, the more you try to control — or stabilize — a system by reducing variance in the system, the less resilient you make the system to unexpected events. Command-and-control, top-down approaches to complex systems operations actually make those systems more fragile in those cases.
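The principle can be illustrated with a toy model (my own sketch, not from Holling's paper or this post): two systems of ten agents each, where an unexpected shock removes every agent whose tolerance falls below the shock's severity. The tolerance values and the shock level are arbitrary, chosen only to make the contrast visible.

```python
def survivors(tolerances, shock):
    """Return the agents that withstand a shock of the given severity."""
    return [t for t in tolerances if t >= shock]

# Command-and-control: variance squeezed out, every agent tuned to one value.
stabilized = [0.5] * 10

# Adaptive: natural variation preserved across agents.
varied = [0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2]

shock = 0.6  # a perturbation outside the range the system was tuned for

print(len(survivors(stabilized, shock)))  # 0 -> total failure
print(len(survivors(varied, shock)))      # 6 -> degraded but alive
```

The uniform system performs identically right up until a shock exceeds its single tuned tolerance, at which point it fails completely; the varied system loses some agents but keeps operating.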

Now, to be sure, for this principle to apply you must be talking about a complex system made up of many independent agents. Distributed applications are good examples of this. In fact, the more instances make up a distributed application, the more the principle applies.

And, of course, the entire portfolio of business and operational applications in a typical enterprise tends to create a higher-level complex system: Every agent (e.g., application, service or device) is connected to every other agent by just a few degrees of separation. (Just sharing a data center provides one common point of dependency for disparate software applications, for instance.)

Is there really a tradeoff?

Now, back to Phil. For him, stability is a specific concept, and he thinks many have lost their way trying to repudiate it:

[It's not an] AND versus OR argument. There’s a lot of folks who have gone completely overboard with this idea that if you don’t do continuous deployment, you’re doing it wrong. And the simple fact of the matter is that they’re wrong. IT is not a zero sum game, nor is it strictly OR operations. Most organizations don’t want or need continuous deployment. And many organizations (e.g. Google who likes to break their infrastructure at the expense of paying customers and products) are doing it completely wrong.

Phil goes on to explore the definitions of “stable” and “resilient,” and to argue that what you want is stability and resiliency, as they are critical to running a business. He argues that these new models are detrimental to achieving both.

Read the post, and think about what he says.

Stability and resiliency in different contexts

If you have read the other posts in the series, especially the last one, you know that I disagree with Phil's stance that "most organizations don't want or need continuous deployment." But as I read Phil's post a second time, I began to see a distinction that I hadn't considered before.

The stability described in Holling's paper comes from command-and-control approaches that squeeze variability out of a system. We see this in traditional approaches to architecture, shipbuilding and other disciplines, where absolute control over all design parameters of a complex structure is believed to be both possible and mandatory for safety.

The stability that Phil introduces is also about “efforts to reduce the number of shocks the process or system is subjected to,” but he’s looking at it from the perspective of ending up with something that works after a failure. Which is also what resilience is all about, right?

So why would Phil argue so adamantly that stability and resiliency are interrelated? Perhaps because, from a hardware perspective, they obviously are. At some level, stability is required for a CPU, or at the very least a transistor. Higher-order systems may build resiliency around many such stable (i.e., unchanging) components, but some degree of stability is desirable somewhere within every complex system.

It’s just that trying to arbitrarily limit variance in the complex system is detrimental to the system. The system becomes less resilient as a result.

How should IT consider stability and resiliency?

Ultimately, perhaps the design decision is as simple as:

  • If the system being designed is best operated as a complex adaptive system, with highly independent agents and a dynamic structure, then the focus should be on resiliency and variance. The various teams that own the agents, the relationships between agents and the goals the business has for that system should drive this.
  • If the system is instead fairly static and/or made of few agents, choose a more prescriptive, static design approach. Limit what agents can participate in the system, and hardwire the relationships between agents so change is less dynamic (or not dynamic at all).
  • Since complex adaptive systems can be made of static components, the IT organization should be prepared for the full range of approaches needed to operate everything in the system.

(Keep in mind that the IT portfolio, if large enough, is itself a complex system, so resiliency and variance should dominate at that scale.)
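The decision heuristic above can be sketched as a crude classifier (my framing, not the author's; the agent-count threshold of 10 is an arbitrary assumption for illustration):

```python
def operating_model(agent_count, dynamic_structure):
    """Suggest a design focus for a system, per the heuristics above."""
    if agent_count > 10 and dynamic_structure:
        # Complex adaptive system: favor resiliency and variance,
        # driven by the teams that own the agents and the business goals.
        return "resiliency-and-variance"
    # Few agents and/or a static structure: favor a prescriptive design
    # with hardwired relationships between agents.
    return "prescriptive-static"

print(operating_model(agent_count=200, dynamic_structure=True))
print(operating_model(agent_count=3, dynamic_structure=False))
```

Since a large system will usually contain subsystems of both kinds, the IT organization ends up needing both answers at once, which is the point of the third bullet.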

While I disagree with Phil about the validity of continuous integration and continuous delivery approaches for enterprise software projects, I do agree that IT itself doesn’t have a stability/resiliency tradeoff. The tradeoff exists solely for each system structure, and not necessarily for the agents of those systems themselves.

My next post will turn to practical matters, surveying sources of further information on these topics and suggesting a few steps that enterprises can take today with respect to devops adoption, resiliency and anti-fragility.

Thoughts? Comment below or find me on Twitter at @jamesurquhart.

James Urquhart is vice president of products at enStratus and a regular GigaOM contributor.

Feature image courtesy of Shutterstock user Olivier Le Moal.


  1. Devops, complexity and anti-fragility – meet smart grid.

    The North American power grid is perhaps the premier example of complex adaptive systems made of static components.

    Building sufficient situational awareness in an operational context remains one of the biggest challenges. Abstraction layers support interoperability and agility but also tend to hinder visibility into potential common-mode failures.

    @bryansowen

  2. James, I really appreciate your thoughts on this topic. A systems view is increasingly important.

    One underlying business question is who should bear the cost of service quality?

    I liked your concept of three tier ops: App ops, service ops, and infrastructure ops. http://news.cnet.com/8301-19413_3-20016550-240.html

    Why should IT bear the cost of sorting this issue out to achieve 99.999 uptime — if developers assume 99% service availability (in SOA architecture) and service managers assume commodity infrastructure and abstraction that enables service switching?

    1. If I understand your question, to be sure the cloud model delegates the resiliency-stability decision making to various actors in the application stack (including services and infrastructure). In fact, if you read closely, this is sort of what AWS concluded in the post-mortem of the Christmas Eve outage: http://aws.amazon.com/message/680587/ .

      “First, we have modified the access controls on our production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval. Normally, we protect our production service data with non-permissive access control policies that prevent all access to production data. The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated.”

      A little more “command and control” needed for that infrastructure…
